mkabbasi / cleartk

Automatically exported from code.google.com/p/cleartk

consider switching to UIMA resources for classifiers, etc. #393

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
We should talk about whether or not we should switch over to the UIMA way of 
doing things for stuff like CleartkAnnotator:

http://mail-archives.apache.org/mod_mbox/uima-user/201311.mbox/%3CB87CC687-68E8-47C8-91B0-78F8DBEBCBC4%40apache.org%3E

Switching to the UIMA way would mean that instead of:

AnalysisEngineFactory.createPrimitiveDescription(
    ExamplePOSAnnotator.class,
    CleartkSequenceAnnotator.PARAM_IS_TRAINING,
    true,
    DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
    outputDirectory,
    DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
    MalletCRFStringOutcomeDataWriter.class);

We would do something like:

AnalysisEngineFactory.createPrimitiveDescription(
    ExamplePOSAnnotator.class,
    CleartkSequenceAnnotator.PARAM_IS_TRAINING,
    true,
    CleartkSequenceAnnotator.PARAM_DATA_WRITER_FACTORY,
    ExternalResourceFactory.createExternalResourceDescription(
        DefaultSequenceDataWriterFactory.class,
        DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER,
        ExternalResourceFactory.createExternalResourceDescription(
            MalletCRFStringOutcomeDataWriter.class,
            MalletCRFStringOutcomeDataWriter.PARAM_OUTPUT_DIRECTORY,
            outputDirectory)));

This is a bit more verbose, but it does make it clearer where the configuration 
parameters come from, since they're all scoped by their external resource 
grouping. Also, with this approach, if you load the same classifier in more 
than one place, it will only be loaded once if you use the same 
`ExternalResourceDescription` in both places. (But using the same classifier 
twice is probably uncommon.)
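
To make the load-once behavior concrete, here is a minimal plain-Java sketch. It is only a model of the idea, not UIMA's implementation: `ResourceDescriptionSketch`, `ResourceManagerSketch` and `SharedResourceDemo` are hypothetical names, and the cache keyed by description identity is an assumption chosen for illustration.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in (NOT the UIMA API): a description that knows how to
// build its resource when asked.
final class ResourceDescriptionSketch<T> {
    private final Supplier<T> loader;

    ResourceDescriptionSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    T load() {
        return loader.get();
    }
}

// Hypothetical manager: builds the resource for each description at most once.
final class ResourceManagerSketch {
    // Keyed by identity, so the same description object maps to one instance.
    private final Map<ResourceDescriptionSketch<?>, Object> cache = new IdentityHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T get(ResourceDescriptionSketch<T> description) {
        return (T) cache.computeIfAbsent(description, d -> d.load());
    }
}

final class SharedResourceDemo {
    /** Returns how often the expensive loader ran after two lookups. */
    static int countLoads(boolean shareDescription) {
        AtomicInteger loads = new AtomicInteger();
        Supplier<String> expensiveLoad = () -> "model#" + loads.incrementAndGet();

        ResourceManagerSketch manager = new ResourceManagerSketch();
        ResourceDescriptionSketch<String> first = new ResourceDescriptionSketch<>(expensiveLoad);
        ResourceDescriptionSketch<String> second =
            shareDescription ? first : new ResourceDescriptionSketch<>(expensiveLoad);

        manager.get(first);   // e.g. requested by one annotator
        manager.get(second);  // e.g. requested by another annotator
        return loads.get();
    }
}
```

In this model, `SharedResourceDemo.countLoads(true)` runs the loader once, while two separate descriptions with identical settings still trigger two loads.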

If we went this route, it would require some substantial changes. 
`DataWriterFactory`, `DataWriter`, `ClassifierFactory` and `Classifier` would 
have to implement `SharedResourceObject` instead of `Initializable`. Note that 
this would involve removing all the `XXX(File)` constructors in `DataWriter`s, 
and adding an `initialize(UimaContext)` method to `DataWriter_ImplBase`.
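
The reason the `XXX(File)` constructors would have to go can be sketched in plain Java: if the framework creates the object itself, it needs a no-argument constructor and a separate initialization step. Everything below is a hypothetical stand-in. A `Map` plays the role of `UimaContext`, and `DataWriterSketchBase`, `CrfDataWriterSketch` and `FrameworkSketch` are made-up names, not ClearTK or UIMA classes.

```java
import java.io.File;
import java.util.Map;

// Hypothetical base class: configuration arrives via initialize(...),
// not via a constructor argument.
abstract class DataWriterSketchBase {
    private File outputDirectory;

    // The Map stands in for UimaContext in this sketch (an assumption).
    void initialize(Map<String, Object> context) {
        this.outputDirectory = new File((String) context.get("outputDirectory"));
    }

    File getOutputDirectory() {
        return outputDirectory;
    }
}

// Note: no (File) constructor, only the implicit no-argument one.
final class CrfDataWriterSketch extends DataWriterSketchBase {
}

final class FrameworkSketch {
    /** Instantiates the writer reflectively, then hands it its configuration. */
    static <T extends DataWriterSketchBase> T create(Class<T> type, Map<String, Object> context) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            instance.initialize(context);
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create " + type.getName(), e);
        }
    }
}
```

A writer that only offered a `(File)` constructor would fail at the `getDeclaredConstructor()` call here, which is the kind of breakage the API change would have to avoid.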

Of course, this would be backwards incompatible, and a change to some of the core 
ML APIs. So either we do this for 2.0, or we don't do it until 3.0.

Original issue reported on code.google.com by steven.b...@gmail.com on 16 Nov 2013 at 6:08

GoogleCodeExporter commented 8 years ago
It seems like the main reason for using external resources has to do with the 
ability to share an expensive resource between analysis engines.  If that is an 
uncommon scenario in a CleartkAnnotator, then why should we make such a big API 
change?  Perhaps it would make sense for TF-IDF data or a language model.  But 
I'd hate to see this hang up a 2.0 release.  

Original comment by phi...@ogren.info on 8 Dec 2013 at 3:23

GoogleCodeExporter commented 8 years ago
It is a matter of separating concerns. With the external resources, you can 
define parameters where they belong and set them where they are defined. No 
need to manually tunnel them through any AEs. 

The approach of prefixing parameter names with the class name does not work in 
cases where you want to pass the same class twice to an AE but with different 
parameters. 
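
The collision can be shown with a small plain-Java sketch. It is not uimaFIT code: `PrefixedParameterDemo`, `LexiconSketch`, the `prefixed` helper and the parameter name `path` are all hypothetical, chosen only to contrast one flat parameter map with per-instance scoping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixedParameterDemo {
    // Hypothetical component that an AE might want to use twice.
    static final class LexiconSketch {
    }

    // Hypothetical helper mirroring the "ClassName.parameterName" scheme.
    static String prefixed(Class<?> owner, String name) {
        return owner.getName() + "." + name;
    }

    private static Map<String, Object> params(String key, Object value) {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put(key, value);
        return map;
    }

    /** Flat scheme: every parameter lives in one map on the AE. */
    static Map<String, Object> flatParameters() {
        Map<String, Object> parameters = new LinkedHashMap<>();
        parameters.put(prefixed(LexiconSketch.class, "path"), "first.txt");
        // Same class, same prefixed key: this overwrites the first setting.
        parameters.put(prefixed(LexiconSketch.class, "path"), "second.txt");
        return parameters;
    }

    /** Scoped scheme: each resource instance carries its own parameters. */
    static Map<String, Map<String, Object>> scopedParameters() {
        Map<String, Map<String, Object>> resources = new LinkedHashMap<>();
        resources.put("firstLexicon", params("path", "first.txt"));
        resources.put("secondLexicon", params("path", "second.txt"));
        return resources;
    }
}
```

In the flat map only `second.txt` survives, whereas the scoped version keeps both configurations because each one is attached to its own resource.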

We discussed this at greater length in 
http://code.google.com/p/uimafit/issues/detail?id=70

Original comment by richard.eckart on 8 Dec 2013 at 6:46

GoogleCodeExporter commented 8 years ago
By the way, this should also eventually lead to addressing 
https://code.google.com/p/uimafit/issues/detail?id=7 

Original comment by richard.eckart on 8 Dec 2013 at 6:51

GoogleCodeExporter commented 8 years ago
I believe you already switched to uimaFIT 2.0.0. Did that break this 
Initializable stuff for you? I suppose you could go on using the 
classname-prefixed fields and Initializable, but the factory which was used to 
create the parameter names would need to go to ClearTK, because it has been 
removed from uimaFIT 2.0.0. In fact, I would also like to remove the whole 
Initializable stuff at some point. If the current way of using the external 
resources doesn't meet your approval, we should continue to discuss the 
requirements. I have ideas for further improving that, but currently no time to 
work on it.

Original comment by richard.eckart on 8 Dec 2013 at 6:56

GoogleCodeExporter commented 8 years ago
We did switch to 2.0.0 and we switched to the simple parameter naming scheme. 
But we still use Initializable.

Original comment by steven.b...@gmail.com on 8 Dec 2013 at 10:48

GoogleCodeExporter commented 8 years ago

Original comment by phi...@ogren.info on 15 Mar 2014 at 5:43