Closed GoogleCodeExporter closed 9 years ago
I like the getters/setters and am not worried about whether or not an extension
is going to directly access the map or not (we could make the map private,
too.) I don't like the idea that we have to initialize all of the data writers
and classifiers in the initialize method. If you train on 10M instances (maybe
you are doing something semi-supervised) and test on 1K, then you might find
yourself loading 100 classifiers that are never used for your 1000 test
instances.
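A minimal sketch of the lazy alternative, assuming a map keyed by name that is filled on first request. `NamedClassifier` and `LazyClassifierMap` are hypothetical stand-ins for illustration, not ClearTK classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a ClearTK classifier; not the real interface. */
interface NamedClassifier {
    String classify(String features);
}

/** Creates classifiers on first request instead of eagerly in initialize(). */
class LazyClassifierMap {
    private final Map<String, NamedClassifier> classifiers = new HashMap<>();
    private final Function<String, NamedClassifier> loader;

    LazyClassifierMap(Function<String, NamedClassifier> loader) {
        this.loader = loader;
    }

    /** Loads the classifier for this name only if it has not been loaded yet. */
    NamedClassifier getClassifier(String name) {
        return classifiers.computeIfAbsent(name, loader);
    }

    /** Number of classifiers that have actually been loaded so far. */
    int loadedCount() {
        return classifiers.size();
    }
}
```

With this shape, a test run that only ever asks for a handful of names pays the loading cost for those names alone.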
One possible name is org.cleartk.classifier.multi.CleartkAnnotator (note
package name). Or org.cleartk.classifier.multi.MultiCleartkAnnotator. Either
way, I think it makes sense to put this code in its own package.
Original comment by phi...@ogren.info
on 3 Feb 2011 at 6:41
I like this package/name combination best
org.cleartk.classifier.multi.MultiCleartkAnnotator so that people don't look at
examples and confuse CleartkAnnotator with this new code.
To do this correctly, it looks like I will probably need to write a new
JarClassifierFactory that accepts an output directory and a name instead of just
a single path variable. Would these companion factories go in
org.cleartk.classifier.multi, org.cleartk.classifier.jar or some new place?
Also, I will need to add a createClassifier(String name) method to some
ClassifierFactory interface. Do I add to the existing ClassifierFactory
interface or spin my own MultiClassifierFactory?
Original comment by lee.becker
on 3 Feb 2011 at 8:08
I agree that it is nice to have different names.
I think a new MultiClassifierFactory should go in the 'multi' package.
Your new JarClassifierFactory should probably go in a multi.jar package - but
should still reuse as much as possible from the jar package.
Original comment by phi...@ogren.info
on 3 Feb 2011 at 8:58
I am in the process of converting CleartkAnnotatorTest into unit tests for the
CleartkMultiAnnotator. However, I'm unsure about what to do with the test
known as testDescriptor().
In CleartkAnnotatorTest, it receives ResourceInitializationExceptions that are
thrown by the initialize() method because of a missing output directory
or classifier jar file. However, with CleartkMultiAnnotator, the dataWriters and
classifiers don't get created or initialized until something calls
getClassifier(name) or getDataWriter(name).
Should I be trying to manually check these conditions in the initialize()
method? If so, how would I do that?
Alternatively, I can alter the unit test to check that this exception is thrown
after a call to getClassifier or getDataWriter.
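The second option can be pictured with a small sketch in which initialize() does no file access and the missing-file failure only surfaces on first lookup. `DeferredLookupAnnotator`, `getClassifierJar` and the `<directory>/<name>/model.jar` layout are all assumptions made for illustration, not the ClearTK API:

```java
import java.io.File;
import java.io.FileNotFoundException;

/** Hypothetical annotator that defers the jar lookup until a classifier is requested. */
class DeferredLookupAnnotator {
    private final File classifierDirectory;

    DeferredLookupAnnotator(File classifierDirectory) {
        this.classifierDirectory = classifierDirectory;
    }

    /** Nothing is loaded here, so a missing jar cannot be detected yet. */
    void initialize() {
    }

    /** Resolves <directory>/<name>/model.jar (assumed layout) and fails if it is absent. */
    File getClassifierJar(String name) throws FileNotFoundException {
        File jar = new File(new File(classifierDirectory, name), "model.jar");
        if (!jar.isFile()) {
            throw new FileNotFoundException("No classifier jar at " + jar.getPath());
        }
        return jar;
    }
}
```

A unit test written against this shape would call initialize() without expecting a failure and then assert that the exception is raised by the lookup call.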
Original comment by lee.becker
on 4 Feb 2011 at 4:33
As I continue to plow down this path, I'm hitting some roadblocks when it
comes to passing in a dataWriterFactory. I wrote my own class derived from
CleartkMultiAnnotator and specified the DefaultBinaryMalletDataWriterFactory
for my DataWriterFactory, only to find that all of my instances were getting
put into a single training-data.mallet file.
While I can specify the PARAM_OUTPUT_DIRECTORY as a configuration parameter, I
have no way to update this within the getDataWriter() method as there is no
method within DataWriterFactory to change the pre-specified output directory.
I believe this is less of an issue with classifiers, because I wrote a new
class called JarMultiClassifierFactory with a method createClassifier(name),
which can specify the path as needed. I could probably do something similar
and write a MultiDataWriterFactory, but then I would need to write a Multi
version of every DataWriterFactory. Alternatively, I could make it use a
DirectoryDataWriterFactory, but that would prevent CleartkMultiAnnotator from
working with several of the other types of DataWriters.
Hopefully one of you true ClearTK sages can point me in the right direction.
Original comment by lee.becker
on 5 Feb 2011 at 5:10
I see two solutions, neither of them perfect.
You could do something like what ViterbiDataWriterFactory does - cast its
UimaContext to a UimaContextAdmin, set the output directory for the data writer
you're delegating to, and then restore the original output directory when
you're done. This is a little hacky because you aren't really supposed to use
the UimaContextAdmin interface.
Alternatively, you could add getOutputDirectory and setOutputDirectory methods
to DirectoryDataWriter. Then in CleartkMultiAnnotator, you'd just cast your
DataWriter to DirectoryDataWriter and set the output directory that way. This
is a little hacky because the cast means it won't work with general DataWriters
(though it will probably work with all the ones you care about...)
Original comment by steven.b...@gmail.com
on 5 Feb 2011 at 9:16
Not really able to discern which is the lesser of two evils, I am inclined to
use the second solution, but I think it's even hackier than you described. For
many dataWriters, much of the configuration takes place in the constructor.
Even if I were to add a setOutputDirectory() method, the call would occur too
late to have any effect. Playing with the code a bit, I've found that I can
do an unchecked cast on my DataWriterFactory to a DirectoryDataWriter and then
call its setOutputDirectory prior to making calls to createDataWriter().
Again this approach is hacky because it means the class won't work with general
DataWriterFactories and consequently general DataWriters.
Original comment by lee.becker
on 5 Feb 2011 at 7:53
I thought about this a bit more, and I'm a little less worried about it not
working with general DataWriterFactories because the moment you refer to
PARAM_OUTPUT_DIRECTORY (which you *have* to do for your use case), you're
already assuming a DirectoryDataWriterFactory. So I think casting to one is
fine. Just document somewhere that this code assumes a
DirectoryDataWriterFactory or subclass.
Original comment by steven.b...@gmail.com
on 6 Feb 2011 at 9:37
I do not see any reason that CleartkMultiAnnotator has to know anything about
the data writer factory except that it returns a data writer for a given
type/label/name for the map of data writers. The data writer factory for
CleartkMultiAnnotator is going to be a new interface with one method:
public DataWriter<OUTCOME_TYPE> createDataWriter(String name)
The implementation should be able to delegate to existing data writer factories
by updating the output directory. That is, go ahead and add setOutputDirectory
to the DirectoryDataWriter - but don't have CleartkMultiAnnotator call this
method - have the MultiAnnotatorDataWriterFactory (or whatever it is called) do
this. This way MultiCleartkAnnotator is not coupled to DirectoryDataWriter.
Original comment by phi...@ogren.info
on 7 Feb 2011 at 5:07
I'm not sure writing a MultiDataWriterFactory is necessarily the right
approach. Currently the dataWriterFactory is passed in as a configuration
parameter to CleartkMultiAnnotator. If I were to create a separate
MultiDataWriterFactory, I would have to create a Multi version for every
DataWriterFactory, thus to use the DefaultBinaryMalletDataWriterFactory, I
would have to create a DefaultBinaryMalletMultiDataWriterFactory that inherits
from MultiDataWriterFactory.
Original comment by lee.becker
on 7 Feb 2011 at 5:23
Hmm.... it just occurred to me that I could create a MultiDataWriterFactory
that took a regular DataWriterFactory class as a configuration parameter.
Does that make sense? Is it possible to chain factories like that?
Original comment by lee.becker
on 7 Feb 2011 at 5:23
"The implementation should be able to delegate to existing data writer
factories by updating the output directory" - use of PARAM_OUTPUT_DIRECTORY
means that the MultiAnnotatorDataWriterFactory implementation will have to
assume that the existing data writer factories have a PARAM_OUTPUT_DIRECTORY
(i.e. that they're a subclass of DirectoryDataWriter). So you've just moved the
assumption from MultiCleartkAnnotator to MultiAnnotatorDataWriterFactory.
That's okay with me - but we should clearly document this assumption somewhere,
regardless of whether it ends up in MultiCleartkAnnotator or
MultiAnnotatorDataWriterFactory.
Original comment by steven.b...@gmail.com
on 7 Feb 2011 at 5:30
re 12 - yes, that makes sense - exactly what I was thinking. Although I don't
think of it as "chaining" - I think of it as delegation.
I am fine with MultiAnnotatorDataWriterFactory being coupled with
DirectoryDataWriter - much preferred over having MultiCleartkAnnotator
coupled to it.
This really highlights the fact that we should be careful naming these
classes. There are a few different names used above. Here is my suggestion -
name the interface MultiDataWriterFactory and name its one implementation
MultiDirectoryDataWriterFactory.
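A rough sketch of the delegation discussed above, assuming the wrapped factory exposes a setOutputDirectory method. Every type here carries a `Sketch` prefix because it is a hypothetical stand-in; the real ClearTK interfaces and the code checked in for this issue may look different:

```java
import java.io.File;

/** Hypothetical stand-in for a ClearTK data writer. */
interface SketchDataWriter {
    File getOutputDirectory();
}

/** Hypothetical stand-in for a directory-based data writer factory. */
interface SketchDirectoryDataWriterFactory {
    void setOutputDirectory(File outputDirectory);

    SketchDataWriter createDataWriter();
}

/** The one-method interface proposed in the thread. */
interface SketchMultiDataWriterFactory {
    SketchDataWriter createDataWriter(String name);
}

/** Delegates to an ordinary factory after pointing it at a per-name subdirectory. */
class SketchMultiDirectoryDataWriterFactory implements SketchMultiDataWriterFactory {
    private final File baseDirectory;
    private final SketchDirectoryDataWriterFactory delegate;

    SketchMultiDirectoryDataWriterFactory(
            File baseDirectory, SketchDirectoryDataWriterFactory delegate) {
        this.baseDirectory = baseDirectory;
        this.delegate = delegate;
    }

    @Override
    public SketchDataWriter createDataWriter(String name) {
        // The directory is set before the writer is constructed, because many
        // writers read their configuration in the constructor.
        delegate.setOutputDirectory(new File(baseDirectory, name));
        return delegate.createDataWriter();
    }
}
```

The annotator only sees the one-method interface, so the assumption about directory-based factories stays inside the implementation class.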
Original comment by phi...@ogren.info
on 7 Feb 2011 at 5:57
The code is checked into revision 2430. Thanks for all the help and feedback.
Original comment by lee.becker
on 7 Feb 2011 at 11:15
Lee,
This is really great. I have immediate use for this as I port my coordination
code to cleartk-syntax-coordination.
I have reopened this issue to address a few things:
1) You have touched on some of the real ugliness of Java generics in this (do
you feel diseased!?) I get this stuff messed up all the time. I don't think
you actually made a mistake, per se - but I would suggest making the member
variables multiDataWriterFactory and multiClassifierFactory typed with
OUTCOME_TYPE instead of '?'. See the attached. This causes one test to fail -
but that's because the test is wrong with this version.
2) We should revisit what exception getClassifier and getDataWriter should
throw. ResourceInitializationException seems conceptually correct. However,
it will always be called inside a subclass's process method and so it will
have to get caught and an AnalysisEngineProcessException thrown. So, I would
vote that these methods throw AnalysisEngineProcessException.
3) What happens if the data writer factory is instantiated but getClassifier()
is called? I suppose it blows up because a NullPointerException is thrown. I
guess that is fine by me. Still, we might consider throwing a more descriptive
exception instead to ease debugging.
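Points 2 and 3 could be sketched together as a guard that throws a descriptive, process-time exception when no classifier factory is configured. `SketchProcessException` is a hypothetical stand-in for UIMA's AnalysisEngineProcessException, and `SketchModeGuard` is an illustrative name, not part of ClearTK:

```java
import java.util.function.Function;

/** Hypothetical stand-in for UIMA's AnalysisEngineProcessException. */
class SketchProcessException extends Exception {
    SketchProcessException(String message) {
        super(message);
    }
}

/** Holds a classifier factory, or null when only a data writer factory was configured. */
class SketchModeGuard<CLASSIFIER> {
    private final Function<String, CLASSIFIER> classifierFactory;

    SketchModeGuard(Function<String, CLASSIFIER> classifierFactory) {
        this.classifierFactory = classifierFactory;
    }

    boolean isTraining() {
        return classifierFactory == null;
    }

    /** Fails with an explanatory message instead of a bare NullPointerException. */
    CLASSIFIER getClassifier(String name) throws SketchProcessException {
        if (isTraining()) {
            throw new SketchProcessException(
                    "getClassifier(\"" + name + "\") was called, but no classifier factory"
                            + " is configured; the annotator appears to be in training mode");
        }
        return classifierFactory.apply(name);
    }
}
```

Because the exception type is already a process-time exception in this sketch, a subclass would not need to catch and rewrap it.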
That's all I noticed. I only looked at CleartkMultiAnnotator - so I have no
critique of the other classes yet.
Original comment by phi...@ogren.info
on 8 Feb 2011 at 12:14
I added license info to two files in r2431.
Original comment by phi...@ogren.info
on 8 Feb 2011 at 12:17
Original comment by phi...@ogren.info
on 12 Feb 2012 at 5:05
I think we may need to address comment #16 before closing this issue. I'm
reopening so that it won't get lost....
Original comment by phi...@ogren.info
on 12 Feb 2012 at 10:01
Original comment by steven.b...@gmail.com
on 24 Jul 2012 at 5:58
We should decide whether to keep this. If it's not useful, we may want to
deprecate it.
Original comment by steven.b...@gmail.com
on 31 Jul 2012 at 10:07
Fixed in r3968.
Original comment by lee.becker
on 6 Aug 2012 at 7:38
Original issue reported on code.google.com by
phi...@ogren.info
on 3 Feb 2011 at 6:31