ClearNLP wrappers should support option of using different classes for Tokens and Sentences

GoogleCodeExporter commented 9 years ago

Currently the ClearNLP tokenizer wrapper is hard-coded to operate over windows 
of org.cleartk.type.Sentence and to create instances of org.cleartk.type.Token. 
 While, it is possible to add configuration parameters to specify the window 
and token classes, Steve Bethard offers the following, more general solution:

I think this is the wrong direction to go. If we want type-system 
agnostic code, we can't assume that each token (or sentence, or 
whatever) corresponds to a single annotation subclass in every type 
system. Take the cTAKES type system for example - there are multiple 
types of tokens depending on whether they're punctuation, or words, or 
something else. So sprinkling around "private Class<? extends 
Annotation> xxxClass" is not a good general solution. 

Instead, each of our annotators should have a base class that declares 
the abstract methods that it needs to perform its work. For example, 
most of TokenAnnotator should be moved into a new class, 
TokenAnnotator_ImplBase<WINDOW_TYPE extends Annotation>, that declares 
two methods: 

    protected abstract Collection<WINDOW_TYPE> selectWindows(JCas jCas); 
    protected abstract void addToken(JCas jCas, int begin, int end); 

Then, TokenAnnotator would just extend TokenAnnotator_ImplBase and 
define these two methods as appropriate for the ClearTK type system. 
And if you needed to work with a different type system in a different 
project, you'd just create a different subclass of 
TokenAnnotator_ImplBase that implemented those methods as necessary.

Original issue reported on code.google.com by lee.becker on 3 Apr 2013 at 10:34

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 17 Apr 2013 at 12:12

Added labels: Milestone-2.0

GoogleCodeExporter commented 9 years ago

Did anyone have any feedback on the code in branch issue-359?

Original comment by lee.becker on 7 May 2013 at 6:02

GoogleCodeExporter commented 9 years ago

Sorry for not giving some feedback earlier. Here are my thoughts:

* Rather than have the getXXXOps methods, I think I'd just put the necessary 
Ops methods in the constructor of the _ImplBase class. Then all subclasses of 
the _ImplBase just provide a no-args constructor that passes the correct XXXOps 
instances to the _ImplBase constructor.

* I don't think we should name anything SRL if we don't have to. SRL has too 
many meanings. We should just call them SemanticRoleOps, etc.

* Things like DEPArc shouldn't show up anywhere in the Ops classes. The 
eventual plan is to use these Ops classes across the different components, so 
they shouldn't depend on anything ClearNLP-specific.

* SemanticRoleOps shouldn't assume that a predicate or an argument are exactly 
one token in length. For PropBank, arguments at least are almost always longer 
than that.

* DependencyOps shouldn't assume that there's a single head. Only 
DependencyParser_ImplBase, etc. should assume that.

I just pushed a bunch of changes that address these issues. If the results look 
good to you, feel free to merge the branch back into master. (And if you do so 
before Philip makes a release (this weekend?), please update the Milestone on 
this ticket to 1.4.)

Original comment by steven.b...@gmail.com on 8 May 2013 at 11:44

GoogleCodeExporter commented 9 years ago

Original comment by lee.becker on 8 May 2013 at 2:42

Changed state: Fixed
Added labels: Milestone-1.4
Removed labels: Milestone-2.0

noushadali / cleartk

ClearNLP wrappers should support option of using different classes for Tokens and Sentences #359