noushadali / cleartk

Automatically exported from code.google.com/p/cleartk
0 stars 0 forks source link

TokenAnnotator.getDescription() has bad defaults #171

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
So TokenAnnotator.getDescription() uses default values for all its parameters, 
meaning:

PARAM_TOKENIZER_NAME=org.cleartk.token.util.PennTreebankTokenizer
PARAM_TOKEN_TYPE_NAME=org.cleartk.type.Token
PARAM_WINDOW_TYPE_NAME=null

This is a *bad* combination - the null PARAM_WINDOW_TYPE_NAME means that pretty 
much every period at the end of every sentence will be tokenized with the last 
word instead of separately. So this needs to be fixed one of two ways:

(1) Make PARAM_WINDOW_TYPE_NAME default to org.cleartk.type.Sentence
(2) Leave PARAM_WINDOW_TYPE_NAME as defaulting to null, but override that with 
org.cleartk.type.Sentence in TokenAnnotator.getDescription()

I'm mildly partial to the former, since the defaults as they are right now are 
just wrong. What do you guys think?

Original issue reported on code.google.com by steven.b...@gmail.com on 5 Nov 2010 at 11:26

GoogleCodeExporter commented 9 years ago
I like option 1.  In fact, I am pretty certain I tried making this change once 
upon a time, discovered that it broke some tests, changed it back, and then 
left a little note for you above the configuration parameter "windowTypeName".  
It should be easy enough to fix any unit tests that break by adding in the 
necessary sentence annotations.  

Original comment by pvogren@gmail.com on 5 Nov 2010 at 2:43

GoogleCodeExporter commented 9 years ago
btw - I am ambivalent about having getDescription methods at all, leaning 
towards against them.  Often when I use ClearTK components I am using a 
completely different type system.  Especially, in a case like this where it is 
doing very little....  

Original comment by pvogren@gmail.com on 5 Nov 2010 at 2:46

GoogleCodeExporter commented 9 years ago
It breaks one test, 
org.cleartk.token.TokenizerAndTokenAnnotatorTest.ticket176(). And the problem 
seems to be that UimaContext.getConfigParameterValue doesn't distinguish 
between a null value that means it wasn't specified, and an explicit null value 
passed in. So you can't just append:

    TokenAnnotator.PARAM_WINDOW_TYPE_NAME, null

to the argument list of AnalysisEngineFactory.createPrimitive - because this is 
treated the same as if PARAM_WINDOW_TYPE_NAME was not specified at all. If we 
still want to stick with solution (1), we'd need to do something like allow "" 
as the value that means no window types, and you'd have to specify this 
explicitly if you wanted no window annotation.

I guess another possibility is to try to catch the mistake and issue a warning, 
e.g. if PARAM_TOKENIZER_NAME=org.cleartk.token.util.PennTreebankTokenizer, then 
PARAM_WINDOW_TYPE_NAME should never be null. We could just issue an warning 
whenever we spot this kind of mistake.

Original comment by steven.b...@gmail.com on 5 Nov 2010 at 4:07

GoogleCodeExporter commented 9 years ago
So, the only way to have windowTypeName end up with a null value is to not give 
it a default value and not specify a value in the uimaContext?  Interesting!  
...and frustrating.  This sounds a bit like a possible bug in uimaFIT - but I'm 
not sure how it would handle this either. I suppose I should at least file an 
issue - because we should at least document this issue if we can't take care of 
it.  

One possibility is to pass in a different value in the test like 
"org.apache.uima.jcas.tcas.DocumentAnnotation".  Another possibility is to 
revert to option (2) above.  I think the warning above is kind of dorky - but 
probably harmless.    

Original comment by pvogren@gmail.com on 5 Nov 2010 at 4:29

GoogleCodeExporter commented 9 years ago
It's more of a bad design decision in UIMA than a bug in uimaFIT - the 
documentation for UimaContext.getConfigParameterValue(String) says:

This method returns null if the parameter is optional and has not been assigned 
a value.
http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/api/org/apach
e/uima/UimaContext.html#getConfigParameterValue(java.lang.String)

UimaFIT loads the default value for a parameter if getConfigParameterValue 
returns null. But I'm not really sure what else it could do given the available 
UimaContext methods.

I like the DocumentAnnotation idea. We could then default to 
PARAM_WINDOW_TYPE_NAME=org.cleartk.type.Sentence, which would get rid of these 
really subtle extra-periods-at-the-ends-of-a-few-words bugs for people using 
the ClearTK type system. People using their own type system and who 
accidentally left PARAM_WINDOW_TYPE_NAME=org.cleartk.type.Sentence would get 
much more obvious bugs (no tokens would be created). We could put some 
documentation clearly at the top of TokenAnnotator explaining why you might bet 
such bugs and recommending 
PARAM_WINDOW_TYPE_NAME=org.apache.uima.jcas.tcas.DocumentAnnotation for these 
situations.

If this sounds okay, let me know, and I'll go ahead and make this change, and 
update the tests and documentation.

Original comment by steven.b...@gmail.com on 7 Nov 2010 at 3:04

GoogleCodeExporter commented 9 years ago
This sounds good to me.  Thanks for your attention to this!  

Original comment by pvogren@gmail.com on 8 Nov 2010 at 3:48

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r2089.

Original comment by steven.b...@gmail.com on 8 Nov 2010 at 1:53

GoogleCodeExporter commented 9 years ago
Fixed in r2089.

Original comment by steven.b...@gmail.com on 8 Nov 2010 at 1:54

GoogleCodeExporter commented 9 years ago
So this is broken again. I don't know why - maybe either lost in the merge or 
broken when cleartk-syntax-opennlp was updated? Could whoever broke this please 
fix it again and add the test back in? See r2089 linked to in this issue for 
the patch.

Original comment by steven.b...@gmail.com on 26 Jan 2011 at 2:27

GoogleCodeExporter commented 9 years ago
This must have slipped back in by accident during the project re-org.  Sorry!  
It is fixed in r2361.

Original comment by phi...@ogren.info on 26 Jan 2011 at 6:25