vnadgir / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Lemmatizers should not assign null to Lemma.value #326

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When some lemmatizers encounter a token covering the text "_" (as produced by, 
for example, BreakIteratorSegmenter), they set Lemma.value to null.  Annotators 
further down the pipeline which process Lemma annotations aren't expecting a 
null value here and may throw exceptions or otherwise misbehave.

In such cases the lemmatizer should instead set the value of Lemma.value to the 
covered text.

This is a tracking bug for issues involving individual lemmatizers in DKPro 
Core.

Original issue reported on code.google.com by tristan.miller@nothingisreal.com on 22 Jan 2014 at 1:32

GoogleCodeExporter commented 9 years ago

Original comment by tristan.miller@nothingisreal.com on 22 Jan 2014 at 1:32

GoogleCodeExporter commented 9 years ago
ClearNLP Lemmatizer does not have this issue with underscores.

Original comment by pedrobss...@gmail.com on 24 Jan 2014 at 10:48

GoogleCodeExporter commented 9 years ago
This issue is not about underscores; it is about the lemmatizer returning 
"null" in some cases because it may not know how to lemmatize a certain word.

Original comment by richard.eckart on 25 Jan 2014 at 12:20

GoogleCodeExporter commented 9 years ago
In the description: "When some lemmatizers encounter a token covering the text 
"_" [...]"

Anyway, clearnlp does not assign null for underscores.

Original comment by pedrobss...@gmail.com on 25 Jan 2014 at 12:32

GoogleCodeExporter commented 9 years ago
The underscore problem is probably specific to the Stanford lemmatizer.  For 
most other words that it can't lemmatize, it just defaults to the covered 
text.  However, the Stanford lemmatizer and other lemmatizers may have a 
similar problem with other edge cases.  The fix I applied to 
StanfordLemmatizer will probably work with every other lemmatizer; I just did 
something like

if (lemma.getValue() == null) {
  lemma.setValue(token.getCoveredText());
}

(Don't have my development environment in front of me so this probably isn't 
exactly what I wrote, but you get the picture.)

Original comment by tristan.miller@nothingisreal.com on 25 Jan 2014 at 11:45
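
For readers without the DKPro Core sources at hand, the fallback described 
above can be sketched in a self-contained way. This is only an illustration, 
not the code committed to StanfordLemmatizer: SimpleLemma and LemmaFallback 
are hypothetical stand-ins for the real UIMA types, and the covered text is 
passed in as a plain String.

```java
import java.util.Objects;

// Hypothetical stand-in for a Lemma annotation; the real DKPro Core type
// is a UIMA JCas class and is not reproduced here.
final class SimpleLemma {
    private String value;

    String getValue() { return value; }

    void setValue(String value) { this.value = value; }
}

// Hypothetical helper showing the null fallback discussed in this issue.
final class LemmaFallback {
    private LemmaFallback() {}

    // Stores the lemmatizer's result, or the covered text when the result is null.
    static void assignLemma(SimpleLemma lemma, String lemmatizerResult, String coveredText) {
        Objects.requireNonNull(coveredText, "coveredText");
        lemma.setValue(lemmatizerResult != null ? lemmatizerResult : coveredText);
    }
}
```

Applying the fallback where the value is assigned, rather than patching it 
afterwards, means downstream annotators never observe a null Lemma.value.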

GoogleCodeExporter commented 9 years ago
Need to check whether all lemmatizers check for null values and, where the 
value is null, set the covered text as the lemma.

Original comment by richard.eckart on 6 Aug 2014 at 8:36
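
One way to run such a check is to feed a list of edge-case tokens through each 
lemmatizer and collect the ones that come back null. The sketch below assumes 
the lemmatizer can be wrapped as a plain String-to-String function; 
NullLemmaCheck is a hypothetical helper and not part of DKPro Core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for auditing a lemmatizer wrapped as a function.
final class NullLemmaCheck {
    private NullLemmaCheck() {}

    // Returns the tokens for which the given lemmatizer function yields null.
    static List<String> tokensWithNullLemma(Function<String, String> lemmatizer,
                                            List<String> tokens) {
        List<String> offenders = new ArrayList<>();
        for (String token : tokens) {
            if (lemmatizer.apply(token) == null) {
                offenders.add(token);
            }
        }
        return offenders;
    }
}
```

An empty result for a representative token list (including "_") suggests the 
lemmatizer already handles the cases covered by that list; it does not prove 
that no other input yields null.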

GoogleCodeExporter commented 9 years ago

Original comment by eriklan.dodinh@gmail.com on 15 Aug 2014 at 9:05

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r488 
(https://code.google.com/p/dkpro-core-gpl/source/detail?r=488).

- Ensured a non-null Lemma.value (more specifically, a fallback to 
Token.getCoveredText()) for GPL components (GateLemmatizer, MateLemmatizer, 
SfstAnnotator)

Original comment by eriklan.dodinh@gmail.com on 15 Aug 2014 at 9:34

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r2730.

- Ensured a non-null Lemma.value (more specifically, a fallback to 
Token.getCoveredText()) for ASL components (ClearNlpLemmatizer, 
CogrooLemmatizer, MeCabTagger, MorphaLemmatizer)
- A non-null value was already ensured for LanguageToolLemmatizer, TreeTagger 
(and TokenMerger)

Original comment by eriklan.dodinh@gmail.com on 15 Aug 2014 at 10:17

GoogleCodeExporter commented 9 years ago

Original comment by eriklan.dodinh@gmail.com on 15 Aug 2014 at 10:23

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r2747.

Merging into 1.6.x branch

Original comment by pedrobss...@gmail.com on 20 Aug 2014 at 9:31

GoogleCodeExporter commented 9 years ago
This issue was fixed by revision 
https://code.google.com/p/dkpro-core-gpl/source/detail?r=497

Original comment by pedrobss...@gmail.com on 20 Aug 2014 at 9:46