Whitespace can confuse StringMatchingRecommender

inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.

https://inception-project.github.io

Apache License 2.0


Whitespace can confuse StringMatchingRecommender #910

Status: Closed (reckart closed this issue 5 years ago)

reckart commented 5 years ago

Describe the bug: If the StringMatchingRecommender learns that "John Smith" is a person, it does not predict that "John\tSmith" (the same two tokens separated by a tab) is a person as well.

To reproduce:

Import a document containing both "John Smith" and "John\tSmith", then test the recommender.

Expected behavior: The kind of whitespace that separates tokens should not confuse the recommender.

Please complete the following information:

Version and build ID: 4a9f918d75a0ac64af72dbc61b0398c61860f8e2
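The mismatch reported above can be illustrated with plain Java strings, independent of INCEpTION. This is only a minimal sketch, not the recommender's actual code: the class `WhitespaceMatchDemo` and both of its methods are hypothetical names introduced for illustration. It contrasts a verbatim substring lookup with a lookup that treats any run of whitespace as equivalent.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of INCEpTION.
public class WhitespaceMatchDemo {

    // Verbatim lookup: the learned surface form must occur character for character.
    public static boolean exactMatch(String text, String learned) {
        return text.contains(learned);
    }

    // Whitespace-tolerant lookup: each run of whitespace in the learned form
    // may match any run of whitespace in the text.
    public static boolean whitespaceTolerantMatch(String text, String learned) {
        String[] parts = learned.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            // Quote each part so that regex metacharacters are matched literally.
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString()).matcher(text).find();
    }

    public static void main(String[] args) {
        String learned = "John Smith";   // space-separated, as learned
        String text = "John\tSmith";     // tab-separated, as in the document

        System.out.println(exactMatch(text, learned));              // prints "false"
        System.out.println(whitespaceTolerantMatch(text, learned)); // prints "true"
    }
}
```

A regex per learned form would likely be too slow for a large dictionary; the sketch is only meant to make the expected behavior concrete.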

jcklie commented 5 years ago

Don't we normalize whitespace somewhere?

reckart commented 5 years ago

Not at the level of the document text that is in the CAS. The CAS contains whatever whitespace the original document contained (and that passed through the DKPro Core Reader). We do a bit of normalization, e.g. when sending text to brat.

reckart commented 5 years ago

Note, for other recommenders, this shouldn't be a problem because they operate on Tokens and Tokens normally don't include whitespace. But the StringMatchingRecommender operates directly on the CAS document text.

jcklie commented 5 years ago

What happens if we introduce the sentence-level recommender from #590? Would we get the bad whitespace from the sentence?

reckart commented 5 years ago

Well, normally a recommender would operate on tokens, not on the base text. But for the StringMatchingRecommender, it is actually easier to apply the Trie directly to the base text than to first construct a string from the tokens.
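One possible way to keep matching directly on the base text while ignoring the kind of whitespace is to match against a whitespace-normalized copy and map hit offsets back to the original text. The following is a hedged sketch of that idea, not the fix that was applied to INCEpTION: the class `NormalizedText` and its members are hypothetical, and a plain `indexOf` scan stands in for the Trie lookup. It assumes learned forms carry no meaningful leading or trailing whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of INCEpTION.
public class NormalizedText {
    // Normalized copy: every run of whitespace is collapsed to one space.
    public final String text;
    // origOffset[i] is the index in the original text of normalized character i.
    public final int[] origOffset;

    public NormalizedText(String original) {
        StringBuilder sb = new StringBuilder();
        int[] map = new int[original.length()];
        boolean inWhitespace = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (Character.isWhitespace(c)) {
                if (!inWhitespace) {
                    map[sb.length()] = i; // first character of the whitespace run
                    sb.append(' ');
                }
                inWhitespace = true;
            } else {
                map[sb.length()] = i;
                sb.append(c);
                inWhitespace = false;
            }
        }
        this.text = sb.toString();
        this.origOffset = Arrays.copyOf(map, sb.length());
    }

    // Returns [begin, end) offsets in the ORIGINAL text for every occurrence
    // of the learned form, regardless of which whitespace separates its parts.
    public List<int[]> findAll(String learned) {
        String needle = new NormalizedText(learned).text.trim();
        List<int[]> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits;
        }
        int from = 0;
        while ((from = text.indexOf(needle, from)) >= 0) {
            int begin = origOffset[from];
            int end = origOffset[from + needle.length() - 1] + 1;
            hits.add(new int[] { begin, end });
            from++;
        }
        return hits;
    }

    public static void main(String[] args) {
        String original = "John Smith and John\t\tSmith";
        for (int[] hit : new NormalizedText(original).findAll("John Smith")) {
            // Offsets refer to the original text, so they could be used for annotations.
            System.out.println(hit[0] + "-" + hit[1]);
        }
    }
}
```

Because the returned offsets refer to the original text, the whitespace stored in the CAS would stay untouched; only the lookup operates on the normalized copy.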

reckart commented 5 years ago

@Rentier for a sentence-level recommender cf. OpenNlpDoccatRecommender.