nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0

Common format for (sentence) tokenizers #9

Closed DavidNemeskey closed 7 years ago

DavidNemeskey commented 7 years ago

Currently we have two (sentence-)tokenizers: QunToken and MagyarLanc.

It would be nice if their output were in the same format. These are the differences I have found so far:

| Difference | QunToken | MagyarLanc |
| --- | --- | --- |
| Sentence tag | After sentence tokens | Before sentence tokens |
| Sentence features | length, string | nothing |
| Word features | kind, length, string | length, string |
| SpaceToken.string | preserves the token | always empty |

Some of these differences I can understand, such as the lack of the kind feature for MagyarLanc, as it doesn't care whether the token is a word or punctuation at the tokenization phase. As for the rest, I suggest we come up with a standardized format and make the tool wrappers conform. For the first item, I would go with how QunToken does it and put the Sentence tag at the end to support stream parsers. For the rest, some input would be welcome.
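To make the proposal concrete, here is a minimal sketch of what a normalization step could look like. It is written in plain Python over a hypothetical `Annotation` record and does not use the GATE API; the feature names come from the comparison above, while the `normalize` function, the integer `length` value, and the choice to fill in `string` and `length` for every annotation type are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-in for a GATE annotation (not the GATE API)."""
    type: str      # "Token", "SpaceToken" or "Sentence"
    start: int     # start offset in the document text
    end: int       # end offset (exclusive)
    features: dict = field(default_factory=dict)


def normalize(annotations, text):
    """Return the annotations in one common format, whichever tool made them."""
    out = []
    for ann in annotations:
        covered = text[ann.start:ann.end]
        features = dict(ann.features)  # keep extras such as 'kind' untouched
        # Assumption: every annotation carries 'string' and 'length'.
        # This also restores SpaceToken.string when a tool left it empty.
        features["string"] = covered
        features["length"] = len(covered)
        out.append(Annotation(ann.type, ann.start, ann.end, features))
    # Stream order: sort by end offset; at equal end offsets the Sentence
    # goes last, so it always follows the tokens it covers.
    out.sort(key=lambda a: (a.end, a.type == "Sentence", a.start))
    return out


# MagyarLanc-style input: Sentence first, no Sentence features,
# empty SpaceToken.string.
text = "Hi there."
magyarlanc_style = [
    Annotation("Sentence", 0, 9),
    Annotation("Token", 0, 2, {"string": "Hi", "length": 2}),
    Annotation("SpaceToken", 2, 3, {"string": ""}),
    Annotation("Token", 3, 8, {"string": "there", "length": 5}),
    Annotation("Token", 8, 9, {"string": ".", "length": 1}),
]
normalized = normalize(magyarlanc_style, text)
```

With this ordering, a stream parser can process the tokens of a sentence as they arrive and treat the Sentence annotation as the end-of-sentence signal. The `kind` feature is passed through when present and not invented when absent, which matches the one difference that seems justified.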

sassbalint commented 7 years ago

Thanks for this useful comparison.

We may get around to addressing this issue someday.

I think the MagyarLanc tokenizer can be considered obsolete; one of the reasons for creating a new tokenizer was to preserve SpaceTokens.

Meanwhile, in principle, we are open to pull requests.

DavidNemeskey commented 7 years ago

@sassbalint See #11.