Closed: DavidNemeskey closed this issue 7 years ago
Thanks for this useful comparison.
Someday we might do something about this issue.
I think the MagyarLanc tokenizer can be considered obsolete; one of the reasons for creating a new tokenizer was to preserve SpaceTokens.
Meanwhile, in principle, we are open to pull requests.
@sassbalint See #11.
Currently we have two (sentence-)tokenizers:
1. hu.nytud.gate.tokenizers.QunTokenCommandLine
2. com.precognox.kconnect.gate.magyarlanc.HungarianTokenizerSentenceSplitter
It would be nice if their output were in the same format. These are the differences I have found thus far:
1. The position of the `Sentence` tag relative to the tokens it contains.
2. `Token` features: QunToken emits `kind`, `length` and `string`; MagyarLanc emits only `length` and `string`.
3. `SpaceToken`s: QunToken emits them with `length` and `string` features (`SpaceToken.string` holds the whitespace itself); MagyarLanc does not emit them at all.
Some of these differences I can understand, such as the lack of the `kind` feature in MagyarLanc, since at the tokenization phase it doesn't care whether a token is a word or punctuation. As for the rest, I suggest we agree on a standardized format and make the tool wrappers conform to it. For the first item, I would go with how QunToken does it and put the Sentence tag at the end, to support stream parsers. For the rest, some input would be welcome.
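To make the proposal concrete, here is a minimal sketch of the normalization step a wrapper could apply to MagyarLanc-style output: adding the missing `kind` feature and synthesizing `SpaceToken`s from the gaps between tokens. This uses plain Python dicts as a hypothetical stand-in for GATE annotations (it is not the real GATE Embedded API), and the `word`/`punctuation` kind values are an assumption about QunToken's convention:

```python
def normalize_tokens(text, tokens):
    """Normalize MagyarLanc-style Token annotations (dicts with only
    'start'/'end' offsets) toward the richer QunToken-style format:
    add 'kind', 'length' and 'string' features, and emit SpaceToken
    annotations for the whitespace gaps between consecutive tokens.
    Hypothetical data model, not the actual GATE annotation classes."""
    out = []
    prev_end = None
    for tok in sorted(tokens, key=lambda t: t["start"]):
        start, end = tok["start"], tok["end"]
        # Synthesize a SpaceToken covering the gap before this token.
        if prev_end is not None and start > prev_end:
            gap = text[prev_end:start]
            out.append({"type": "SpaceToken", "start": prev_end,
                        "end": start, "length": len(gap), "string": gap})
        s = text[start:end]
        # 'kind' distinguishes words from punctuation (assumed values).
        kind = "word" if any(c.isalnum() for c in s) else "punctuation"
        out.append({"type": "Token", "start": start, "end": end,
                    "kind": kind, "length": len(s), "string": s})
        prev_end = end
    return out
```

For example, `normalize_tokens("Jó napot!", [{"start": 0, "end": 2}, {"start": 3, "end": 8}, {"start": 8, "end": 9}])` yields two word `Token`s, one `SpaceToken` for the gap, and a punctuation `Token` for `!`.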