Common format for (sentence) tokenizers

DavidNemeskey commented 7 years ago

Currently we have two (sentence-)tokenizers:

- QunToken (`hu.nytud.gate.tokenizers.QunTokenCommandLine`)
- the one in MagyarLanc (`com.precognox.kconnect.gate.magyarlanc.HungarianTokenizerSentenceSplitter`)

It would be nice if their output were in the same format. These are the differences I have found so far:

| Difference | QunToken | MagyarLanc |
| --- | --- | --- |
| Sentence tag | After sentence tokens | Before sentence tokens |
| Sentence features | `length`, `string` | nothing |
| Word features | `kind`, `length`, `string` | `length`, `string` |
| `SpaceToken.string` | preserves the token | always empty |

Some of these differences I can understand, such as the lack of the `kind` feature in MagyarLanc, which doesn't care at the tokenization phase whether a token is a word or punctuation. As for the rest, I suggest we come up with a standardized format and make the tool wrappers conform to it. For the first item, I would go with QunToken's approach and put the Sentence tag after the sentence's tokens to support stream parsers. For the rest, some input would be welcome.
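To make the proposal concrete, here is a minimal sketch of what a normalizing step in a wrapper could do. It is not GATE or hunlp-GATE code: the `Annotation` class and the `normalize` and `fill` functions are hypothetical, and the choices it makes (always deriving `string` and `length` from the document text, storing `length` as an integer) are assumptions, since the common format has not been agreed on yet. It only illustrates emitting each Sentence after its tokens and filling in the features that MagyarLanc leaves out.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-in for an annotation; not the GATE API."""
    type: str    # "Token", "SpaceToken" or "Sentence"
    start: int   # start offset in the document text
    end: int     # end offset (exclusive)
    features: dict = field(default_factory=dict)


def fill(ann, text):
    """Return a copy of ann with `string` and `length` derived from the text.

    Assumption: both features can always be recomputed from the offsets.
    Other features (such as `kind`) are kept untouched if present.
    """
    features = dict(ann.features)
    features["string"] = text[ann.start:ann.end]
    features["length"] = ann.end - ann.start
    return Annotation(ann.type, ann.start, ann.end, features)


def normalize(annotations, text):
    """Order annotations so that each Sentence follows its last token."""
    sentences = sorted((a for a in annotations if a.type == "Sentence"),
                       key=lambda a: a.end)
    others = sorted((a for a in annotations if a.type != "Sentence"),
                    key=lambda a: a.start)
    out = []
    i = 0
    for ann in others:
        out.append(fill(ann, text))
        # Emit every sentence that is fully covered by the tokens seen so far.
        while i < len(sentences) and sentences[i].end <= ann.end:
            out.append(fill(sentences[i], text))
            i += 1
    # Sentences that extend past the last token go at the very end.
    out.extend(fill(s, text) for s in sentences[i:])
    return out
```

With this ordering, a consumer reading the output as a stream already has all tokens of a sentence in hand when it reaches the Sentence annotation.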

sassbalint commented 7 years ago

Thanks for this useful comparison.

We may get around to addressing this issue at some point.

I think the MagyarLanc tokenizer can be considered obsolete: one of the reasons for creating a new tokenizer was to preserve SpaceTokens.

Meanwhile, in principle, we are open to pull requests.

DavidNemeskey commented 7 years ago

@sassbalint See #11.

nytud / hunlp-GATE

Common format for (sentence) tokenizers #9