nytud / emtsv

e-magyar text processing system -- inter-module communication via tsv + REST API
GNU Lesser General Public License v3.0

Handle CoNLL-U comments #27

Open DavidNemeskey opened 3 years ago

DavidNemeskey commented 3 years ago

emtsv does not handle CoNLL-U comments very well. If the input is a tsv file, two things happen:

  1. If the file only has the form column, comments (lines starting with "#") are analyzed as if each were a single "word" token.
  2. If the file has other columns as well (e.g. form, anas, lemma, xpostag, to which I want to add upostag and feats), only the new header is returned.

Expected behavior: comments should be kept in the text and returned as-is, and they should not prevent emtsv from analyzing the text (as happens in the second case).
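
To make the expected behavior concrete, here is a minimal sketch of a comment-preserving TSV pass. It is not emtsv or xtsv code: `process_tsv`, `analyze` and `new_fields` are hypothetical names used only for illustration, and the sketch assumes that comments can only appear at the start of a sentence, before its first token.

```python
from typing import Callable, Iterable, Iterator, List


def process_tsv(lines: Iterable[str],
                analyze: Callable[[List[str]], List[str]],
                new_fields: List[str]) -> Iterator[str]:
    """Yield processed TSV lines (without newlines), keeping comments as-is.

    Hypothetical helper, not part of emtsv/xtsv.
    """
    it = iter(lines)
    header_line = next(it, None)
    if header_line is None:
        return  # empty input: nothing to emit
    header = header_line.rstrip("\n").split("\t")
    yield "\t".join(header + new_fields)

    in_sentence = False  # assumption: comments only precede the first token
    for raw in it:
        line = raw.rstrip("\n")
        if line == "":
            # An empty line closes the sentence
            in_sentence = False
            yield line
        elif not in_sentence and line.startswith("#"):
            # Comment before the sentence: return it unchanged
            yield line
        else:
            # Ordinary token line: append the newly computed columns
            in_sentence = True
            fields = line.split("\t")
            yield "\t".join(fields + analyze(fields))
```

With this logic a "#" inside a sentence is still analyzed as a token, but a sentence whose first token is "#" would be read as a comment, so a real implementation needs a stricter rule for that case.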

dlazesz commented 3 years ago

CoNLL-U comments need to be explicitly enabled with the conllu-comments parameter. We may flip the default behaviour to enabled in some future release.

I agree that the documentation is very sparse on this.

DavidNemeskey commented 3 years ago

Yes, I think it would make sense if that was the default. Should I do it in a PR (+ add a sentence about it to the docs)?

dlazesz commented 3 years ago

Specifying this in the docs is OK, but changing the default in xtsv requires a new major version, at least in xtsv. Such breaking changes should be committed in batches to minimise disruption. (We have others in mind.)

@mittelholcz What do you think?