ufal / treex

Treex NLP framework
33 stars 6 forks source link

Czech analysis merged to one file again #60

Closed tuetschek closed 7 years ago

tuetschek commented 7 years ago

The structure now follows the English analysis (with parameters tokenizer, tagger, ner, parser, and tecto that may be set to none to disable the given part of the analysis).

The original Analysis::CS::[AMNT] modules now use the joint module with parameters.

tuetschek commented 7 years ago

@ptakopysk and @martinpopel : could you please check this?

martinpopel commented 7 years ago

Great, thanks! I had this on my todo-list for a while.

ptakopysk commented 7 years ago

Thx, LGTM.

However, you changed the semantics in case of Analysis::CS::M, which in fact was rather W+M (tokenizer & tagger), but now it is only M (no tokenizer, just tagger). I have just updated my code not to use these M and A scenarios at all, to be on the safe side, but if someone else uses Analysis::CS::M, this commit has broken his workflow (I don't think anyone uses it, but I don't know how to find out whether this is true).

I have several suggestions for a fix:

  1. put tokenization back into Analysis::CS::M (i.e. remove "tokenizer=none" from it), or
  2. add Analysis::CS::W doing the tokenization, or
  3. ignore this because nobody is probably affected, or even
  4. remove these M, A, N, T blocks because nobody probably uses them now (and the introduced parametrization is a clean and easy way to achieve the same thing)

I would probably vote for 4., I have no personal attachment to these blocks, even though I had authored them ;-) However, if we decide to keep M as it is now, at least the Description section of its documentation should be updated accordingly.

martinpopel commented 7 years ago

Any solution 1-4 LGTM is OK for me, probably also slightly preferring 4 (Why have it only for CS and not for other languages? But introducing these subscenarios files for each language and keeping POD in sync is some extra work).

In future (if someone has time), we could implement tokenizer=auto, that is the tokenization block would exit if it detects the tokens (a-tree with a-nodes) are already there.

tuetschek commented 7 years ago

Oops, you're right. I only kept the Analysis::CS::{M,A,N,T} blocks so that I wouldn't break your code, @ptakopysk . Looks like I managed to break it anyway... OK, I'll delete them shortly :-).