udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0
55 stars 30 forks source link

GUM format writer does not check that cluster ids do not contain hyphens #97

Closed dan-zeman closed 2 years ago

dan-zeman commented 2 years ago

I ran this:

udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-train.conllu > newtrain.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-dev.conllu > newdev.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-test.conllu > newtest.conllu
validate.py --lang en --level 2 --coref new*.conllu

And got this:

[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref too-many-entity-attributes] Entity 'wsj-0001-c1--2' has 5 attributes while only 4 attributes are globally declared.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-entity-type] Spurious entity type '0001'.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-mention-head] Entity head index 'c1' must be a non-zero-starting integer.
...

The problem is that the cluster ids in the input format contain hyphens (e.g., "wsj-0001-c1"). One possibility would be to replace hyphens by empty strings or underscores or something else; but then one should check that the new id is not actually used elsewhere in the corpus, so perhaps it would be better to forcefully normalize cluster ids after reading via OldCorefUD, using the code that already is available in Udapi. The third option would be to simply throw a fatal error when trying to save in the GUM format entities with such ids; then it would be up to the caller to take care of them.

martinpopel commented 2 years ago

We have never released IDs such as "wsj-0001-c1", both CorefUD 0.1 and 0.2 use c1, c2,... It seems your input files are from some intermediate step in the conversion and you need to apply corefud.IndexClusters to get IDs without hyphens.

That said, I agree Udapi should never produce invalid CoNLL-U files (to such extent that it cannot load them). Throwing fatal errors in the writer seems too late (and it can make debugging more difficult because you cannot inspect the invalid file if the writer fails).

My suggestion is:

dan-zeman commented 2 years ago

We have never released IDs such as "wsj-0001-c1", both CorefUD 0.1 and 0.2 use c1, c2,... It seems your input files are from some intermediate step in the conversion and you need to apply corefud.IndexClusters to get IDs without hyphens.

It is true that we have not released such data, yet they were valid in the old format and in the old API. This is what I was exporting from P(CE)DT, and I will make sure it does not contain hyphens in future exports (like I removed other special characters, such as | , : + in the previous version). This issue was meant to alert you that Udapi should not pass it to the output.

I do not know if your suggestion is to do one of the points, or all of them. The optional fix in read.OldCorefUD, and the hard fix with warning in write.Conllu would be enough in my opinion.

I would still allow as many characters as possible in the ids.