GUM format writer does not check that cluster ids do not contain hyphens

dan-zeman commented 2 years ago

I ran this:

udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-train.conllu > newtrain.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-dev.conllu > newdev.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-test.conllu > newtest.conllu
validate.py --lang en --level 2 --coref new*.conllu

And got this:

[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref too-many-entity-attributes] Entity 'wsj-0001-c1--2' has 5 attributes while only 4 attributes are globally declared.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-entity-type] Spurious entity type '0001'.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-mention-head] Entity head index 'c1' must be a non-zero-starting integer.
...

The problem is that the cluster ids in the input format contain hyphens (e.g., "wsj-0001-c1"). One possibility would be to replace hyphens by empty strings or underscores or something else; but then one should check that the new id is not actually used elsewhere in the corpus, so perhaps it would be better to forcefully normalize cluster ids after reading via OldCorefUD, using the code that already is available in Udapi. The third option would be to simply throw a fatal error when trying to save in the GUM format entities with such ids; then it would be up to the caller to take care of them.

martinpopel commented 2 years ago

We have never released IDs such as "wsj-0001-c1", both CorefUD 0.1 and 0.2 use c1, c2,... It seems your input files are from some intermediate step in the conversion and you need to apply corefud.IndexClusters to get IDs without hyphens.

That said, I agree Udapi should never produce invalid CoNLL-U files (to such extent that it cannot load them). Throwing fatal errors in the writer seems too late (and it can make debugging more difficult because you cannot inspect the invalid file if the writer fails).

My suggestion is:

read.OldCorefUD will have a parameter substitute_hyphen_in_id_for. Its default value will be probably empty string, but one can use e.g. an underscore or -.
CorefCluster.__init__ will check that the provided ID (currently still called cluster_id, but I plan to rename it to eid) does not contain hyphens and other forbidden characters (=, | and any whitespace). Or do we want to restrict the allowed characters to alphanum and underscore?
CorefCluster.cluster_id is currently a read-only property (so corefud.IndexClusters edits cluster._cluster_id directly), but I can make it writable and check the forbidden characters.
write.Conllu (or rather the implementation in udapi.core.coref.store_coref_to_misc()) can delete all hyphens from IDs (potentially merging e.g. "a-1" and "a1" into the same cluster) and report a warning.

dan-zeman commented 2 years ago

We have never released IDs such as "wsj-0001-c1", both CorefUD 0.1 and 0.2 use c1, c2,... It seems your input files are from some intermediate step in the conversion and you need to apply corefud.IndexClusters to get IDs without hyphens.

It is true that we have not released such data, yet they were valid in the old format and in the old API. This is what I was exporting from P(CE)DT, and I will make sure it does not contain hyphens in future exports (like I removed other special characters, such as | , : + in the previous version). This issue was meant to alert you that Udapi should not pass it to the output.

I do not know if your suggestion is to do one of the points, or all of them. The optional fix in read.OldCorefUD, and the hard fix with warning in write.Conllu would be enough in my opinion.

I would still allow as many characters as possible in the ids.

udapi / udapi-python

GUM format writer does not check that cluster ids do not contain hyphens #97