ufal / conll2017

CoNLL 2017 Shared Task Proposal: UD End-to-End parsing

Suggestion for dealing with tokenisation mismatches #14

Closed: ftyers closed this issue 7 years ago

ftyers commented 8 years ago

Hey all! I had a thought while waiting at the airport about how we could deal with mismatches in tokenisation and have something that will work both for Turkic languages and for Chinese (and for other languages too).

The basic idea is that instead of relying on matching segments of surface forms, we use the surface forms only to delimit the character ranges that syntactic words can be found in.

This isn't quite a concrete proposal yet (I'm going to try and do an implementation), but I thought I'd get it out there early to see if people think I might be on to something, or if it is not worth pursuing.

So, suppose you have the sentence: "Bu ev mavidi." 'This house blue was.' [1]

The gold standard annotation is:

1   Bu  bu  DET DET _   2   det
2   ev  ev  NOUN    NOUN    Case=Nom    3   nsubj
3-4 mavidi  _   _   _   _   _   _
3   mavi    mavi    ADJ ADJ _   0   root
4   _   i   VERB    VERB    Tense=Past  3   cop
5   .   .   PUNCT   PUNCT   _   3   punct

But your tokeniser might produce (a):

1   Bu  bu  DET DET _   2   det
2   ev  ev  NOUN    NOUN    Case=Nom    3   nsubj
3-4 mavidi  _   _   _   _   _   _
3   mavi    mavi    ADJ ADJ _   0   root
4   di  i   VERB    VERB    Tense=Past  3   cop
5   .   .   PUNCT   PUNCT   _   3   punct

or even (b):

1   Bu  bu  DET DET _   2   det
2   ev  ev  NOUN    NOUN    Case=Nom    3   nsubj
3   mavi    mavi    ADJ ADJ _   0   root
4   di  i   VERB    VERB    Tense=Past  3   cop
5   .   .   PUNCT   PUNCT   _   3   punct

What we are really interested in is the syntactic words and their relations, but we don't want to count any word twice. We can use the character ranges of the surface forms in the gold standard to delimit the ranges in which the syntactic words should be found. For example:

Gold standard "bu|ev|mavidi|."

0-2  [(bu, DET, 2, det)]
2-4 [(ev, NOUN, 3, nsubj)]
4-10 [(mavi, ADJ, 0, root), (i, VERB, 3, cop)]
10-11 [(., PUNCT, 3, punct)]

(b) "bu|ev|mavi|di|."

0-2  [(bu, DET, 2, det)]
2-4 [(ev, NOUN, 3, nsubj)]
4-8 [(mavi, ADJ, 0, root)]
8-10 [(i, VERB, 3, cop)]
10-11 [(., PUNCT, 3, punct)]

As 4-8 and 8-10 fall within the range 4-10, both of those syntactic words would match, without having to rely on substring matching of the surface form.
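To make this concrete, here is a rough Python sketch of the containment check (my own illustration, not part of the proposal or of any evaluation script; the function names and the tuple format are just made up for the example):

```python
# A rough sketch of the range-based matching idea (not an official implementation).
# Surface tokens are mapped to character offsets in the space-less sentence, and a
# predicted syntactic word counts as a candidate match for a gold word whenever its
# character range falls inside the gold (multi-word) token's range.

def char_ranges(surface_tokens):
    """Map each surface token to its (start, end) offsets in the concatenated text."""
    ranges, offset = [], 0
    for token in surface_tokens:
        ranges.append((offset, offset + len(token)))
        offset += len(token)
    return ranges

# Gold surface tokens "bu|ev|mavidi|." and the syntactic words inside each token.
gold_words = {
    (0, 2): [("bu", "DET", 2, "det")],
    (2, 4): [("ev", "NOUN", 3, "nsubj")],
    (4, 10): [("mavi", "ADJ", 0, "root"), ("i", "VERB", 3, "cop")],
    (10, 11): [(".", "PUNCT", 3, "punct")],
}

# System (b) splits "mavidi" into two surface tokens, each being one syntactic word.
system_tokens = ["bu", "ev", "mavi", "di", "."]
system_ranges = char_ranges(system_tokens)   # [(0,2), (2,4), (4,8), (8,10), (10,11)]

def contained(inner, outer):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Both (4, 8) and (8, 10) are contained in the gold range (4, 10), so "mavi" and
# "di" are matched against the gold words of that token without any substring
# matching of the surface form.
for rng in system_ranges:
    candidates = [g for g in gold_words if contained(rng, g)]
    print(rng, "->", candidates)
```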

Caveats:

1) The heads would also need to be expressed with character ranges, but this could be done the same way.
2) The lemma field would need to be present to do the matching (I don't think this is entirely unreasonable: for treebanks that don't have lemmas, they could be added automatically, or the surface form could be used).

  [1] Apologies to Turkish speakers if this is unidiomatic.
martinpopel commented 8 years ago

Note that a related issue is discussed in the proposal:

Even if the system recognizes zur as contraction but outputs wrong syntactic word forms, the tokens will be considered incorrect

There is also a comment by @dan-zeman:

I am not sure that we want to do this. German contractions are a closed class and straightforward. But in Arabic or Tamil there may be recognizable contractions where one part is an OOV word. The system has no chance of knowing its full form but it may still be able to say that it is a noun and attach it accordingly.

At least we should disregard case in comparing the word forms, as it is unclear whether ZUM should be ZU DEM, Zu dem, zu dem or something else.

I think @foxik has implemented the evaluation script (including the alignment of gold and predicted words, where the words inside multi-word tokens are aligned using an LCS-like heuristic) and will post a link to the C++ code, so we can test it on various example sentences and parser outputs.

I suggest reformulating this issue in terms of what should be changed in the existing evaluation proposal (and its implementation), because right now I am not sure how your suggestion actually differs (i.e. what the main problem is).

ftyers commented 8 years ago

Yes, I suppose I have either misunderstood something or been unclear. Could you clarify: the LCS-like heuristic, does it work on surface forms, on syntactic words, or on something else?

It would be great to see the example code :)

foxik commented 8 years ago

In your example, the di word will match in neither (a) nor (b), because the gold FORM (second column) is missing. Also note that (a) would match (b) perfectly.

As for using the lemma to do the matching: if we used lemmas for the matching inside multiword tokens (i.e., as in the current proposal, except that LEMMAs would be used instead of FORMs), that would perfectly align both (a) and (b). However, the participants are not required to perform lemmatization at all, and requiring it just for evaluation does not seem a good idea to me.

If you are interested in code, I have a first version in UDPipe -- that is definitely not the evaluation script we will use (we will probably have some Python script) and it cannot evaluate a given file (only a UDPipe model); however, the best_alignment method finds the alignment in the same way as in the current proposal: https://github.com/ufal/udpipe/blob/0ae33452e8adde69f9b20d22236ecfcdee71e32b/src/model/evaluator.cpp#L276. Note that it is quite tricky and not well commented (and may even be buggy at this point).
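For anyone who wants the gist without reading the C++, here is a toy Python sketch of a form-based LCS alignment inside one multi-word token (my own simplification, not the linked best_alignment code and not the eventual evaluation script):

```python
# A simplified illustration (mine, not UDPipe's) of aligning gold and predicted
# syntactic words inside one multi-word token by longest common subsequence over
# the FORM column. Words with an empty gold FORM ("_") can never match, which is
# why the "di"/"i" word stays unaligned in the example from this issue.

def lcs_align(gold_forms, system_forms):
    """Return index pairs (gold_i, system_j) of one LCS over the word forms."""
    n, m = len(gold_forms), len(system_forms)
    # dp[i][j] = length of the LCS of gold_forms[i:] and system_forms[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if gold_forms[i] != "_" and gold_forms[i] == system_forms[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if gold_forms[i] != "_" and gold_forms[i] == system_forms[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

# Gold words inside "mavidi" have forms ["mavi", "_"]; system (b) produced ["mavi", "di"].
print(lcs_align(["mavi", "_"], ["mavi", "di"]))   # [(0, 0)] -- "di" never matches "_"
```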

foxik commented 7 years ago

@ftyers Are you planning to work on this?

To sum up, the current proposal is to use the ranges of the surface forms, and:

In your example, the FORM column is not filled in the gold data, so it would not match the di word present in (a) and (b).

However, as noted above, I think we cannot rely on lemmas for the matching of the syntactic words, because lemmas and POS tags are not required from the participants. In theory, we could match the words inside multiword tokens without the FORM column, simply in whatever way results in the highest LAS (or other metric) score. That would allow us to circumvent the missing FORMs in the gold data -- however, it is definitely a bad idea. (The alignment would then depend on the dependency edges, so it could not be computed, for example, when a POS tagger outputs words with only POS tags.)

foxik commented 7 years ago

Closing, as we have published the rules (using FORM matching only).