Open kbenoit opened 6 years ago
Completely right. If spacyr does some things that I'm not aware of, feel free to report them.
Note that if you want to reproduce this with the R code in this repository, you need the latest version of udpipe, as it contains the function as_conllu,
which converts spacyr output to CoNLL-U format.
devtools::install_github("bnosac/udpipe", build_vignettes = TRUE)
Regarding the English evaluation, what I wanted to point out is that the UAS and LAS metrics there are misleading, probably due to the difference in treebanks used when building the models (spaCy: OntoNotes, udpipe: UD_English). That's why, for English, only the UPOS and XPOS measures seem relevant to compare.
Is spacyr doing anything different for English versus other language models?
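For context: over gold tokenisation, UAS and LAS reduce to simple per-token comparisons of the predicted head index and (head index, dependency relation); when the tokenisation differs, the CoNLL evaluation first has to align tokens, which is exactly where a treebank mismatch bites. A hypothetical gold-tokenisation sketch (not the actual evaluation script):

```python
def uas_las(gold, pred):
    """Compute unlabelled/labelled attachment scores.

    gold, pred: lists of (head_index, dep_relation) pairs, one per token,
    in the same token order (i.e. gold tokenisation is assumed).
    """
    assert len(gold) == len(pred) and gold
    # UAS: fraction of tokens whose head index matches the gold head
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    # LAS: head index AND dependency relation must both match
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```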
Regarding morphological features
Other language models did return morphological features. They are sometimes returned by spacy_parse in the tag field, and the xpos seems to be appended to the morphological features with the separator '__'.
Is that coming from spacyr or spaCy? My guess is that it comes from spaCy: in the discussion I had at https://github.com/explosion/spaCy/issues/1856, the authors mention that "The POS accuracy refers to joint prediction of the tag and morphological features",
indicating that they somehow pasted the XPOS and the morphological features together.
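If that is indeed what happens, a downstream consumer could recover the two pieces by splitting on the separator. A minimal sketch; the '__' separator and the tag layout are assumptions based on the observation above, not documented behaviour:

```python
def split_tag(tag):
    """Split a combined 'XPOS__Feat=Val|Feat=Val' tag into (xpos, feats).

    Assumes the '__' separator observed in the spacy_parse output;
    returns '_' for feats when no features are present (CoNLL-U style).
    """
    if "__" in tag:
        xpos, feats = tag.split("__", 1)
        return xpos, feats or "_"
    return tag, "_"
```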
Regarding LAS and UAS Is spacyr doing something on the content of the dep_rel or head_token_id when you get it from spaCy?
@kbenoit I've added overview graphs in the README of this repository to compare the numbers from spaCy/udpipe more easily. They seem to show that lemmatisation from spaCy is lacking. Is that something spacyr is responsible for?
We don't do any special things with the NLP part of spaCy in the spacyr package, but rather simply pass through what the Python calls return. This is true for lemmatisation as well. (But @amatsuo can confirm.)
In the overall comparisons, your code says that the same data was used to train udpipe as well as spacyr. But this is not really right, is it? You are in fact using the pre-trained models from spacyr. The comparisons might therefore be a bit unequal in that you are comparing udpipe performance on annotated texts used to train udpipe, versus spacyr performance on the same texts, not used to train spacyr. Probably a set of texts not used to train either would provide the best test of accuracy.
Ken is right about how we extract lemmas from spaCy output: we just extract the lemma_
attribute from each token.
https://github.com/quanteda/spacyr/blob/master/R/spacy_parse.R#L88
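On the Python side that pass-through amounts to nothing more than reading the lemma_ attribute per token. A sketch of the idea using a stand-in namedtuple instead of a real spaCy pipeline (so it runs without a model; in spaCy these would be Token objects):

```python
from collections import namedtuple

# Stand-in for a spaCy token; in spaCy, lemma_ is the attribute
# that spacyr extracts unchanged.
Token = namedtuple("Token", ["text", "lemma_"])

def extract_lemmas(doc):
    """Mirror what spacy_parse does: pass through lemma_ per token."""
    return [t.lemma_ for t in doc]

doc = [Token("cats", "cat"), Token("were", "be"), Token("sitting", "sit")]
assert extract_lemmas(doc) == ["cat", "be", "sit"]
```

So any weakness in the lemmatisation numbers would come from the spaCy models themselves, not from a transformation in spacyr.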
What strikes me is that the lemmatisation evaluation metrics for spaCy are so low. I can understand it if no lemmatisation is done (for non-English), but apparently for English, lemmatisation is also not working as I would have expected.
Regarding the data used for training/testing. The data from Universal Dependencies (e.g. Dutch: https://github.com/UniversalDependencies/UD_Dutch) consists of a train, dev and test set. For UDPipe, the train dataset was used to build the model, the dev dataset to tweak hyperparameters, and the test set was left out completely; that left-out test data is the one used in the evaluation. You can see the training code at https://github.com/ufal/udpipe/blob/master/training/models-ud-2.0/train.sh . These models used version 2.0 of the UD treebanks. The spaCy models were built at the end of 2017, so also on data from version 2.0 of the treebanks, as version 2.1 was only released in 2018. They also left out the test dataset when building their models, but apparently on their website they report numbers on the dev dataset (see https://github.com/explosion/spaCy/issues/1856). And yes, I didn't build these myself; I just took the models available from spaCy, as indicated in the README.
So to be brief: the dataset used in the evaluation script at https://github.com/jwijffels/udpipe-spacy-comparison/blob/master/udpipe-spacy-comparison.R was not used for training by either the UDPipe models or the spaCy models. And the training data used to build these models (version 2.0 of the UD treebanks) was the same, except for English.
For English, the UDPipe model was not built on version 2.0 of the UD_English treebank but on the latest 2.1 version (training code at https://github.com/bnosac/udpipe.models.ud/blob/master/src/english/train.R). The spaCy model is built on OntoNotes, so the comparison for English is basically seeing how well the spaCy model trained on OntoNotes does on the test data from UD_English, compared to how well the udpipe model built on the train data from UD_English does on that same test data.
So at least the comparison for non-English seems to be fair. For English it's hard to draw final conclusions. I would love to have OntoNotes available in CoNLL-U format to build a udpipe model upon, but unfortunately that data is not directly available.
Thanks for your work on this. It's actually really hard to compare a bunch of tools, which is why I've always had fewer comparisons than I really wanted. I've also not thanked @kbenoit and the other spacyr developers enough for their work :)
@jwijffels I think it might be better to just train a spaCy model for English, on the UD treebank. Comparing pre-trained models makes it very difficult to be sure the comparison is "fair".
Of course, at some point it comes down to exactly what question the evaluation is trying to answer. If we're evaluating the training algorithm, then it makes sense to carefully train on the same data. But often, users are interested in the accuracy of the model, for some task they're interested in. This latter question is really hard to evaluate well, because most downstream tasks actually aren't that sensitive to the difference in accuracy of two pretty decent systems.
When I first started working on spaCy, I went by the assumption that very few users would want to train their own models. That's definitely not true now (if it ever was...). So I think it does make sense to evaluate the algorithm, much more than the model --- because people can and do retrain or fine-tune on their own data.
$ spacy convert ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.conllu ~/data/ud-treebanks-conll2017/UD_English
$ spacy convert ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.conllu ~/data/ud-treebanks-conll2017/UD_English
$ spacy train en /tmp/en_ud2.0/ ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.json ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.json
This should be all you need to do to start training the English model, given the CoNLL 2017 data. I'm always hoping to find better hyper-parameters, especially to make the model faster --- but the defaults here should be okay.
Training starts slow because we begin at batch size 1, and increase to batch size 16, compounding by 0.1% per batch. After the first epoch, each epoch should complete in less than 10 minutes on a modest CPU.
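That batch-size schedule can be sketched as a simple generator. The exact constants and helper spaCy uses internally may differ; this just illustrates geometric compounding from 1 towards 16 at ~0.1% per batch:

```python
def compounding_batch_sizes(start=1.0, stop=16.0, rate=1.001):
    """Yield batch sizes growing by `rate` per batch, capped at `stop`.

    rate=1.001 corresponds to the ~0.1% per-batch growth mentioned above;
    it takes roughly 2800 batches to reach the cap of 16.
    """
    size = start
    while True:
        yield int(size)
        size = min(stop, size * rate)
```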
The tokenization results seem fairly crazy for spaCy. I'd say there's some difference in the pre-processing, probably to do with the fused tokens? Have you looked at the output?
I realised I wanted to compare against your numbers, so I trained Sequoia. Here's the log:
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 19772.895 0.000 79.209 0.000 0.000 0.000 95.935 100.000 4846.4 0.0
1 1550.396 0.000 83.285 0.000 0.000 0.000 96.435 100.000 4848.1 0.0
2 516.749 0.000 84.642 0.000 0.000 0.000 96.425 100.000 4476.1 0.0
3 235.267 0.000 85.615 0.000 0.000 0.000 96.445 100.000 4699.0 0.0
4 132.534 0.000 86.153 0.000 0.000 0.000 96.445 100.000 4258.8 0.0
5 74.082 0.000 86.849 0.000 0.000 0.000 96.455 100.000 4788.7 0.0
6 48.891 0.000 87.077 0.000 0.000 0.000 96.385 100.000 4603.6 0.0
7 42.295 0.000 87.014 0.000 0.000 0.000 96.365 100.000 4767.9 0.0
8 36.417 0.000 87.176 0.000 0.000 0.000 96.405 100.000 4737.0 0.0
9 33.480 0.000 87.121 0.000 0.000 0.000 96.435 100.000 4663.6 0.0
10 29.819 0.000 87.283 0.000 0.000 0.000 96.405 100.000 4840.6 0.0
11 27.350 0.000 87.529 0.000 0.000 0.000 96.465 100.000 4593.6 0.0
12 27.127 0.000 87.291 0.000 0.000 0.000 96.435 100.000 4022.5 0.0
13 25.144 0.000 87.125 0.000 0.000 0.000 96.415 100.000 1970.2 0.0
14 22.110 0.000 87.160 0.000 0.000 0.000 96.415 100.000 4436.5 0.0
15 22.651 0.000 87.459 0.000 0.000 0.000 96.405 100.000 2996.1 0.0
16 20.203 0.000 87.181 0.000 0.000 0.000 96.425 100.000 4731.2 0.0
17 18.955 0.000 87.275 0.000 0.000 0.000 96.385 100.000 4262.6 0.0
18 17.345 0.000 87.154 0.000 0.000 0.000 96.355 100.000 4818.9 0.0
19 17.272 0.000 87.063 0.000 0.000 0.000 96.335 100.000 4785.7 0.0
20 15.806 0.000 87.740 0.000 0.000 0.000 96.385 100.000 4852.3 0.0
21 16.941 0.000 87.984 0.000 0.000 0.000 96.345 100.000 4742.9 0.0
22 14.933 0.000 87.450 0.000 0.000 0.000 96.345 100.000 4797.7 0.0
23 13.954 0.000 87.834 0.000 0.000 0.000 96.395 100.000 4819.3 0.0
24 13.727 0.000 87.389 0.000 0.000 0.000 96.355 100.000 4787.7 0.0
25 13.441 0.000 87.245 0.000 0.000 0.000 96.315 100.000 4797.2 0.0
26 13.076 0.000 87.109 0.000 0.000 0.000 96.335 100.000 4666.6 0.0
27 11.823 0.000 87.547 0.000 0.000 0.000 96.335 100.000 4632.8 0.0
28 12.003 0.000 87.124 0.000 0.000 0.000 96.305 100.000 4310.5 0.0
29 11.089 0.000 87.309 0.000 0.000 0.000 96.365 100.000 4719.5 0.0
The column to watch is UAS against the development set, which peaks at 87.984 (round to 88.0). Your experiment has spaCy at 82% aligned, while UDPipe is at 84.7% aligned.
The big difference is that I'm training and evaluating on the gold-standard tokens and sentences here. I don't know how the CoNLL 2017 aligned accuracy works, but I notice there's a massive difference between spaCy's F1 and its aligned accuracy, while the gap is smaller for UDPipe. This suggests to me that something's gone wrong in the pre- or post-processing, possibly during the experiment, but equally likely during the training of our French model.
In this setting the aim is clearly to evaluate the training algorithm used to build the models. To my understanding, the UDPipe models as well as the spaCy models (except English) are built on version 2.0 of the UD treebanks (@honnibal, is that the case for the spaCy models for Dutch/French-Sequoia/Portuguese/Spanish-AnCora? Is the training code available somewhere?).
The exercise consists of
The point of taking the test set is to see how well the models work on unseen data. The point of taking the CoNLL 2017 evaluation script (https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py - copied into this repository for convenience) is to have a common external ground on which to compare the two methods, instead of showing internal measures of how good a model is over different iterations of the UDPipe/spaCy training. That conll17_ud_eval.py evaluation script measures how well a model works on plain text (a real-life scenario, not a 'gold' scenario where we already know how the words should be split into tokens).
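The script's token- and sentence-level scores boil down to F1 over character spans of gold versus system units. A minimal sketch of that idea (not the actual script, which additionally aligns words for the word-level metrics such as UPOS, lemma, UAS and LAS):

```python
def span_f1(gold, system):
    """Precision/recall/F1 over (start, end) character spans,
    in the spirit of how the CoNLL 2017 script scores tokens
    and sentences against gold segmentation."""
    gold, system = set(gold), set(system)
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

This is why a tokeniser that splits (or fuses) tokens differently from the treebank conventions gets penalised on every downstream metric, even if its tags and parses are good.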
Regarding pre-processing: the following is done.
Regarding post-processing: the following is done before the spacyr data.frame is put into CoNLL-U format.
Regarding English
If you think the pre/post-processing needs to be changed, feel free to run the same steps 1/2/3/4 as indicated above in Python, or just run the R code (udpipe-spacy-comparison.R) in this repository to reproduce the results and inspect the data.
Fascinating discussion and glad to see it taking place. We’re just a top-layer package (spacyr) but happy to help in any way if possible.
Note to self: follow-up by the spaCy authors at https://github.com/explosion/spaCy/issues/2011
To be fair to the excellent developers at spaCy, you might differentiate between our handling of their return objects (which come from spaCy as Python lists) and our R objects, which are coerced into data frames and, especially for the dependency parse, involve some reformatting. It's quite possible that we are to blame for some of the results, e.g.
But @amatsuo and I will investigate further.