Open kbenoit opened 6 years ago
Completely right. If spacyr does some things that I'm not aware of, feel free to report them.
Note that if you want to reproduce this with the R code in this repository, you need the latest version of udpipe, as it contains the function as_conllu,
which converts spacyr output to CoNLL-U format.
devtools::install_github("bnosac/udpipe", build_vignettes = TRUE)
Regarding the English evaluation, what I wanted to point out is that the UAS and LAS metrics there are misleading, probably due to the difference in treebanks used when building the models (spaCy: OntoNotes, udpipe: UD_English). That's why, for English, only the UPOS and XPOS measures seem relevant to compare.
Is spacyr doing anything different for English versus other language models?
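For context: over gold tokenisation, UAS and LAS reduce to simple per-token comparisons of the predicted head index and (head index, dependency relation); when the tokenisation differs, the CoNLL evaluation first has to align tokens, which is exactly where a treebank mismatch bites. A hypothetical gold-tokenisation sketch (not the actual evaluation script):

```python
def uas_las(gold, pred):
    """Compute unlabelled/labelled attachment scores.

    gold, pred: lists of (head_index, dep_relation) pairs, one per token,
    in the same token order (i.e. gold tokenisation is assumed).
    """
    assert len(gold) == len(pred) and gold
    # UAS: fraction of tokens whose head index matches the gold head
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    # LAS: head index AND dependency relation must both match
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```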
Regarding morphological features
Other language models did return morphological features. They are sometimes returned by spacy_parse in the tag field, and the xpos seems to be appended to the morphological features with the separator '__'.
Is that coming from spacyr or spaCy? My guess is that it comes from spaCy: in the discussion I had at https://github.com/explosion/spaCy/issues/1856, the authors mention that "The POS accuracy refers to joint prediction of the tag and morphological features",
indicating that they somehow pasted the XPOS and the morphological features together.
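If that is indeed what happens, a downstream consumer could recover the two pieces by splitting on the separator. A minimal sketch; the '__' separator and the tag layout are assumptions based on the observation above, not documented behaviour:

```python
def split_tag(tag):
    """Split a combined 'XPOS__Feat=Val|Feat=Val' tag into (xpos, feats).

    Assumes the '__' separator observed in the spacy_parse output;
    returns '_' for feats when no features are present (CoNLL-U style).
    """
    if "__" in tag:
        xpos, feats = tag.split("__", 1)
        return xpos, feats or "_"
    return tag, "_"
```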
Regarding LAS and UAS Is spacyr doing something on the content of the dep_rel or head_token_id when you get it from spaCy?
@kbenoit I've added overview graphs in the README of this repository to compare the numbers from spaCy/udpipe more easily. They seem to show that lemmatisation from spaCy is lacking. Is that something spacyr is responsible for?
We don't do any special things with the NLP part of spaCy in the spacyr package, but rather simply pass through what the Python calls return. This is true for lemmatisation as well. (But @amatsuo can confirm.)
In the overall comparisons, your code says that the same data was used to train udpipe as well as spacyr. But this is not really right, is it? You are in fact using the pre-trained models from spacyr. The comparisons might therefore be a bit unequal in that you are comparing udpipe performance on annotated texts used to train udpipe, versus spacyr performance on the same texts, not used to train spacyr. Probably a set of texts not used to train either would provide the best test of accuracy.
Ken is right about how we extract lemmas from spaCy output: we just extract the lemma_
attribute from each token.
https://github.com/quanteda/spacyr/blob/master/R/spacy_parse.R#L88
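On the Python side that pass-through amounts to nothing more than reading the lemma_ attribute per token. A sketch of the idea using a stand-in namedtuple instead of a real spaCy pipeline (so it runs without a model; in spaCy these would be Token objects):

```python
from collections import namedtuple

# Stand-in for a spaCy token; in spaCy, lemma_ is the attribute
# that spacyr extracts unchanged.
Token = namedtuple("Token", ["text", "lemma_"])

def extract_lemmas(doc):
    """Mirror what spacy_parse does: pass through lemma_ per token."""
    return [t.lemma_ for t in doc]

doc = [Token("cats", "cat"), Token("were", "be"), Token("sitting", "sit")]
assert extract_lemmas(doc) == ["cat", "be", "sit"]
```

So any weakness in the lemmatisation numbers would come from the spaCy models themselves, not from a transformation in spacyr.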
What strikes me is that the lemmatisation evaluation metrics for spaCy are so low. I can understand it if no lemmatisation is done (for non-English), but apparently for English, lemmatisation is also not working as I would have expected.
Regarding the data used for training/testing. The data from Universal Dependencies (e.g. Dutch: https://github.com/UniversalDependencies/UD_Dutch) consists of a train, dev and test set. For UDPipe, the train dataset was used to build the model, the dev dataset to tweak hyperparameters, and the test set was left out completely; that left-out test data is the one used in the evaluation. You can see the training code at https://github.com/ufal/udpipe/blob/master/training/models-ud-2.0/train.sh . These models used version 2.0 of the UD treebanks. The spaCy models were built at the end of 2017, so also on data from version 2.0 of the treebanks, as version 2.1 was only released in 2018. They also left out the test dataset when building their models, but apparently on their website they report numbers on the dev dataset (see https://github.com/explosion/spaCy/issues/1856). And yes, I didn't build these myself; I just took the models available from spaCy, as indicated in the README.
So to be brief: the dataset used in the evaluation script at https://github.com/jwijffels/udpipe-spacy-comparison/blob/master/udpipe-spacy-comparison.R was not used for training by either the UDPipe models or the spaCy models. And the training data used to build these models (version 2.0 of the UD treebanks) was the same, except for English.
For English, the UDPipe model was not built on version 2.0 of the UD_English treebank but on the latest 2.1 version (training code at https://github.com/bnosac/udpipe.models.ud/blob/master/src/english/train.R). The spaCy model is built on OntoNotes, so the comparison for English is basically seeing how well the spaCy model trained on OntoNotes does on the test data from UD_English, compared to how well the udpipe model built on the train data from UD_English does on that same test data.
So at least the comparison for non-English seems to be fair. For English it's hard to draw final conclusions. I would love to have OntoNotes available in CoNLL-U format to build a udpipe model upon, but unfortunately that data is not directly available.
Thanks for your work on this. It's actually really hard to compare a bunch of tools, which is why I've always had fewer comparisons than I really wanted. I've also not thanked @kbenoit and the other spacyr developers enough for their work :)
@jwijffels I think it might be better to just train a spaCy model for English, on the UD treebank. Comparing pre-trained models makes it very difficult to be sure the comparison is "fair".
Of course, at some point it comes down to exactly what question the evaluation is trying to answer. If we're evaluating the training algorithm, then it makes sense to carefully train on the same data. But often, users are interested in the accuracy of the model, for some task they're interested in. This latter question is really hard to evaluate well, because most downstream tasks actually aren't that sensitive to the difference in accuracy of two pretty decent systems.
When I first started working on spaCy, I went by the assumption that very few users would want to train their own models. That's definitely not true now (if it ever was...). So I think it does make sense to evaluate the algorithm, much more than the model --- because people can and do retrain or fine-tune on their own data.
$ spacy convert ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.conllu ~/data/ud-treebanks-conll2017/UD_English
$ spacy convert ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.conllu ~/data/ud-treebanks-conll2017/UD_English
$ spacy train en /tmp/en_ud2.0/ ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.json ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.json
This should be all you need to do to start training the English model, given the CoNLL 2017 data. I'm always hoping to find better hyper-parameters, especially to make the model faster --- but the defaults here should be okay.
Training starts slow because we begin at batch size 1, and increase to batch size 16, compounding by 0.1% per batch. After the first epoch, each epoch should complete in less than 10 minutes on a modest CPU.
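That batch-size schedule can be sketched as a simple generator. The exact constants and helper spaCy uses internally may differ; this just illustrates geometric compounding from 1 towards 16 at ~0.1% per batch:

```python
def compounding_batch_sizes(start=1.0, stop=16.0, rate=1.001):
    """Yield batch sizes growing by `rate` per batch, capped at `stop`.

    rate=1.001 corresponds to the ~0.1% per-batch growth mentioned above;
    it takes roughly 2800 batches to reach the cap of 16.
    """
    size = start
    while True:
        yield int(size)
        size = min(stop, size * rate)
```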
The tokenization results seem fairly crazy for spaCy. I'd say there's some difference in the pre-processing, probably to do with the fused tokens? Have you looked at the output?
I realised I wanted to compare against your numbers, so I trained Sequoia. Here's the log:
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 19772.895 0.000 79.209 0.000 0.000 0.000 95.935 100.000 4846.4 0.0
1 1550.396 0.000 83.285 0.000 0.000 0.000 96.435 100.000 4848.1 0.0
2 516.749 0.000 84.642 0.000 0.000 0.000 96.425 100.000 4476.1 0.0
3 235.267 0.000 85.615 0.000 0.000 0.000 96.445 100.000 4699.0 0.0
4 132.534 0.000 86.153 0.000 0.000 0.000 96.445 100.000 4258.8 0.0
5 74.082 0.000 86.849 0.000 0.000 0.000 96.455 100.000 4788.7 0.0
6 48.891 0.000 87.077 0.000 0.000 0.000 96.385 100.000 4603.6 0.0
7 42.295 0.000 87.014 0.000 0.000 0.000 96.365 100.000 4767.9 0.0
8 36.417 0.000 87.176 0.000 0.000 0.000 96.405 100.000 4737.0 0.0
9 33.480 0.000 87.121 0.000 0.000 0.000 96.435 100.000 4663.6 0.0
10 29.819 0.000 87.283 0.000 0.000 0.000 96.405 100.000 4840.6 0.0
11 27.350 0.000 87.529 0.000 0.000 0.000 96.465 100.000 4593.6 0.0
12 27.127 0.000 87.291 0.000 0.000 0.000 96.435 100.000 4022.5 0.0
13 25.144 0.000 87.125 0.000 0.000 0.000 96.415 100.000 1970.2 0.0
14 22.110 0.000 87.160 0.000 0.000 0.000 96.415 100.000 4436.5 0.0
15 22.651 0.000 87.459 0.000 0.000 0.000 96.405 100.000 2996.1 0.0
16 20.203 0.000 87.181 0.000 0.000 0.000 96.425 100.000 4731.2 0.0
17 18.955 0.000 87.275 0.000 0.000 0.000 96.385 100.000 4262.6 0.0
18 17.345 0.000 87.154 0.000 0.000 0.000 96.355 100.000 4818.9 0.0
19 17.272 0.000 87.063 0.000 0.000 0.000 96.335 100.000 4785.7 0.0
20 15.806 0.000 87.740 0.000 0.000 0.000 96.385 100.000 4852.3 0.0
21 16.941 0.000 87.984 0.000 0.000 0.000 96.345 100.000 4742.9 0.0
22 14.933 0.000 87.450 0.000 0.000 0.000 96.345 100.000 4797.7 0.0
23 13.954 0.000 87.834 0.000 0.000 0.000 96.395 100.000 4819.3 0.0
24 13.727 0.000 87.389 0.000 0.000 0.000 96.355 100.000 4787.7 0.0
25 13.441 0.000 87.245 0.000 0.000 0.000 96.315 100.000 4797.2 0.0
26 13.076 0.000 87.109 0.000 0.000 0.000 96.335 100.000 4666.6 0.0
27 11.823 0.000 87.547 0.000 0.000 0.000 96.335 100.000 4632.8 0.0
28 12.003 0.000 87.124 0.000 0.000 0.000 96.305 100.000 4310.5 0.0
29 11.089 0.000 87.309 0.000 0.000 0.000 96.365 100.000 4719.5 0.0
The column to watch is UAS against the development set, which peaks at 87.984 (round to 88.0). Your experiment has spaCy at 82% aligned, while UDPipe is at 84.7% aligned.
The big difference is that I'm training and evaluating on the gold-standard tokens and sentences here. I don't know how the CoNLL 2017 aligned accuracy works, but I notice there's a massive difference between spaCy's F1 and its aligned accuracy, while the gap is smaller for UDPipe. This suggests to me that something's gone wrong in the pre- or post-processing, possibly during the experiment, but equally likely during the training of our French model.
In this setting the aim is clearly to evaluate the training algorithm used to build the models. To my understanding, the UDPipe models as well as the spaCy models (except English) are built on version 2.0 of the UD treebanks (@honnibal, is that the case for the spaCy models for Dutch/French-Sequoia/Portuguese/Spanish-AnCora? Is the training code available somewhere?).
The exercise consists of
The point of taking the test set is to see how well the models work on unseen data. The point of taking the CoNLL 2017 evaluation script (https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py - copied into this repository for convenience) is to have a common external ground on which to compare the two methods, instead of showing internal measures of how good a model is over different iterations of the UDPipe/spaCy training. That conll17_ud_eval.py evaluation script measures how well a model works on plain text (a real-life scenario, not a 'gold' scenario where we already know how the words should be split into tokens).
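The script's token- and sentence-level scores boil down to F1 over character spans of gold versus system units. A minimal sketch of that idea (not the actual script, which additionally aligns words for the word-level metrics such as UPOS, lemma, UAS and LAS):

```python
def span_f1(gold, system):
    """Precision/recall/F1 over (start, end) character spans,
    in the spirit of how the CoNLL 2017 script scores tokens
    and sentences against gold segmentation."""
    gold, system = set(gold), set(system)
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

This is why a tokeniser that splits (or fuses) tokens differently from the treebank conventions gets penalised on every downstream metric, even if its tags and parses are good.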
Regarding pre-processing: the following is done.
Regarding post-processing: the following is done before the spacyr data.frame is put into CoNLL-U format.
Regarding English
If you think the pre/post-processing needs to be changed, feel free to run the same steps 1/2/3/4 as indicated above in Python, or just run the R code (udpipe-spacy-comparison.R) in this repository to reproduce the results and inspect the data.
Fascinating discussion and glad to see it taking place. We’re just a top-layer package (spacyr) but happy to help in any way if possible.
Note to self: follow-up by the spaCy authors at https://github.com/explosion/spaCy/issues/2011
To be fair to the excellent developers at spaCy, you might differentiate between our handling of their return objects (which come from spaCy as Python lists) and our R objects, which are coerced into data frames and, especially for the dependency parse, involve some reformatting. It's quite possible that we are to blame for some of the results, e.g.
But @amatsuo and I will investigate further.