Potential degradation in more recent spaCy versions

Hi Richard,

I've been doing some tests comparing the performance of neuralcoref (on older version of Python/spaCy) with coreferee for English, and I'm noticing some rather concerning degradations in performance with newer versions of coreferee. I'm not ready to share the comparison report for the neuralcoref/coreferee yet -- the data and tests need to be cleaned up, but in the interim, I've been inspecting coreferee's coreference chains across the following versions (both using coreferee 1.2.0)

spaCy 3.2.4, with en_core_web_md and en_core_web_lg
spaCy 3.3.1, with en_core_web_md and en_core_web_lg

I tried generating chains for the below sentences:

Victoria Chen, a well-known business executive, says she is 'really honoured' to see her pay jump to $2.3 million, as she became MegaBucks Corporation's first female executive. Her colleague and long-time business partner, Peter Zhang, says he is extremely pleased with this development. The firm's CEO, Lawrence Willis will be onboarding the new CFO in a few months. He said he is looking forward to the whole experience.

spaCy 3.2.4, `en_core_web_md`

▶ python test_coref.py
Loaded spaCy language model: en_core_web_md
0: Chen(1), she(11), her(19), she(28), Her(37)
1: Corporation(31), firm(59)
2: Zhang(47), he(50)
3: Willis(64), He(76), he(78)
None

spaCy 3.3.1, `en_core_web_md`

▶ python test_coref.py
Loaded spaCy language model: en_core_web_md
0: Chen(1), she(11), her(19), she(28), Her(37)
1: Corporation(31), firm(59)
2: Zhang(47), he(50), He(76), he(78)
None

spaCy 3.2.4, `en_core_web_lg`

▶ python test_coref.py
Loaded spaCy language model: en_core_web_lg
0: Chen(1), she(11), her(19), she(28), Her(37)
1: Corporation(31), firm(59)
2: colleague(38), he(50), He(76), he(78)
None

spaCy 3.3.1, `en_core_web_lg`

▶ python test_coref.py
Loaded spaCy language model: en_core_web_lg
0: Chen(1), she(11), her(19), she(28), Her(37)
1: Corporation(31), firm(59)
2: colleague(38), he(50)
3: Willis(64), He(76), he(78)
None

In both cases, the en_core_web_lg language model returns a result that's considerably worse than the en_core_web_md model, which is itself quite surprising. I'd expect that the dependency parse from the large model would be far superior to the medium model, and so should not produce this noticeably different a result. As can be seen, the en_core_web_lg model is missing entire named entities altogether, and the total number of results in the chain is lower than what we get from the medium model.

Observation

The best result (in which we capture all three named entities -- "Chen", "Zhang" and "Willis" in the coref chain) is obtained with the smallest (en_core_web_md) model in spaCy 3.2.4, and not the newest version with the largest model, which is rather counter-intuitive.

I understand that the most general guideline you can offer is that these sorts of examples are single cases, and that statistically, the models should be more or less comparable. But that's definitely not true in the case of my own private tests (which I will attempt to share shortly) -- in my tests, in which I perform a range of tasks, including parsing, named entity recognition, coreference resolution and gender identification across a dataset of ~100 news articles, I am noticing a recognizable drop in coreferee performance across both these dimensions:

spaCy version (3.3.1 performs worse than 3.2.4, comparing the medium and large models head to head)
Language model size (Medium performs better than large, w.r.t. coreference results)

Again, I fully understand that the one-off example I gave above might seem that it's indeed one-off, but I was wondering if there's something you've noticed in terms of accuracy numbers in your tests. My concern is that the rules for coreferee's English version are not carrying over well with the new spaCy models, particularly in v3.3.x, potentially due to whatever internal changes were made to the language models in the recent release.

The issue comparing neuralcoref (whose performance also seems to be better than coreferee) is a totally different one, and is unrelated to this one I've posted. I'll do my best to clean up my comparison tests of neuralcoref and coreferee and document them (I'm currently trying to separate the different functions I'm performing for my own project, so that I document only the coreference resolution results as clearly as possible. Looking forward to hearing your thoughts!

richardpaulhudson / coreferee