[Closed] prrao87 closed this issue 1 year ago.
Hi @prrao87, thanks for this detailed write-up! However, if I've understood correctly, the differences between models you cite are wholly caused by whether or not "Zhang" and "Willis" are recognised as named entities by the standard spaCy models in question. This may or may not be symptomatic of a wider issue, but even if it is a wider issue, it is not one that can be addressed in Coreferee, but rather in spaCy. Perhaps you could ask the question under https://github.com/explosion/spaCy/issues?
Hi Richard,
I've been doing some tests comparing the performance of neuralcoref (on an older version of Python/spaCy) with coreferee for English, and I'm noticing some rather concerning degradations in performance with newer versions of coreferee. I'm not ready to share the neuralcoref/coreferee comparison report yet -- the data and tests need to be cleaned up -- but in the interim, I've been inspecting coreferee's coreference chains with `en_core_web_md` and `en_core_web_lg` across the following combinations (all using coreferee 1.2.0):

- spaCy 3.2.4, `en_core_web_md`
- spaCy 3.3.1, `en_core_web_md`
- spaCy 3.2.4, `en_core_web_lg`
- spaCy 3.3.1, `en_core_web_lg`

I tried generating chains for the below sentences:
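For reference, the chains were generated with Coreferee's standard pipeline setup. A minimal sketch of that setup (the helper function and the fallback for missing installs are my additions, assuming Coreferee's documented `doc._.coref_chains` API):

```python
def coref_chains(model_name: str, text: str):
    """Run Coreferee on `text` under the given spaCy model and return
    doc._.coref_chains, or None if spacy, coreferee, or the model
    isn't available in this environment."""
    try:
        import coreferee  # noqa: F401 -- registers the 'coreferee' pipe factory
        import spacy
        nlp = spacy.load(model_name)
        nlp.add_pipe("coreferee")
    except Exception:
        return None
    return nlp(text)._.coref_chains

# Usage:
# chains = coref_chains("en_core_web_lg", sentence)
# if chains is not None:
#     chains.print()
```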
In both cases, the `en_core_web_lg` language model returns a result that's considerably worse than the `en_core_web_md` model, which is itself quite surprising. I'd expect the dependency parse from the large model to be far superior to the medium model's, so it shouldn't produce this noticeably different a result. As can be seen, the `en_core_web_lg` model is missing entire named entities altogether, and the total number of results in the chain is lower than what we get from the medium model.

**Observation**
The best result (in which we capture all three named entities -- "Chen", "Zhang" and "Willis" -- in the coref chain) is obtained with the smaller (`en_core_web_md`) model in spaCy 3.2.4, and not the newest spaCy version with the largest model, which is rather counter-intuitive.

I understand that the most general guideline you can offer is that these sorts of examples are single cases, and that statistically, the models should be more or less comparable. But that's definitely not true in my own private tests (which I will attempt to share shortly): in those tests, I perform a range of tasks -- parsing, named entity recognition, coreference resolution and gender identification -- across a dataset of ~100 news articles, and I'm noticing a recognizable drop in coreferee performance across both these dimensions:
Again, I fully understand that the example I gave above might indeed seem like a one-off, but I was wondering whether you've noticed anything similar in the accuracy numbers from your own tests. My concern is that the rules in coreferee's English module are not carrying over well to the new spaCy models, particularly in v3.3.x, potentially due to whatever internal changes were made to the language models in the recent release.
The comparison with neuralcoref (whose performance also seems to be better than coreferee's) is a totally different issue, and is unrelated to the one I've posted here. I'll do my best to clean up my comparison tests of neuralcoref and coreferee and document them (I'm currently trying to separate out the different functions I'm performing for my own project, so that I can document only the coreference resolution results as clearly as possible). Looking forward to hearing your thoughts!
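For concreteness, here's the kind of chain-level score such a comparison could report. This is a simplified, hypothetical sketch -- the representation of each chain as a list of mention token-index tuples is just one way to normalise the outputs of both libraries, not something from my actual test code:

```python
def mention_recall(pred_chains, gold_chains):
    """Fraction of reference mentions that also appear in the predicted
    chains. Each chain is a list of mentions; each mention is a tuple of
    token indexes, so the score ignores how mentions are grouped into chains."""
    gold = {tuple(m) for chain in gold_chains for m in chain}
    pred = {tuple(m) for chain in pred_chains for m in chain}
    if not gold:
        return 1.0
    return len(gold & pred) / len(gold)

# e.g. a model recovering only two of three reference mentions
# scores 2/3 on that document:
# mention_recall([[(0,), (5,)]], [[(0,), (5,), (9,)]])  -> 0.666...
```

Averaging a score like this over the ~100 articles would make the md-vs-lg (and 3.2.4-vs-3.3.1) degradation quantifiable rather than anecdotal.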