inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Language/disciplines distribution? #163

Open silviaegt opened 3 years ago

silviaegt commented 3 years ago

I read in your API documentation that your default parser model is based on this core data set and was wondering whether you have any information on its distribution across disciplines or languages?

I ran two language detectors through that corpus and thought the results could be interesting:

| lang | cld2 | cld3 |
|------|------|------|
| en   | 1276 | 955  |
| NA   | 256  | 303  |
| fr   | 63   | 69   |
| de   | 29   | 33   |
| la   | 12   |      |
| it   | 11   | 16   |
| zu   | 8    |      |
| sr   | 5    |      |
| es   | 4    | 6    |
inukshuk commented 3 years ago

Interesting, thanks!

Which language detection libraries did you use, and how confident are you in the results? I'm curious because we have been using this unmaintained Gem for optional language detection. At the time it had much better results than the other language detectors we tested, but there may be more choices available nowadays, and perhaps we could switch to one of them.

The core data set was originally based on the CORA data set but was heavily edited and amended over the course of a number of different projects and development phases. Most of the recent changes use data that was parsed on anystyle.io, selected on the basis of features that required testing or refinement at the time.

To get a better picture of the distribution you might also want to parse the gold data set, which we use to test new versions of the model for regressions; we've regularly moved references between the two sets.

silviaegt commented 3 years ago

I used cld3 and cld2 in R. I would say I'm 70% confident, but I would have to run other experiments to confirm. I have just compared them with spaCy's language detection algorithm and they seem to be better. It varies from language to language: e.g., 3 of the 6 strings recognized as Spanish by cld3 are really spa, but 4/4 with cld2. Now that you mention CORA, I was under the impression, because of this paper, that this database was mostly Computer Science papers, but when I look at your XML I see articles such as:

And it makes me wonder....

If I may ask, how was that other "gold set" formed?
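The per-language spot check described above can be scripted; here is a minimal Python sketch of the precision computation (the label lists are made up for illustration and only mirror the es counts mentioned above, they are not the actual corpus data):

```python
from collections import defaultdict

def per_language_precision(predicted, truth):
    """For each predicted language, the fraction of predictions
    that match the ground-truth label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, true in zip(predicted, truth):
        totals[pred] += 1
        if pred == true:
            hits[pred] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Hypothetical labels: cld3 flags 6 strings as Spanish, 3 correctly (3/6);
# cld2 flags 4 as Spanish, all correctly (4/4), as in the comment above.
truth = ["es", "es", "es", "en", "fr", "en", "es", "en"]
cld3  = ["es", "es", "es", "es", "es", "es", "fr", "en"]
cld2  = ["es", "es", "es", "en", "fr", "en", "es", "en"]

print(per_language_precision(cld3, truth)["es"])  # 0.5
print(per_language_precision(cld2, truth)["es"])  # 1.0
```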

inukshuk commented 3 years ago

It's easily possible that by now only a small fraction of the original data set remains. If I remember correctly, we had only between 300 and 500 entries from CORA at the start. It's probably fewer than that today, and the set has over 1,500 references, so the legacy of the CORA set is probably not very strong anymore.

The gold set simply emerged while working on the parser over time: it should contain only manually curated references (i.e., at least one contributor must have looked at the reference and deemed it correct; it's probably not perfect, but it's a good data set to use to test new models for regressions). As with the core set, the source is references that were parsed on anystyle.io and marked as eligible for training and selected pretty much randomly during development.
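The regression use of the gold set described above can be sketched as a token-level comparison between a model's predicted labels and the curated labels. This is an illustrative Python sketch, not AnyStyle's actual test harness; the label names and reference data are hypothetical:

```python
def token_accuracy(gold_refs, predicted_refs):
    """Fraction of tokens whose predicted label matches the gold label.

    Each reference is a list of (token, label) pairs; both inputs must
    contain the same tokens in the same order."""
    correct = total = 0
    for gold, pred in zip(gold_refs, predicted_refs):
        for (tok_g, lab_g), (tok_p, lab_p) in zip(gold, pred):
            assert tok_g == tok_p, "token streams must align"
            total += 1
            correct += lab_g == lab_p
    return correct / total if total else 0.0

# Hypothetical gold reference and a model's output for the same tokens.
gold = [[("Derrida,", "author"), ("J.", "author"), ("(1967)", "date"),
         ("De", "title"), ("la", "title"), ("grammatologie.", "title")]]
pred = [[("Derrida,", "author"), ("J.", "author"), ("(1967)", "date"),
         ("De", "title"), ("la", "publisher"), ("grammatologie.", "title")]]

print(token_accuracy(gold, pred))  # 5 of 6 labels match
```

A drop in this score between an old and a new model on the gold set would flag a regression.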

mjy commented 3 years ago

I wonder if this type of explanation is worth documenting somewhere, institutional knowledge style? Maybe gold needs its own repo, README etc.

inukshuk commented 3 years ago

A high-level documentation of how the parser and finder models work would certainly benefit anyone looking to create their own models or contribute new functionality. There are some issue threads around here that go into some detail, but a dedicated write-up would certainly be worth it. It would also be helpful to document best practices for training new models (e.g., using gold for regressions, how to quickly compare/diff the parse results of two models, and so on).
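One quick way to compare/diff the parse results of two models, as suggested above, is to render each labelled token on its own line and run a standard text diff over the result. A minimal Python sketch using the standard library's difflib (the token/label data is invented for illustration):

```python
import difflib

def render(ref):
    """One 'token<TAB>label' line per token, so diffs align on tokens."""
    return [f"{tok}\t{lab}" for tok, lab in ref]

# Hypothetical outputs of two models for the same reference.
model_a = [("Smith,", "author"), ("J.", "author"),
           ("2020.", "date"), ("Title.", "title")]
model_b = [("Smith,", "author"), ("J.", "author"),
           ("2020.", "title"), ("Title.", "title")]

# Only the disagreement on the '2020.' token shows up in the diff.
for line in difflib.unified_diff(render(model_a), render(model_b),
                                 fromfile="model-a", tofile="model-b",
                                 lineterm=""):
    print(line)
```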

Curating and documenting the core, gold, or additional data sets is a separate endeavor I'm happy to support if anyone wants to make the effort. To give you an idea of the data not currently in the repository: we have more than 17k references that were uploaded to anystyle.io for training last year. In my experience the quality of these submissions varies greatly, because the uploads are uncoordinated and have little consistency (that said, it's much easier to select suitable data from such a pool than to create it from scratch, so they are invaluable when working on the parser).