dhimmel closed 3 years ago
Whether multiple matches can occur or only the top scoring match is returned?
I can clarify the above things in the manuscript. I'll leave the issue open until that's done.
Thanks @ad48 for those explanations! By the way, feel free to respond to my questions here like you did, but it is also okay if you just address the issues in the paper or docs. Just mention this issue in the commit or leave a comment with the relevant commit.
Updates in c30b755619bf0e9517f848827911797de5e37275. Regarding "some other numerical data", I think it's worth listing out all the features in the model. Thus far I'm aware of the following:
Are there others?
Given that this is perhaps the core technical contribution of the paper, it would be good to have a bit more detail.
I also see
ArXiv preprints with known DOIs are used to train a simple logistic regression to find the correct candidate for each query.
Note that rebuilding the training dataset relies on external APIs and can be a very slow process
How big is the arXiv to Crossref DOI dataset? Is there any reason you can't archive and share it? It would be helpful if anyone wants to tweak the scoring model.
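To make the question concrete, here is a minimal sketch of what "train a simple logistic regression to find the correct candidate for each query" could look like. It is not the project's actual code. The two features (title similarity and year gap), the toy records, the placeholder DOIs, and every function name are assumptions made up for illustration. The real model may use different features and a library implementation.

```python
import math
from difflib import SequenceMatcher

# Toy stand-in for "arXiv preprints with known DOIs": (preprint, published record).
# Titles, years and DOIs are invented placeholders, not real data.
KNOWN_PAIRS = [
    ({"title": "Deep learning for protein folding", "year": 2012},
     {"title": "Deep learning for protein folding", "year": 2013, "doi": "placeholder-doi-1"}),
    ({"title": "A survey of graph databases", "year": 2012},
     {"title": "Survey of graph databases", "year": 2012, "doi": "placeholder-doi-2"}),
    ({"title": "Quantum error correction with small codes", "year": 2012},
     {"title": "Quantum error correction using small codes", "year": 2014, "doi": "placeholder-doi-3"}),
]


def features(query, candidate):
    """Hypothetical feature vector: [bias, title similarity, year gap]."""
    sim = SequenceMatcher(None, query["title"].lower(), candidate["title"].lower()).ratio()
    gap = abs(query["year"] - candidate["year"])
    return [1.0, sim, float(gap)]


def sigmoid(z):
    z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def build_training_rows(pairs):
    """Each preprint paired with its true record is a positive; all other pairings are negatives."""
    rows, labels = [], []
    for i, (preprint, _) in enumerate(pairs):
        for j, (_, published) in enumerate(pairs):
            rows.append(features(preprint, published))
            labels.append(1.0 if i == j else 0.0)
    return rows, labels


def train(rows, labels, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(dot(w, x)) - y
            for k, xk in enumerate(x):
                grad[k] += err * xk
        w = [wk - lr * gk / n for wk, gk in zip(w, grad)]
    return w


def best_candidate(w, query, candidates):
    """Score every candidate and return (score, candidate) for the top one only."""
    scored = [(sigmoid(dot(w, features(query, c))), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


weights = train(*build_training_rows(KNOWN_PAIRS))
query = {"title": "A survey of graph databases", "year": 2012}
score, match = best_candidate(weights, query, [pub for _, pub in KNOWN_PAIRS])
print(match["doi"], round(score, 3))
```

In this sketch only the top-scoring candidate is returned. Returning every candidate above a score threshold instead would be a one-line change in `best_candidate`, which is why it matters which of the two the paper does.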
I think it's worth listing out all the features in the model
I have added a part to the paper which shows this. Still editing, not pushed changes yet.
How big is the arXiv to Crossref DOI dataset?
The one we use for this is around 40-50k papers from 2012. The exact number varies depending on data-cleaning steps and...
Is there any reason you can't archive and share it? It would be helpful if anyone wants to tweak the scoring model.
The main reason is simply that preprints get published all the time (even old ones), so any dataset would go out of date.
The dataset has a few use-cases besides our rejected article tracker (DOI-resolution for preprints, duplicate article detection) and so we included all the code to rebuild it since that would allow people to customise it to a different use case (or tweak the model).
We could still potentially do this in the interest of speed, though. I'll look into it.
Thanks again for your help. I have updated the paper to address your requests.
Okay, will take a look at a55182c25c61782941d4174c77522de686893766 and 94ab0918d965437b4919ab60206e7762c40b77e8 after dinner (mom says I need to eat now).
The new "How the matching algorithm works" section is helpful here! Thanks.
I'm left wondering: