Paper or REAMDE: provide more details on the matching algorithm

dhimmel commented 3 years ago

I'm left wondering:

Whether multiple matches can occur or only the top scoring match is returned?
How is the match score computed?
Does the match have to be exact?
How is self-plagiarism going to be detected if users don't supply the full text?

ad48 commented 3 years ago

Whether multiple matches can occur or only the top scoring match is returned?

it can be set up to do either but, at the moment, it's only set to return 1 article (returning all possible matches might be better - we'll think about changing that). How is the match score computed?
it's a simple logistic regression of the levenshtein distance between the 2 titles and some other numerical data (e.g. 1 for all author-names matching, 0 when the author list doesn't match etc). Does the match have to be exact?
It's essentially a form of fuzzy-matching that we are doing here. So it doesn't have to be exact. How is self-plagiarism going to be detected if users don't supply the full text?
We would only think of this as one stage in the detection process. Start by finding similar titles/authors with a tool like this and then look more closely. How is self-plagiarism going to be detected if users don't supply the full text?

I can clarify the above things in the manuscript. I'll leave the issue open until that's done.

danielskatz commented 3 years ago

in https://github.com/openjournals/joss-reviews/issues/3348

dhimmel commented 3 years ago

Thanks @ad48 for those explanations! By the way, feel free to respond to my questions here like you did, but it is also okay if you just address the issues in the paper or docs. Just mention this issue in the commit or leave a comment with the relevant commit.

dhimmel commented 3 years ago

Updates in c30b755619bf0e9517f848827911797de5e37275. Regarding "some other numerical data", I think it's worth listing out all the features in the model. Thus far I'm aware of the following:

title distance (via Levenshtein distance / fuzz.ratio), continuous
exact author match, binary

Are there others?

Given how this is perhaps the core technical contribution of the paper, would be good to have a bit more detail.

dhimmel commented 3 years ago

I also see

ArXiv preprints with known DOIs are used to train a simple logistic regression to find the correct candidate for each query.

Note that rebuilding the training dataset relies on external APIs and can be a very slow process

How big is the arXiv to Crossref DOI dataset? Is there any reason you can't archive and share it? Would be helpful if anyone want to tweak the scoring model?

ad48 commented 3 years ago

I think it's worth listing out all the features in the model

I have added a part to the paper which shows this. Still editing, not pushed changes yet.

How big is the arXiv to Crossref DOI dataset?

The one we use for this is around 40-50k papers from 2012. The exact number varies depending on data-cleaning steps and...

Is there any reason you can't archive and share it? Would be helpful if anyone want to tweak the scoring model?

The main reason is simply that preprints get published all the time (even old ones), so any dataset would go out of date.

The dataset has a few use-cases besides our rejected article tracker (DOI-resolution for preprints, duplicate article detection) and so we included all the code to rebuild it since that would allow people to customise it to a different use case (or tweak the model).

We could still potentially do this in the interest of speed, though. I'll look into it.

ad48 commented 3 years ago

Thanks again for your help. I have updated the paper to address your requests.

dhimmel commented 3 years ago

Okay, will take a look at a55182c25c61782941d4174c77522de686893766 and 94ab0918d965437b4919ab60206e7762c40b77e8 after dinner (mom says I need to eat now).

dhimmel commented 3 years ago

The new "How the matching algorithm works" section is helpful here! Thanks.

sagepublishing / rejected_article_tracker_pkg

Paper or REAMDE: provide more details on the matching algorithm #7