Closed by jonathansick 10 years ago
The problem is probably that our difflib-based fuzzy matching on the title/author/abstract fields is too aggressive. We can tune this algorithm, and perhaps add extra logic.
Here is a challenging pair to distinguish: 2013ApJ...772...79A and 2013arXiv1307.7153A. In this case the longer title is meaningful, but perhaps it is too hard.
Thanks @jgizis. I think the key is to be more conservative than we already are; better to have a duplicate than lose an unrelated article.
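To make "more conservative" concrete, here is a rough sketch of what stricter matching could look like. This is not the actual ADS to BibDesk code: the function names, the dict-based record layout, and the 0.95 threshold are all assumptions for illustration. The idea is to call two records duplicates only when every compared field is present in both and nearly identical, and to fall back to "not a duplicate" whenever there is doubt.

```python
from difflib import SequenceMatcher

# Fields compared for fuzzy matching (the ones named in this thread).
FIELDS = ("title", "author", "abstract")


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_probable_duplicate(new, existing, threshold=0.95):
    """Hypothetical conservative check: every field must match closely.

    `new` and `existing` are assumed to be plain dicts keyed by field name.
    The threshold value is an illustrative guess, not a tuned number.
    """
    for field in FIELDS:
        a, b = new.get(field), existing.get(field)
        # A missing field counts as "cannot tell", which we treat as
        # "not a duplicate" so nothing gets clobbered.
        if not a or not b:
            return False
        if similarity(a, b) < threshold:
            return False
    return True
```

Requiring all fields to agree, rather than any one of them, trades a few extra duplicates in the library for a much lower chance of overwriting an unrelated paper by the same authors.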
@paxperscientiam Good news, I think PR33 might have fixed this. Clone the master branch and try it out on the command line, if you like.
git clone https://github.com/jonathansick/ads_bibdesk.git
cd ads_bibdesk
python setup.py install
adsbibdesk 2010AJ....140..897R
adsbibdesk 2014arXiv1401.0722R
I'll hold on calling it closed, though.
@jonathansick Thanks for the instructions. Git(hub) is completely new to me, though I'm fairly comfortable with the command line (big macports fan here).
Anyhoo...it works! Awesome job.
I also tried the second false duplicate pair I cited on Twitter -- 2013ApJ...767L...1L and 2014ApJ...786L..18L -- and again the behavior is now as expected.
Are there any other tests you would have me run in order to help you confirm that this false duplicate bug is resolved?
Edit: I also tried the pair Prof. Gizis suggested, but that didn't work. Well, at least I don't think so. It gives me a notification that it downloaded, but instead of downloading the preprint, it downloads the refereed version. However, this may have always been the intended behavior.
@paxperscientiam At this point I just need to read the git diffs to see what happened. I haven't maintained ADS to BibDesk for a year, so the PRs piled up.
@paxperscientiam Regarding downloading the refereed version instead of the pre-print -- yes, that is our intended behaviour.
Reports of papers being improperly detected as duplicates and thus clobbering each other. E.g.: