jonathansick / ads_bibdesk

(Unmaintained) Mac OS X service for frictionless import of NASA ADS and arXiv publications into BibDesk.
GNU General Public License v3.0

False duplicate detections #42

Closed jonathansick closed 10 years ago

jonathansick commented 10 years ago

Reports of papers being improperly detected as duplicates and thus clobbering each other. E.g.:

References:

jonathansick commented 10 years ago

The problem is probably that our difflib-based fuzzy matching on the title/author/abstract fields is too aggressive. We can tune this algorithm, and perhaps add extra logic.
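To illustrate how a loose cutoff can cause this, here is a minimal sketch of scoring two titles with `difflib.SequenceMatcher`. The titles, the `similarity` helper, and both threshold values are made up for illustration; they are not the ones ads_bibdesk actually uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two hypothetical titles for different papers that differ by one word.
title_a = "Star formation in nearby dwarf galaxies"
title_b = "Star formation in nearby spiral galaxies"

# Hypothetical cutoffs: a loose one and a stricter one.
LOOSE = 0.8
STRICT = 0.95

score = similarity(title_a, title_b)

# With the loose cutoff the two distinct papers look like duplicates;
# with the strict cutoff they are kept apart.
print("score: %.3f" % score)
print("duplicate at LOOSE cutoff: ", score >= LOOSE)
print("duplicate at STRICT cutoff:", score >= STRICT)
```

The score for these two titles falls between the two cutoffs, so which one is chosen decides whether the second paper overwrites the first.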

jgizis commented 10 years ago

For a pair that is challenging to distinguish, here are 2013ApJ...772...79A and 2013arXiv1307.7153A. In this case the longer title is meaningful, but perhaps it is too hard.

jonathansick commented 10 years ago

Thanks @jgizis. I think the key is to be more conservative than we already are; better to have a duplicate than lose an unrelated article.
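One way to be more conservative, sketched here under assumptions: only call two records duplicates when every compared field is a near-exact match, and keep both whenever a field is missing. The `is_duplicate` function, the dictionary keys, and the 0.95 threshold are hypothetical and do not describe the current ads_bibdesk code.

```python
from difflib import SequenceMatcher


def field_ratio(a, b):
    """Similarity ratio (0..1) of two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_duplicate(new, existing,
                 fields=("title", "author", "abstract"),
                 threshold=0.95):
    """Return True only if every field in `fields` is a near-exact match.

    `new` and `existing` are plain dicts of publication metadata
    (a hypothetical record layout used only for this sketch).
    """
    for field in fields:
        a = new.get(field, "")
        b = existing.get(field, "")
        if not a or not b:
            # Missing data: keep both records rather than risk overwriting.
            return False
        if field_ratio(a, b) < threshold:
            return False
    return True


paper_a = {
    "title": "Star formation in nearby dwarf galaxies",
    "author": "Doe, J.",
    "abstract": "We present a survey of star formation rates.",
}
# Same author and abstract text, but a different title.
paper_b = dict(paper_a, title="Star formation in nearby spiral galaxies")

print(is_duplicate(paper_a, dict(paper_a)))  # identical records
print(is_duplicate(paper_a, paper_b))        # titles differ
```

Requiring all fields to agree errs on the side of a harmless extra entry in the library instead of a lost article.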

jonathansick commented 10 years ago

@paxperscientiam Good news, I think PR33 might have fixed this. Clone the master branch and try it out on the command line, if you like.

git clone https://github.com/jonathansick/ads_bibdesk.git
cd ads_bibdesk
python setup.py install
adsbibdesk 2010AJ....140..897R
adsbibdesk 2014arXiv1401.0722R

I'll hold on calling it closed, though.

paxperscientiam commented 10 years ago

@jonathansick Thanks for the instructions. Git(hub) is completely new to me, though I'm fairly comfortable with the command line (big macports fan here).

Anyhoo...it works! Awesome job.

I also tried the second false duplicate pair I cited on twitter -- 2013ApJ...767L...1L and 2014ApJ...786L..18L -- and again the behavior is now as expected.

Are there any other tests you would have me run in order to help you confirm that this false duplicate bug is resolved?

Edit: I also tried the pair Prof. Gizis suggested, but that didn't work. Well, at least I don't think so. It gives me a notification that it downloaded, but instead of downloading the preprint, it downloads the refereed version. However, this may have always been the intended behavior.

jonathansick commented 10 years ago

@paxperscientiam At this point I just need to read the git diffs to see what happened. I haven't maintained ADS to BibDesk for a year, so the PRs piled up.

jonathansick commented 10 years ago

@paxperscientiam Regarding downloading the refereed version instead of the pre-print -- yes, that is our intended behaviour.