Continued problems with accented names

keflavich commented 11 years ago

The tmpfile approach was pretty good, but there are still names that cause problems. For example, this team has written a series of papers: http://adsabs.harvard.edu/abs/2012A%26A...541A..63P with both Philippe Andre and Vera Konyves as coauthors. There is an umlaut over the o in Konyves and an acute accent (accent aigu) over the e in Andre. These are passed as escaped \' and \" in the BibTeX. Any idea how these should be sanitized / passed to osascript? Can osascript (or bash) take Unicode?
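One possible way to sanitize the escapes is to convert them to Unicode characters before the text is handed on. The following is only a sketch under that assumption: the function name `bibtex_accents_to_unicode` is hypothetical and not part of adsbibdesk, only the two accents mentioned above are covered, and it is written for Python 3. Whether osascript then accepts the result is a separate question that this sketch does not answer.

```python
import re
import unicodedata

# Map the two BibTeX accent commands discussed above to Unicode combining marks.
COMBINING = {
    "'": "\u0301",  # acute accent, as in \'e
    '"': "\u0308",  # diaeresis (umlaut), as in \"o
}

# Three common spellings, tried in order: {\'e}, \'{e}, and bare \'e.
ACCENT_RE = re.compile(
    r"""\{\\(['"])([A-Za-z])\}"""
    r"""|\\(['"])\{([A-Za-z])\}"""
    r"""|\\(['"])([A-Za-z])"""
)


def bibtex_accents_to_unicode(text):
    """Replace \\' and \\" accent escapes with precomposed Unicode letters."""
    def repl(match):
        # Exactly one alternative matched, so exactly two groups are set.
        accent, letter = [g for g in match.groups() if g is not None]
        # NFC composes letter + combining mark into a single code point.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return ACCENT_RE.sub(repl, text)


if __name__ == "__main__":
    print(bibtex_accents_to_unicode("Andr\\'e and K{\\\"o}nyves"))
```

Other accent commands (grave, circumflex, cedilla and so on) would need their own entries; a dedicated LaTeX-to-Unicode library may be a better fit for full coverage.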

jonathansick commented 11 years ago

Hmm. Unfortunately I don't have this problem! I ran that article with the latest 3.0.5 build and the BibDesk preview shows both accented characters just fine. What version of OS X are you running? Which Python? And which BibDesk? Are you running the "ADS to BibDesk" service or the command line version (both work fine for me)? Are you building these yourself from the GitHub repo with the python build.py script, or are you using the download from jsick.net/adsbibdesk?

keflavich commented 11 years ago

I'm using the latest from jonathansick/ads_bibdesk. And with the latest, the above Konyves/Andre paper works. So, that's no longer a problem - must have been a problem with my local version.

However, I now have a problem with THIS one: http://adsabs.harvard.edu/abs/2009ApJ...692...91G

The ADS parser seems to fail, perhaps because of the double quotes ("") in the title. Haven't debugged yet...

jonathansick commented 11 years ago

It seems that the problem with the Goodman 09 paper is with the Service version. Running on the command line I get success:

I've had a few other papers that fail with the service but work on the command line. The only difference is how the input string is passed: either via an Automator wrapper, or straight from command line arguments. I'll put some real logging code in to write debug statements to a log file and figure out how Automator is mangling the inputs.
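A minimal sketch of that kind of logging, using Python's standard logging module, might look like the following. The logger name, the helper names `make_debug_logger` and `log_input`, and the log path are all hypothetical and not taken from adsbibdesk. Logging each argument with `%r` makes stray quotes, trailing newlines and escape characters visible, which is the sort of mangling being looked for here.

```python
import logging


def make_debug_logger(log_path):
    """Return a logger that writes DEBUG-level messages to log_path."""
    logger = logging.getLogger("adsbibdesk_debug_sketch")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    # Drop handlers from earlier calls so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_input(logger, raw_args):
    """Log every input token with repr() so hidden characters show up."""
    for i, arg in enumerate(raw_args):
        logger.debug("arg[%d] = %r", i, arg)
```

Calling `log_input(logger, sys.argv[1:])` from both the service and the command line, then comparing the two log files, would show whether the two routes deliver different strings.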

keflavich commented 11 years ago

Strange, I still get the same error with the command line version. I'm pretty sure I'm up to date from your repo; I've done python build.py and in build/adsbibdesk/, python setup.py install.

$ adsbibdesk --version
3.0.5
$ adsbibdesk -d http://adsabs.harvard.edu/abs/2009ApJ...692...91G
article token http://adsabs.harvard.edu/abs/2009ApJ...692...91G
Found ADS page http://adsabs.harvard.edu/abs/2009ApJ...692...91G
derived url http://adsabs.harvard.edu/abs/2009ApJ...692...91G
ADSHTMLParser links: {}
Traceback (most recent call last):
  File "/Users/adam/virtual-python/bin/adsbibdesk", line 9, in <module>
    load_entry_point('adsbibdesk==3.0.5', 'console_scripts', 'adsbibdesk')()
  File "/Users/adam/virtual-python/lib/python2.7/site-packages/adsbibdesk-3.0.5-py2.7.egg/adsbibdesk.py", line 98, in main
    process_articles(options, args)
  File "/Users/adam/virtual-python/lib/python2.7/site-packages/adsbibdesk-3.0.5-py2.7.egg/adsbibdesk.py", line 119, in process_articles
    process_token(articleToken, prefs, insertScript)
  File "/Users/adam/virtual-python/lib/python2.7/site-packages/adsbibdesk-3.0.5-py2.7.egg/adsbibdesk.py", line 145, in process_token
    ads.author[0], '|||',
IndexError: list index out of range

keflavich commented 11 years ago

When I examined the ads object in detail, the link field was empty: it looks like the parser never found any links, despite there being plenty of a href's in the HTML.

jonathansick commented 11 years ago

I'm now logging debug statements to a file. I can see that the Service version complains about not finding links for the Goodman paper, as you describe. The script is able to download the ADS HTML fine, it just can't parse links. My suspicion is that there's a behavioural difference in python distributions; my EPD python is used on the command line, while Mountain Lion's built-in python is probably being used by the Service. I'll update this issue as I find out what's going on.
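To check which interpreter each route actually uses, a short sketch like the one below could be run (or its output logged) from both the service and the command line. The function name `describe_interpreter` is hypothetical; the calls themselves are standard-library ones.

```python
import platform
import sys


def describe_interpreter():
    """Collect the facts needed to tell two Python installations apart."""
    return {
        "executable": sys.executable,            # path of the running binary
        "version": platform.python_version(),    # e.g. "2.7.1"
        "implementation": platform.python_implementation(),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_interpreter().items()):
        print("%s: %s" % (key, value))
```

If the two routes report different executables or versions, that would support the suspicion that the parsing difference comes from the Python distribution.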

keflavich commented 11 years ago

What version of Python are you using? On my laptop, on 2.7.2, I get a timeout error; on 2.7.1 I get the above error. Were there a lot of recent updates to HTMLParser, perhaps?

jonathansick commented 11 years ago

Running Python 2.7.1 I can reproduce your bug with the Goodman paper (and many other papers that have quotation marks in their titles). The problem is that ADS puts the paper title straight into the HTML metadata:

<meta name="citation_title" content="The "True" Column Density Distribution in Star-Forming Molecular Clouds" />

Those extra quotes make for bad HTML. It seems that Python 2.7.1's HTMLParser module fails on this line, while 2.7.3 is more robust. I can fix this by preprocessing the HTML and deleting these problematic meta tags from the header.
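The preprocessing idea can be sketched as follows. This is an illustration of the approach, not the code from the actual fix; the name `strip_meta_tags` is hypothetical. It assumes no literal ">" appears inside a meta tag's attribute values, which holds for the title above.

```python
import re

# Match a whole <meta ...> tag, including self-closing "/>" forms.
# [^>]* also spans newlines, so multi-line tags are covered.
META_TAG_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def strip_meta_tags(html):
    """Remove every <meta> tag so unescaped quotes cannot confuse the parser."""
    return META_TAG_RE.sub("", html)


if __name__ == "__main__":
    sample = ('<meta name="citation_title" content="The "True" Column '
              'Density Distribution in Star-Forming Molecular Clouds" />'
              '<a href="/abs/2009ApJ...692...91G">abstract</a>')
    print(strip_meta_tags(sample))
```

Dropping all meta tags is the bluntest choice; it is only reasonable if the parser takes nothing it needs from them, so a narrower pattern may be preferable otherwise.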

I don't have Python 2.7.2 handy, so I can't say whether the timeout error is a symptom of the same problem. You can see how the parser runs by putting a simple print "TAG", tag, attrs at the beginning of the handle_starttag() method in ADSHTMLParser.
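For illustration, here is a stand-alone sketch of the same debugging trick on a small parser. `LinkCollector` is a hypothetical stand-in, not the real ADSHTMLParser, and the import falls back to the Python 2 module name so the sketch should run on either major version.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, optionally tracing every tag."""

    def __init__(self, debug=False):
        HTMLParser.__init__(self)
        self.debug = debug
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.debug:
            # Trace each start tag so it is clear where parsing stops.
            print("TAG %s %r" % (tag, attrs))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


if __name__ == "__main__":
    parser = LinkCollector(debug=True)
    parser.feed('<p><a href="http://adsabs.harvard.edu/abs/'
                '2009ApJ...692...91G">abstract</a></p>')
    parser.close()
    print(parser.links)
```

If the trace stops before the first anchor tag is reached, the tag printed last is a good candidate for the markup the parser is choking on.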

jonathansick commented 11 years ago

The problem with quotes in titles should now be fixed for Python 2.7.1! See f2902022485a1232ccf15fe7479cdb84c4fb0e04 . I'll close this now; let me know if you have continued problems in Python 2.7.2.

jonathansick / ads_bibdesk

Continued problems with accented names #17