ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

tnrs() with accented "Authority" #231

Closed jarioksa closed 10 years ago

jarioksa commented 10 years ago

tnrs() fails for me when the author name contains accented characters. One of the first cases in my species lists is this:

> tnrs("Aconitum lamarckii")
Calling http://taxosaurus.org/retrieve/d87bef2723e5f7e894aef3ccde3b63e5
Error in file(con, "r") : cannot open the connection
Error in names(tmp) <- tolower(names(tmp)) : 
  attempt to set an attribute on NULL

This taxon is in the database and can be retrieved:

$ curl http://taxosaurus.org/retrieve/d87bef2723e5f7e894aef3ccde3b63e5
{"status":"OK","names":[{"matchCount":1,"matches":[{"acceptedName":"Aconitum pyrenaicum subsp. lamarckii","sourceId":"iPlant_TNRS","score":"1","matchedName":"Aconitum lamarckii","annotations":{"Authority":"(Rchb.) O. Bol?s & Vigo"},"uri":"http://www.tropicos.org/Name/100283024"}],"submittedName":"Aconitum lamarckii"}],"metadata":{"spellcheckers":[{"name":"NCBI","description":"NCBI Spell Checker","annotations":{},"uri":"http://www.ncbi.nlm.nih.gov/","sourceId":1,"publication":"http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2578899/","call":"python2.6 tnrs_spellchecker/ncbi_spell.py","rank":1}],"sources":[{"status":"200: OK","name":"NCBI","description":"NCBI Taxonomy","uri":"http://www.ncbi.nlm.nih.gov/taxonomy","annotations":{},"sourceId":"NCBI","publication":"Federhen S. The Taxonomy Project.2002 Oct 9 [Updated 2003 Aug 13]. In: McEntyre J., Ostell J., editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US);2002.","rank":3,"code":"ICZN,ICN,ICNB"},{"status":"200: OK","name":"iPlant Collaborative TNRS v3.1","description":"The iPlant Collaborative TNRS provides parsing and fuzzy matching for plant taxa.","uri":"http://tnrs.iplantcollaborative.org/","annotations":{"Authority":"Author attributed to the accepted name (where applicable)."},"sourceId":"iPlant_TNRS","publication":"Boyle, B. et.al. The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC Bioinformatics. 2013, 14:16. doi:10.1186/1471-2105-14-16. If you use TNRS results in a publication, please also cite The Taxonomic Name Resolution Service; http://tnrs.iplantcollaborative.org; version 3.1.","rank":2,"code":"ICN"},{"status":"200: OK","name":"Mammal Species of the World v3.0","description":"Mammal Species of the World, 3rd edition (MSW3) is a database of mammalian taxonomy. 
Our adaptor searches the indexed database for both exact and loose mathces","uri":"http://www.bucknell.edu/msw3/","annotations":{"Authority":"Don E. Wilson & DeeAnn M. Reeder (editors). 2005. Mammal Species of the World. A Taxonomic and Geographic Reference (3rd ed)"},"sourceId":"MSW3","publication":"Don E. Wilson & DeeAnn M. Reeder (editors). 2005. Mammal Species of the World. A Taxonomic and Geographic Reference (3rd ed)","rank":4,"code":"ICZN"}],"sub_date":"Sun Jan 19 09:07:33 2014","resolver_version":"1.2.0","jobId":"d87bef2723e5f7e894aef3ccde3b63e5"}}

Sorry for the long line; that is how it comes from curl. I have several such cases in my species list: so many that more than half of the default 30-taxon batches contain one of them, and a whole batch fails if it contains one such taxon. The common feature of these taxa is that the entry contains iso-8859 encoded accented characters. In this case it is the "Authority" Bolòs, which appears as Bol?s in the output above. So I use UTF-8 encoding, but the output is returned as iso-8859-something. The failure seems to happen in ss <- GET(retrieve) in tnrs(), which returns the data as NA if there is even one such ? character.

Obviously I run UTF-8 locale, and the output is interpreted as such although it is not.
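The mismatch can be reproduced without the server. This is a minimal base-R sketch, assuming the accented letter arrives as the single Latin-1 byte 0xF2; the byte vector is constructed by hand for illustration and is not taken from the actual response. It uses "latin1" as the source encoding name because ò has the same code in Latin-1 and ISO-8859-15.

```r
# Bytes of "Bolòs" as they would be sent in Latin-1: "ò" is the single byte 0xF2.
latin1_bytes <- as.raw(c(0x42, 0x6f, 0x6c, 0xf2, 0x73))
as_received <- rawToChar(latin1_bytes)

# Read as UTF-8, 0xF2 starts a multi-byte sequence that is never completed,
# so the string is not valid UTF-8.
validUTF8(as_received)   # FALSE

# Converting from the encoding the bytes are really in gives valid UTF-8.
fixed <- iconv(as_received, from = "latin1", to = "UTF-8")
validUTF8(fixed)         # TRUE
```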

As you see, this is not an issue with taxize but rather an issue with usage…

Cheers, Jari Oksanen

sckott commented 10 years ago

Hi @jarioksa - Thanks for the bug report!

tnrs has been kind of a mess. POST requests don't seem to work as expected, and they don't seem to return proper JSON objects, e.g. the return header has Content-Type: text/html, and I would think it should be application/json instead.

I am getting the same bug. I'll look into this and get back to you very soon.

jarioksa commented 10 years ago

This really seems to be an issue of text encoding. Taxosaurus returns Latin1 (ISO-8859-15 or related), but the response is interpreted as UTF-8, in which those bytes are invalid. The interpretation happens pretty deep in the httr package, and I do not know how to inform those functions about the correct encoding. The call path seems to be taxize::tnrs -> httr::GET -> httr:::make_request (where the last is a non-exported function). Here the contents of headers are:

> str(headers)
List of 8
 $ date          : chr "Mon, 20 Jan 2014 06:19:33 GMT"
 $ server        : chr "Apache/2.2.3 (CentOS)"
 $ content-length: chr "2414"
 $ x-powered-by  : chr "Perl Dancer 1.3111"
 $ connection    : chr "close"
 $ content-type  : chr "text/html; charset=UTF-8"
 $ status        : chr "200"
 $ statusmessage : chr "OK"
 - attr(*, "class")= chr [1:2] "insensitive" "list"

Setting headers[["content-type"]] <- "text/html; charset=ISO-8859-15" (that is, without even changing it to application/json) returns the result:

       submittedname                         acceptedname    sourceid score
1 Aconitum lamarckii Aconitum pyrenaicum subsp. lamarckii iPlant_TNRS     1
         matchedname             annotations
1 Aconitum lamarckii (Rchb.) O. Bolòs & Vigo
                                     uri
1 http://www.tropicos.org/Name/100283024
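
The effect of changing the charset can be sketched outside httr. This is only an illustration in base R, not how httr does it internally: charset_of() and decode_body() are hypothetical helpers, and the body bytes are hand-made rather than taken from the real response.

```r
# Hypothetical helper: extract the charset from a Content-Type value.
charset_of <- function(content_type) {
  m <- regmatches(
    content_type,
    regexpr("charset=[^;[:space:]]+", content_type, ignore.case = TRUE)
  )
  if (length(m) == 0) return(NA_character_)
  sub("^charset=", "", m, ignore.case = TRUE)
}

# Hypothetical helper: decode raw body bytes using the given charset.
decode_body <- function(body, charset) {
  txt <- rawToChar(body)
  if (toupper(charset) %in% c("UTF-8", "UTF8")) {
    # Nothing to convert; only check that the bytes really are UTF-8.
    if (!validUTF8(txt)) return(NA_character_)
    Encoding(txt) <- "UTF-8"
    return(txt)
  }
  iconv(txt, from = charset, to = "UTF-8")
}

body <- as.raw(c(0x42, 0x6f, 0x6c, 0xf2, 0x73))  # "Bolòs" in Latin-1
declared <- charset_of("text/html; charset=UTF-8")

decode_body(body, declared)   # NA: the bytes are not valid UTF-8
decode_body(body, "latin1")   # "Bolòs"
```

Trusting the declared charset gives NA, which matches the symptom in tnrs(); decoding with the charset the bytes are actually in recovers the name.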

jarioksa commented 10 years ago

It very much looks to me that this is a problem of mis-configuration in taxosaurus.org web server. If I only ask for a header (with curl -I in terminal), I get the following:

$ curl -I  http://taxosaurus.org/retrieve/d87bef2723e5f7e894aef3ccde3b63e5
HTTP/1.1 200 OK
Date: Mon, 20 Jan 2014 07:53:27 GMT
Server: Apache/2.2.3 (CentOS)
Content-Length: 2414
X-Powered-By: Perl Dancer 1.3111
Connection: close
Content-Type: text/html; charset=UTF-8

So this is the information that the taxosaurus.org server sends to the world, and it is wrong: the page is in Latin1 or some other iso-8859 encoding (iso-8859-15 worked when I tried), not UTF-8. Also, if you open the address (the cryptic retrieve one) in a browser, it honours the Content-type information and renders the JSON output incorrectly; you must manually select the correct text encoding to see the result.

I haven't found a way to override the wrong header info when reading the page, but I'm not an http wizard. GET takes a config argument, but at least I haven't been able to write it so that it would replace the wrong Content-type with the correct one. I tried this one in vain:

GET("http://taxosaurus.org/retrieve/5c767df1a2f6e16c6b8d4b03ed701678", 
config=add_headers("Content-type"="application/json; charset=iso-8859-15"))
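
One reason the attempt above cannot work is that add_headers() sets headers on the outgoing request, so it has no influence on how the server labels its response. A possible client-side workaround, sketched here under the assumption that the installed httr version has the encoding argument of content(), is to decode the body with an encoding chosen by the caller. The function name fetch_as() is hypothetical.

```r
# Sketch: fetch a URL and decode the body with a caller-chosen encoding,
# ignoring the charset declared in the response's Content-Type header.
# Assumes the httr package is installed; fetch_as() is a hypothetical name.
fetch_as <- function(url, encoding = "ISO-8859-15") {
  res <- httr::GET(url)
  httr::stop_for_status(res)
  httr::content(res, as = "text", encoding = encoding)
}

# Usage (needs network access and a valid retrieve URL, so not run here):
# txt <- fetch_as("http://taxosaurus.org/retrieve/<jobId>")
```

This only hides the problem on the client; every other consumer of the service would still see the wrong charset.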

An optimal solution might be to ask the good people at Taxosaurus.org server to fix their headers so that the Content-type would match the data they send.

sckott commented 10 years ago

I agree it definitely seems like an encoding issue. Thanks for tracking it down!

I will talk to the Taxosaurus folks and see if they can fix this. I need to talk to them anyway on another issue with the API.

sckott commented 10 years ago

Okay, I have talked with Naim, who maintains Taxosaurus, and he said he will fix the encoding issue and hopefully also serve the Content-type as application/json. I'll update here again when that happens.

jarioksa commented 10 years ago

It seems that this has happened:

tnrs("Aconitum lamarckii")
Calling http://taxosaurus.org/retrieve/773fc6478e8ccf8a21bd46b99459a1e7
       submittedname                         acceptedname    sourceid score
1 Aconitum lamarckii Aconitum pyrenaicum subsp. lamarckii iPlant_TNRS     1
         matchedname             annotations
1 Aconitum lamarckii (Rchb.) O. Bolòs & Vigo
                                     uri
1 http://www.tropicos.org/Name/100283024

Thanks to you Scott and Naim for the work!

sckott commented 10 years ago

Great, that works for me too! It looks like the data is still being returned as text/html, but it's good that the encoding was fixed!