Closed jarioksa closed 10 years ago
Hi @jarioksa - Thanks for the bug report!
tnrs has been kind of a mess. POST requests don't seem to work as expected, and they don't seem to return proper JSON objects, e.g. the return header has Content-Type: text/html, and I would think it should be application/json instead.
I am getting the same bug. I'll look into this and get back to you very soon.
This really seems to be an issue of text encoding. Taxosaurus returns Latin1 (ISO-8859-15 or related), but the response is interpreted as UTF-8 and is then invalid. The interpretation happens pretty deep in the httr package, and I do not know how to inform those functions about the correct encoding. The call path seems to be taxize::tnrs -> httr::GET -> httr:::make_request (where the last is a non-exported function). Here the contents of headers are:
> str(headers)
List of 8
$ date : chr "Mon, 20 Jan 2014 06:19:33 GMT"
$ server : chr "Apache/2.2.3 (CentOS)"
$ content-length: chr "2414"
$ x-powered-by : chr "Perl Dancer 1.3111"
$ connection : chr "close"
$ content-type : chr "text/html; charset=UTF-8"
$ status : chr "200"
$ statusmessage : chr "OK"
- attr(*, "class")= chr [1:2] "insensitive" "list"
Setting headers[["content-type"]] <- "text/html; charset=ISO-8859-15" (that is, not even application/json) will return the result:
submittedname acceptedname sourceid score
1 Aconitum lamarckii Aconitum pyrenaicum subsp. lamarckii iPlant_TNRS 1
matchedname annotations
1 Aconitum lamarckii (Rchb.) O. Bolòs & Vigo
uri
1 http://www.tropicos.org/Name/100283024
It very much looks to me like this is a misconfiguration of the taxosaurus.org web server. If I only ask for a header (with curl -I in terminal), I get the following:
$ curl -I http://taxosaurus.org/retrieve/d87bef2723e5f7e894aef3ccde3b63e5
HTTP/1.1 200 OK
Date: Mon, 20 Jan 2014 07:53:27 GMT
Server: Apache/2.2.3 (CentOS)
Content-Length: 2414
X-Powered-By: Perl Dancer 1.3111
Connection: close
Content-Type: text/html; charset=UTF-8
So this is the information that the taxosaurus.org server sends to the world, and this is wrong: the page will be in Latin1, or some iso-8859 format (iso-8859-15 worked when I tried) instead of UTF-8. Also, if you open the address (the cryptic retrieve one) in a browser, it honours the Content-type information and shows the JSON output with wrong formatting; you must manually select the correct text encoding to see the result.
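To make the mismatch concrete, here is a minimal offline sketch in base R. It assumes the author name "Bolòs" arrives as single-byte Latin1; the byte vector is typed in by hand for illustration and is not taken from an actual Taxosaurus response.

```r
# "Bolòs" as a Latin1 / ISO-8859-15 server would send it:
# the letter "ò" is the single byte 0xF2.
latin1_bytes <- as.raw(c(0x42, 0x6f, 0x6c, 0xf2, 0x73))
as_received <- rawToChar(latin1_bytes)

# In UTF-8, 0xF2 announces a four-byte sequence, but the next byte (0x73, "s")
# is not a continuation byte, so the string is not valid UTF-8.
is_valid_utf8 <- validUTF8(as_received)

# Treating the bytes as UTF-8 anyway makes the conversion fail and yield NA.
read_as_utf8 <- iconv(as_received, from = "UTF-8", to = "UTF-8")

print(is_valid_utf8)
print(is.na(read_as_utf8))
```

The NA from the failed conversion is at least consistent with the NA seen when the retrieve step hits an accented author name.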
I haven't found a way to override the wrong header info when reading the page, but I'm not an http wizard. GET takes a config argument, but at least I haven't been able to write it so that it would replace the wrong Content-type with the correct one. I tried this one in vain:
GET("http://taxosaurus.org/retrieve/5c767df1a2f6e16c6b8d4b03ed701678",
config=add_headers("Content-type"="application/json; charset=iso-8859-15"))
An optimal solution might be to ask the good people running the Taxosaurus.org server to fix their headers so that the Content-type would match the data they send.
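As far as I understand, add_headers() sets headers on the outgoing request, so it cannot change how the response is decoded. A possible client-side workaround is to take the raw body and decode it with a charset chosen by the caller. The sketch below is only that: decode_body() and fetch_decoded() are hypothetical helpers (not part of httr or taxize), the ISO-8859-15 default follows the guess above, and the JSON fragment is made up for illustration.

```r
# Hypothetical helper: decode a raw response body using a caller-chosen
# charset instead of the one announced in the Content-Type header.
decode_body <- function(body_raw, charset = "ISO-8859-15") {
  stopifnot(is.raw(body_raw))
  iconv(rawToChar(body_raw), from = charset, to = "UTF-8")
}

# Hypothetical wrapper showing where the helper would sit around httr.
# It is only defined here, not called, so no request is sent.
fetch_decoded <- function(url, charset = "ISO-8859-15") {
  res <- httr::GET(url)
  decode_body(httr::content(res, as = "raw"), charset)
}

# Offline stand-in for a response body: a made-up JSON fragment in which
# "ò" is the single Latin1 byte 0xF2.
body_raw <- c(charToRaw('{"authority":"Bol'), as.raw(0xf2), charToRaw('s"}'))
decoded <- decode_body(body_raw)
print(validUTF8(decoded))
```

With a real response object, httr::content(res, as = "text", encoding = "ISO-8859-15") should have much the same effect, since content() accepts an encoding argument that overrides the declared charset.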
I agree it definitely seems like an encoding issue. Thanks for tracking it down!
I will talk to the Taxosaurus folks and see if they can fix this. I need to talk to them anyway on another issue with the API.
Okay, I have talked with Naim who maintains Taxosaurus, and he said he will fix the encoding issue and hopefully have Content-type as application/json now. I'll update here again when that happens.
It seems that this has happened:
tnrs("Aconitum lamarckii")
Calling http://taxosaurus.org/retrieve/773fc6478e8ccf8a21bd46b99459a1e7
submittedname acceptedname sourceid score
1 Aconitum lamarckii Aconitum pyrenaicum subsp. lamarckii iPlant_TNRS 1
matchedname annotations
1 Aconitum lamarckii (Rchb.) O. Bolòs & Vigo
uri
1 http://www.tropicos.org/Name/100283024
Thanks to you Scott and Naim for the work!
Great, that works for me too! It looks like the data is still being returned as text/html, but it's good that the encoding was fixed!
tnrs() fails for me when the name's author has accented characters. One of the first cases in my species list is this:
This taxon is in the database and can be retrieved:
Sorry for the long line: so it comes from curl. I have several such cases in my species list. So many that more than half of the default 30-taxon packages have one of them -- and the whole package will fail with one such taxon. The common feature of these taxa is that the entry contains iso-8859 coded accented characters. In this case it is "Authority" Bolòs, which appears as Bol?s in the list above. So I use UTF-8 encoding and the output is returned as iso-8859-something. It seems that this happens in ss <- GET(retrieve) in tnrs(), which returns data as NA if there is even one such ? character. Obviously I run a UTF-8 locale, and the output is interpreted as such although it is not.
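Since one bad taxon sinks a whole package, one way to limit the damage on the caller's side might be to retry a failed package one name at a time. This is a rough sketch, not how tnrs() works: resolve_with_fallback() and fake_lookup() are hypothetical, and the lookup function is assumed to return one result per name, or a single NA when the whole package fails.

```r
# Hypothetical wrapper: query names in packages of `batch_size`, and when a
# package comes back as a single NA, retry its names individually so that only
# the offending taxon is lost.
resolve_with_fallback <- function(taxa, lookup, batch_size = 30) {
  batches <- split(taxa, ceiling(seq_along(taxa) / batch_size))
  out <- lapply(batches, function(batch) {
    res <- lookup(batch)
    if (length(batch) > 1 && length(res) == 1 && is.na(res)) {
      res <- vapply(batch, function(nm) as.character(lookup(nm)), character(1))
    }
    res
  })
  unlist(out, use.names = FALSE)
}

# Stand-in lookup for illustration only: the whole package fails if it
# contains the problematic name, otherwise names are simply upper-cased.
fake_lookup <- function(batch) {
  if (any(grepl("lamarckii", batch))) return(NA_character_)
  toupper(batch)
}

taxa <- c("Abies alba", "Aconitum lamarckii", "Poa annua")
resolved <- resolve_with_fallback(taxa, fake_lookup, batch_size = 2)
print(resolved)
```

The retry costs extra requests, so it would only make sense while the server-side encoding is unfixed.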
As you see, this is not an issue with taxize but rather an issue with usage…
Cheers, Jari Oksanen