sckott / pytaxize

python port of taxize (taxonomy toolbelt) for R
https://sckott.github.io/pytaxize/
MIT License

Support multiple names for gnr_resolve() #12

Closed · lyttonhao closed this issue 9 years ago

lyttonhao commented 9 years ago

Since the return line in gnr.py only returns the first result, the current gnr_resolve doesn't support returning results for multiple names. I changed that line to return all results. It works well when the query contains about 100 names, but raises a "No JSON object could be decoded" error when the number is larger. I haven't fixed that yet. Can anyone help?
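Roughly, the change I made looks like this (a hypothetical sketch, not the exact gnr.py source; the function name and the shape of `result_json` are my assumptions):

    def _parse_results(result_json):
        # Sketch only: assumes the GNR response JSON holds one entry per
        # queried name under 'data'.
        # Old behaviour: return result_json['data'][0]  (first name only)
        # New behaviour: return the entry for every queried name
        return result_json['data']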

sckott commented 9 years ago

hi @lyttonhao I'll take a look later today...

lyttonhao commented 9 years ago

Thanks. @sckott

sckott commented 9 years ago

@lyttonhao I used the fix from your fork for parsing more than one result, and fixed it so that it works with more than one name passed in.

Can you share the example that was failing for you?

lyttonhao commented 9 years ago

Okay. I will test the new code soon. Thanks, @sckott.

lyttonhao commented 9 years ago

Hi @sckott, I think the problem I ran into before is still there. Testing with 300 names works well, but it fails when querying 500 or more names. It seems the query parameters can't be too long.

sckott commented 9 years ago

See the documentation for the API at http://resolver.globalnames.org/api: they allow both GET and POST requests. I don't think they say so in those docs, but I found at https://github.com/ropensci/taxize/blob/master/R/gnr_resolve.R#L23-L25 that up to 300 names works okay with GET, but beyond that POST is better.

It should be a simple thing to add POST support if you are interested.
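Something like this would do it (a rough sketch; the 300-name cutoff comes from the taxize code linked above, and the function name and GET handling here are illustrative, not the current pytaxize internals):

    import requests

    API_URL = 'http://resolver.globalnames.org/name_resolvers.json'

    def resolve_names(names, http='get'):
        # Sketch: GET is fine up to ~300 names, POST beyond that
        if http == 'get' and len(names) <= 300:
            # GNR accepts pipe-separated names in the query string
            out = requests.get(API_URL, params={'names': '|'.join(names)})
            out.raise_for_status()
            return out.json()
        # For longer lists, upload the names via a POST request instead
        raise NotImplementedError('POST support still to be added')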

lyttonhao commented 9 years ago

Hi @sckott, I've added some code to support POST, following https://github.com/ropensci/taxize/blob/master/R/gnr_resolve.R#L86-L97. Below is my corresponding code:

    elif http == 'post':
        # Write the names to a temp file, one per line, each prefixed
        # with an id and a pipe as the GNR upload format expects
        with open('__gnr_names.txt', 'wb') as f:
            for name in names:
                f.write("1|%s\n" % name)
        payload = {'data_source_ids': source, 'format': format,
                   'resolve_once': resolve_once, 'with_context': with_context,
                   'best_match_only': best_match_only, 'header_only': header_only,
                   'preferred_data_sources': preferred_data_sources}
        # POST the file; the API responds with the URL of a processing job
        out = requests.post(url, params=payload,
                            files={'file': open('__gnr_names.txt', 'rb')})
        out.raise_for_status()
        result_json = out.json()
        newurl = result_json['url']
        # Keep polling the job URL until the server reports it is done
        while result_json['status'] == 'working':
            # print result_json['message']
            out = requests.get(url=newurl)
            result_json = out.json()

However, it seems that the `while result_json['status'] == 'working':` loop becomes an infinite loop. Can you give some advice? Thank you very much.

sckott commented 9 years ago

@lyttonhao I'll have a look soon; I'm trying to get testing and CI set up first, so we can have checks on all changes/PRs, etc.

sckott commented 9 years ago

@lyttonhao That while loop is used because when you send a POST request you get back a URL for a job that is processing, and you need to send a new GET request to that URL to retrieve the data. So the while loop keeps pinging the server until it retrieves the data itself, not just a message saying the job is still working. Does that make sense? Send a PR when you think you've got it solved, or even if you don't, and I can take a look and see if I can help.
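The pattern looks roughly like this (a sketch; the sleep interval and attempt cap are arbitrary values I'm adding so a stuck job can't spin forever, they're not part of the API):

    import time
    import requests

    def poll_gnr_job(job_url, delay=5, max_attempts=60):
        # Sketch: poll the job URL the POST request handed back until
        # the data itself arrives, not just a 'working' status message
        for _ in range(max_attempts):
            out = requests.get(job_url)
            out.raise_for_status()
            result_json = out.json()
            if result_json.get('status') != 'working':
                return result_json  # job finished; full payload is here
            time.sleep(delay)  # wait politely between polls
        raise RuntimeError("GNR job still 'working' after polling timed out")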

lyttonhao commented 9 years ago

Hi @sckott, I'm very sorry that I missed your message these past days. Do you mean changing the while condition? I've changed it to `while 'data' not in result_json:`, but it still doesn't work. Hi @panks, I tried adding `time.sleep(10)` as in your code, but I'm afraid it's still an infinite loop on my machine. Does it work on your machine when the number of queried names is larger than 1000?

panks commented 9 years ago

@lyttonhao I'm not sure. I think that when the GNR API starts operating in queue mode it doesn't work as it's supposed to. Here is the response URL for a job I submitted more than 6 hours ago, with a query size of 1,010: http://resolver.globalnames.org/name_resolvers/5jyg8wkhbvoa.json

It still shows the status as 'working'. Maybe they need to fix things on their end. But at least we got it working for query sizes > 300 but < 1000 by adding POST.

@sckott Any ideas?

lyttonhao commented 9 years ago

@panks I also suspect the back-end code has some bugs. Since I haven't used R before, I haven't tested taxize. @sckott does it work well in taxize?

sckott commented 9 years ago

@lyttonhao @panks I'll take a look at this

panks commented 9 years ago

If there isn't any hope of it working, then one thing we can do is split lists of size > 1000 into smaller chunks and concatenate their results.

sckott commented 9 years ago

@panks @lyttonhao I just played with this in R, and it seems that when the number of names is > 1000 the job never finishes. I am asking about this now. We should probably not pass more than 1000 names, so just break up into chunks of < 1000 and pass those.
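Roughly like this (a sketch; `gnr_resolve` stands in for whatever the final function ends up being, and the chunk size is just a safe value under the limit):

    def resolve_in_chunks(names, chunk_size=500):
        # Sketch: split a long name list into chunks below the ~1000-name
        # limit, resolve each chunk, and concatenate the results
        results = []
        for i in range(0, len(names), chunk_size):
            chunk = names[i:i + chunk_size]
            results.extend(gnr_resolve(chunk))  # assumes a list is returned
        return results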

sckott commented 9 years ago

see GlobalNamesArchitecture/gni#37

panks commented 9 years ago

Yeah I guess splitting the list is the best way to go as of now. I will do that and send a PR. Thanks!