iobis / pyobis

OBIS Python client
https://iobis.github.io/pyobis
MIT License

occurrence search using datasetid #108

Open MathewBiddle opened 1 year ago

MathewBiddle commented 1 year ago

I'm trying to rework an old notebook to use this package.

I have this piece of code:

from pyobis.occurrences import OccQuery

datasetid = '2ae2a2bd-8412-405b-8a9f-b71adc41d4c5'

occ = OccQuery()
dataset = occ.search(datasetid=datasetid)

but it didn't work; it took too long.

Here are the expected details from my other process:

OBIS Dataset page: https://obis.org/dataset/2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
API request: https://api.obis.org/v3/occurrence?datasetid=2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
Found 698,900 records.

ayushanand18 commented 1 year ago

I tried to reproduce the same error, but I found something interesting: there seems to be something going on with the request cache.

I don't know why it behaves this way.
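One quick way to probe this (a minimal sketch of my own, not pyobis code; the size parameter just keeps the response small) is to time two identical requests back to back and see whether the second one comes back faster:

# Minimal timing sketch (not pyobis code): fire the same OBIS query twice
# and compare wall-clock times; a much faster second call would hint at
# caching somewhere along the way.
import time

import requests

url = "https://api.obis.org/v3/occurrence"
params = {"datasetid": "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5", "size": 10}

for attempt in (1, 2):
    start = time.perf_counter()
    response = requests.get(url, params=params)
    elapsed = time.perf_counter() - start
    print(f"attempt {attempt}: HTTP {response.status_code} in {elapsed:.2f}s")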

MathewBiddle commented 1 year ago

So, pyobis is doing something odd with the query/response. The turnaround time is way too slow compared to just using urllib.request.urlopen() and manually building the URLs.

I think something is up with requests and how the query is being performed: https://github.com/iobis/pyobis/blob/8c33cae251e93eefcd7ad708d243da43a51f9ab9/pyobis/obisutils.py#L48
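For context, that line sits inside a helper that fetches a URL with requests; something like this hypothetical simplification (my own sketch, not the actual obisutils code):

# Hypothetical simplification of the kind of requests-based helper that
# line points at -- not the actual obisutils code.
import requests

def obis_get(url, args, **kwargs):
    # Let requests build the query string from args, then parse the JSON body.
    out = requests.get(url, params=args, **kwargs)
    out.raise_for_status()
    return out.json()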

This Stack Overflow thread might be helpful in deducing the issue.

At this point, the package is not very useful to me because it takes 10+ minutes to run a search and get responses. Granted, I am trying to return 621,066 records, but it works just fine using urllib.
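Roughly, the manual approach that works for me looks like this (a sketch; I'm assuming the response carries a total count alongside the results, as the API request above shows):

# Sketch of the manual approach: build the URL by hand and fetch it with
# urllib instead of going through pyobis.
import json
import urllib.request

datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5"
url = f"https://api.obis.org/v3/occurrence?datasetid={datasetid}"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

print(f"Found {data['total']} records.")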

ayushanand18 commented 1 year ago

> At this point, the package is not very useful to me because it takes 10+ minutes to run a search and get responses. Granted, I am trying to return 621,066 records, but it works just fine using urllib.

Thank you so much for highlighting this issue. This had been on my to-do list for quite some time, and I had been experimenting with urllib but couldn't get satisfactory improvements. While the improvement from using urllib over requests was only around 25%, the method suggested in the Stack Overflow thread you attached brought more than a 75% savings in time.

I used a User-Agent string header in the request and found that the improvement was really significant. Something like this:

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36", "Connection":"close"}
out = requests.get(url, params=args, headers = headers, **kwargs)
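For what it's worth, my best guess at why this helps: some servers respond more slowly to the default python-requests User-Agent, and "Connection": "close" avoids keep-alive bookkeeping on one-shot requests. I haven't isolated which of the two makes the difference here.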

I'll open a PR for this as soon as possible. Thanks again!