occurrences.getpoints response is limited to 10k records

MathewBiddle commented 1 year ago

I think there is a limit on the response from getpoints of 10k records.

When I query getpoints by instituteid, it returns exactly 10,000 records.

>>> import pandas as pd
>>> import pyobis
>>> df_inst = pd.DataFrame(pyobis.occurrences.getpoints(instituteid='23070').execute())
>>> df_inst.shape[0]
10000

However, when I iterate through each dataset, call getpoints on it, and concatenate the responses, I get 20,297 records.

>>> combined = pd.DataFrame()
>>> query = pyobis.dataset.search(instituteid='23070')
>>> df = pd.DataFrame(query.execute())
>>> df_meta = pd.DataFrame.from_records(df["results"])
>>> for datasetid in df_meta["id"]:
...     dset = pyobis.occurrences.getpoints(datasetid=datasetid).execute()
...     df = pd.DataFrame(dset)
...     combined = pd.concat([combined, df], ignore_index=True)
...
>>> combined.shape[0]
20297

I think this is on the API side, because https://api.obis.org/v3/occurrence/points?instituteid=23070 returns only 10k records, not the expected 20,297.
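The per-dataset workaround above could be wrapped in a small helper that also flags any single dataset whose response reaches the cap, since that dataset may itself be truncated. This is only a sketch: `collect_per_dataset`, its `fetch` callable, and the `cap` default are hypothetical names for illustration, not part of pyobis, and the 10,000 figure is simply the limit observed in this issue.

```python
def collect_per_dataset(dataset_ids, fetch, cap=10_000):
    """Fetch records one dataset at a time and concatenate them.

    fetch is any callable that takes a dataset id and returns an iterable
    of records (hypothetical; in practice it would wrap a getpoints call).
    Returns (records, capped), where capped lists the dataset ids whose
    response reached the assumed cap and may therefore be incomplete.
    """
    records = []
    capped = []
    for dataset_id in dataset_ids:
        rows = list(fetch(dataset_id))
        if len(rows) >= cap:
            # A response of exactly the cap size is a hint of truncation.
            capped.append(dataset_id)
        records.extend(rows)
    return records, capped
```

With pyobis, `fetch` might be something like a function that calls `pyobis.occurrences.getpoints(datasetid=...)` and returns the rows of the response; any dataset reported in `capped` would need a narrower query.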

ayushanand18 commented 1 year ago

Yes, I think this is on the API side, because neither the size nor the skip parameter works at this endpoint.

One observation on your code snippet: pyobis provides a to_pandas method on every response object, across all modules, that converts the response directly to a pandas DataFrame and flattens the nested JSON in some fields, so you don't have to do it by hand. For example:

# construct the query
query = pyobis.dataset.search(instituteid='23070', size=100)
# execute the query; the response is stored on the object and stays available as `query.data`
query.execute()
# get a DataFrame with most of the nested JSON flattened
df = query.to_pandas()

MathewBiddle commented 1 year ago

Do we have a mechanism for giving feedback to the API maintainer, so that the service can return more records for requests like this?

ayushanand18 commented 1 year ago

I don't think either the API or its Swagger contract is open source. The best option is probably to contact @pieterprovoost about this.

pieterprovoost commented 1 year ago

I have increased the limit to 100,000 and added a warning in the endpoint description. If you are still hitting this limit I recommend using https://api.obis.org/v3/occurrence/grid/8?instituteid=23070 instead and decreasing the precision. This does the same aggregation as /occurrence/points but with a configurable geohash precision and GeoJSON polygon results.
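As a rough illustration of the suggestion above, the grid URL can be built for different geohash precisions using only the standard library. The helper names `grid_url` and `fetch_grid` are hypothetical, and the sketch assumes the endpoint accepts the same filter parameters as /occurrence/points and returns JSON (GeoJSON, per the comment above).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OBIS_API = "https://api.obis.org/v3"


def grid_url(precision, **filters):
    """Build an /occurrence/grid URL for a given geohash precision.

    A lower precision means larger cells, so fewer features come back.
    """
    return f"{OBIS_API}/occurrence/grid/{precision}?{urlencode(filters)}"


def fetch_grid(precision, **filters):
    """Request the grid and decode the JSON body (needs network access)."""
    with urlopen(grid_url(precision, **filters)) as response:
        return json.load(response)
```

For example, `grid_url(8, instituteid="23070")` reproduces the URL given above, and passing a smaller precision such as 4 requests coarser cells if the result is still too large.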

iobis / pyobis

occurrences.getpoints response is limited to 10k records #126