Automatically add `id param` to fix "Error in occurrences.execute: key "id" doesnt exist"

MathewBiddle commented 1 year ago

I'm trying to query for records associated with an institution and only returning the latitude/longitude pairs. Unfortunately, it looks like outdf is missing the key id, but I'm wondering if this should be datasetid or something along those lines?

Code snippet:

institution_id = 23070
fields = ['decimalLatitude','decimalLongitude']
pyobis.occurrences.search(instituteid = institution_id, fields=fields).execute()

returns

Fetching: [....................................................................................................] 5000/2000024

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\programs\Miniforge\envs\IOOS\Lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~\programs\Miniforge\envs\IOOS\Lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~\programs\Miniforge\envs\IOOS\Lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'id'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[62], line 1
----> 1 pyobis.occurrences.search(instituteid = institution_id, fields = 'decimalLatitude').execute()

File ~\programs\Miniforge\envs\IOOS\Lib\site-packages\pyobis\occurrences\occurrences.py:106, in OccResponse.execute(self, **kwargs)
     98     outdf = pd.concat(
     99         [
    100             outdf.infer_objects(),
   (...)
    103         ignore_index=True,
    104     )
    105     # make sure that we set the `after` parameter when fetching subsequent records
--> 106     self.__args["after"] = outdf["id"].iloc[-1]
    108 self.__args["size"] = size % 5000
    109 # we have already fetched records as a set of 5000 records each time,
    110 # now we need to get remaining records from the total

File ~\programs\Miniforge\envs\IOOS\Lib\site-packages\pandas\core\frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~\programs\Miniforge\envs\IOOS\Lib\site-packages\pandas\core\indexes\base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 'id'

MathewBiddle commented 1 year ago

FWIW, I think this is only an issue when someone is trying to return more than the limit. For example,

pyobis.occurrences.search(datasetid = 'c606c47a-3892-4645-9521-630c9085e59f', fields = fields).execute()

works as expected.

Fetching: [████████████████████████████████████████████████████████████████████████████████████████████████████] 2341/2341
Fetched 2341 records.

ayushanand18 commented 1 year ago

Thanks for raising this issue.

When records exceed the threshold of 10k, the package does pagination in the background. For paginating we require the occurrence id or id field because that can uniquely identify records and is required by the API to successfully fetch subsequent batch of records. When we pass a fields parameter, the API returns only those attributes in the JSON response. But id is still required to paginate. This is why the process throws an error of a missing id. The simple workaround would be to include id too in the fields list.

ayushanand18 commented 1 year ago

Something like this, could work

institution_id = 23070
fields = ['decimalLatitude','decimalLongitude', 'id']
occurrences.search(instituteid = institution_id, fields=fields).execute()

MathewBiddle commented 1 year ago

Thanks for digging into this and figuring out a quick solution! It seems to be working as expected now.

Two thoughts:

We should, at least, update the docs to highlight this. Not sure where to add it, however.
Would it be reasonable to add to the package a mechanism automatically includes 'id' when using fields parameter?

7yl4r commented 1 year ago

Would it be reasonable to add to the package a mechanism automatically includes 'id' when using fields parameter?

This seems reasonable to me unless there a case where you would want to exclude id. I think the only usage is to get only one page of results; if that is wanted it should be implemented differently. I am renaming this issue so we can use it to track this as feature request.

ayushanand18 commented 1 year ago

We should, at least, update the docs to highlight this. Not sure where to add it, however.

Yes, I think the docs need to be updated with all the fixes we have seen so far.

Would it be reasonable to add to the package a mechanism automatically includes 'id' when using fields parameter?

I think we should but I have had one observation especially regarding the coordinates querying. When we query for only decimalLongitude or decimalLatitude then unique values are presented, which would barely make more than 10k for any taxa. Even though we have 20k occurrence records, only a dozen of unique coordinates are presented. So, adding id there would add to high network overhead.

Maybe we can just introduce a paginate=False parameter so that it just doesn't add up id and is quite fast. But by default it will be set to True and we will add id automatically. Thoughts?

7yl4r commented 1 year ago

Yep, that sounds even better.

ayushanand18 commented 1 year ago

EDIT: The request works now.

~Digging into this issue today, and the API returned no records even though the total is still a very large count :)~

The URL I pinged: https://api.obis.org/v3/occurrence?fields=%5B%27decimalLatitude%27%2C+%27decimalLongitude%27%5D&instituteid=23070

iobis / pyobis

Automatically add `id param` to fix "Error in occurrences.execute: key "id" doesnt exist" #125