iobis / pyobis

OBIS Python client
https://iobis.github.io/pyobis
MIT License
14 stars 10 forks source link

occ.search(...,mof=True) & occ.search(...,mof=False) are inconsistent #61

Closed 7yl4r closed 2 years ago

7yl4r commented 2 years ago

The output of occ.search is structured differently if mof is True.


occ.search(scientificname="Egregia menziesii", hasextensions="MeasurementOrFacts")

image

vs

occ.search(scientificname="Egregia menziesii", mof=True, hasextensions="MeasurementOrFacts")

image


When mof=False, the total number of results is included in a column, but when mof=True it is as if res = res["results"] has been performed.


The output should not change in this way when mof is True vs False.

ayushanand18 commented 2 years ago

The number of records might change because for one single id there are many MoF records present each pertaining to different measurements like substrate, cover and relief.

For the first output when mof=False the output is weird because we need to pass only the results into the pd.DataFrame to unflatten it like,

res = occ.search(scientificname="Egregia menziesii", hasextensions="MeasurementOrFacts")
df = pd.DataFrame(res["results"])

because pyobis already does this in the background when mof=True but not in other cases.

So when we unflatten those nested mof records and then perform an inner join to copy other parameters values from the original records on the id field, we get different number of rows. But since we performed an inner join, this should not create exactly duplicate records until or unless they actually are, I suppose.

ayushanand18 commented 2 years ago

I kept the output of response in JSON with "total" and "results" parameters because I thought pandas might not be the only tool researchers would use to analyse data. Also, we utilise the "total" parameter to paginate in the occurrences.search() function.

ayushanand18 commented 2 years ago
I did an interesting experiment with occurrences->search() function. I did this on two sets of conditions. Condition Description Time taken while fetching records (%)
1 occ.search returns JSON output (we are appending new responses on JSON object) 100%
2 occ.search returns pandas DataFrame (we are appending new responses on a response DataFrame 110.18%

Test Environment:

Some sources of error:

So, even though pandas dataframe is very much useful to us, it might be slower than expected. Although the actual speed of acquiring data would be heavily dependent on the network strength.

7yl4r commented 2 years ago

This makes sense to me. DataFrames are optimized for efficient operations on their chunk of memory so there may be some memory re-allocation going on under the hood whenever df.append is done.

As you say though, network speed limitations are going to be by far the biggest speed factor. My primary concern is making the code maximally approachable. I just pushed a test on the 7yl4r/mof-consistency branch that demonstrates the issue. I am ok with returning either a dict or a pd.df, but it needs to be the same type of object regardless of mof=True or mof=False.

ayushanand18 commented 2 years ago

This makes sense to me. DataFrames are optimized for efficient operations on their chunk of memory so there may be some memory re-allocation going on under the hood whenever df.append is done.

There are strong reasons for this to be true. I think 10% of the time that we save wouldn't be a nice tradeoff against simplicity, consistency and ease of operation. People will eventually need to go deep inside the notebooks/docs to get that thing right with a dict output and this time which they would spend reading would be saved by a dataframe output.

Further, for faster use we might include a json parameter in the function so that user gets the privilege to decide for the kind of output they want. But I think that we might implement it a bit later on.

ayushanand18 commented 2 years ago

I have a PR ready for this issue which implements the fix, it returns all output as a pandas dataframe.