occ.search(...,mof=True) & occ.search(...,mof=False) are inconsistent

7yl4r commented 2 years ago

The output of occ.search is structured differently if mof is True.

occ.search(scientificname="Egregia menziesii", hasextensions="MeasurementOrFacts")

vs

occ.search(scientificname="Egregia menziesii", mof=True, hasextensions="MeasurementOrFacts")

When mof=False, the total number of results is included in a column, but when mof=True it is as if res = res["results"] has been performed.

The output should not change in this way when mof is True vs False.

ayushanand18 commented 2 years ago

The number of records might change because for one single id there are many MoF records present each pertaining to different measurements like substrate, cover and relief.

For the first output when mof=False the output is weird because we need to pass only the results into the pd.DataFrame to unflatten it like,

res = occ.search(scientificname="Egregia menziesii", hasextensions="MeasurementOrFacts")
df = pd.DataFrame(res["results"])

because pyobis already does this in the background when mof=True but not in other cases.

So when we unflatten those nested mof records and then perform an inner join to copy other parameters values from the original records on the id field, we get different number of rows. But since we performed an inner join, this should not create exactly duplicate records until or unless they actually are, I suppose.

ayushanand18 commented 2 years ago

I kept the output of response in JSON with "total" and "results" parameters because I thought pandas might not be the only tool researchers would use to analyse data. Also, we utilise the "total" parameter to paginate in the occurrences.search() function.

ayushanand18 commented 2 years ago

I did an interesting experiment with occurrences->search() function. I did this on two sets of conditions. Condition	Description	Time taken while fetching records (%)
1	occ.search returns JSON output (we are appending new responses on JSON object)	100%
2	occ.search returns pandas DataFrame (we are appending new responses on a response DataFrame	110.18%

Test Environment:

network speed was kept constant by throttling it to 3G connection
test conducted on the exact same version of the packages
the dev environment was exactly the same
I fetched atleast 62k records at once so as to minimize the effect of time taken by other marginal processes like response processing, etc.

Some sources of error:

there might be some marginal deflections due to the state of processor, and/or the processes that were already running on the system.

So, even though pandas dataframe is very much useful to us, it might be slower than expected. Although the actual speed of acquiring data would be heavily dependent on the network strength.

7yl4r commented 2 years ago

This makes sense to me. DataFrames are optimized for efficient operations on their chunk of memory so there may be some memory re-allocation going on under the hood whenever df.append is done.

As you say though, network speed limitations are going to be by far the biggest speed factor. My primary concern is making the code maximally approachable. I just pushed a test on the 7yl4r/mof-consistency branch that demonstrates the issue. I am ok with returning either a dict or a pd.df, but it needs to be the same type of object regardless of mof=True or mof=False.

ayushanand18 commented 2 years ago

This makes sense to me. DataFrames are optimized for efficient operations on their chunk of memory so there may be some memory re-allocation going on under the hood whenever df.append is done.

There are strong reasons for this to be true. I think 10% of the time that we save wouldn't be a nice tradeoff against simplicity, consistency and ease of operation. People will eventually need to go deep inside the notebooks/docs to get that thing right with a dict output and this time which they would spend reading would be saved by a dataframe output.

Further, for faster use we might include a json parameter in the function so that user gets the privilege to decide for the kind of output they want. But I think that we might implement it a bit later on.

ayushanand18 commented 2 years ago

I have a PR ready for this issue which implements the fix, it returns all output as a pandas dataframe.

iobis / pyobis

occ.search(...,mof=True) & occ.search(...,mof=False) are inconsistent #61