iobis / pyobis

OBIS Python client
https://iobis.github.io/pyobis
MIT License

enables request size and time estimation, fixes "missing 'id' error" in non-paginated queries. #127

Closed: ayushanand18 closed this PR 1 year ago

ayushanand18 commented 1 year ago

Overview

Resolves #125 by adding an optional paginate parameter when executing the occurrences.search() function.

  1. If fields is not None and paginate is set to True, it adds 'id' to fields.
  2. If paginate is set to False, it fetches only the first 10k records.
  3. Further speed improvements by fetching 10k records per request, up from 5k.
  4. Minor refactoring of the code to reduce redundancy in data assignments. A usage sketch of items 1-2 follows below.
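
A minimal usage sketch of the behavior above. This assumes the paginate keyword is accepted when building the query (as the summary suggests), together with the build/execute split discussed later in this thread:

```python
from pyobis import occurrences

# Default: paginate through the full result set; with custom fields,
# 'id' is added automatically so pagination can work (item 1).
query = occurrences.search(
    scientificname="Mola mola",
    fields=["scientificName", "decimalLatitude", "decimalLongitude"],
)
query.execute()
full = query.data

# paginate=False: fetch only the first 10k records (item 2).
quick = occurrences.search(scientificname="Mola mola", paginate=False)
quick.execute()
first_10k = quick.data
```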

Thanks!

ayushanand18 commented 1 year ago

Waiting for #124

ocefpaf commented 1 year ago

@ayushanand18 this is a similar problem we have with ERDDAP clients and pretty much any other package that hits an API that was created to build HTML pages, and b/c of that is paginated.

In a way, the non-HTML client always wants the full answer. If we cannot provide that by default due to API limitations, we should at least warn the user and point them to how to make the request that will return the full answer.

TL;DR: adding the keyword and docs and making the paginated one the default is a design that copies the HTML page return, but that is rarely what we want.

Does that make sense or am I way off here?

ayushanand18 commented 1 year ago

> In a way, the non-HTML client always wants the full answer. If we cannot provide that by default due to API limitations, we should at least warn the user and point them to how to make the request that will return the full answer.
>
> TL;DR: adding the keyword and docs and making the paginated one the default is a design that copies the HTML page return, but that is rarely what we want.

I don't think I fully followed. I assume you are of the opinion that pagination should be turned off by default and that we should throw a warning about incomplete results, while also enabling users to get paginated results by specifying it explicitly or implementing it themselves.

With that assumption, I think I am not the best person to have an opinion, because most users of this package may not know a lot of programming, and implementing pagination themselves would be difficult. Maybe @MathewBiddle has some insights, since he is a real user of this package.

Maybe I misinterpreted you.

MathewBiddle commented 1 year ago

I agree with

> In a way, the non-HTML client always wants the full answer. If we cannot provide that by default due to API limitations, we should at least warn the user and point them to how to make the request that will return the full answer.

In my experience working with "data getters" I always want the full response. That's the beauty of using these tools: if you have some coding skill, you don't have to go page by page downloading the data you're after. One call with the appropriate package should get you everything you're looking for, IMHO.

ocefpaf commented 1 year ago

> I don't think I fully followed. I assume you are of the opinion that pagination should be turned off by default and that we should throw a warning about incomplete results, while also enabling users to get paginated results by specifying it explicitly or implementing it themselves.

Sorry, I was not clear. The tricky part is striking the right balance in what we can do to provide a better experience for the user.

  1. If we can easily return all the records and hide/eliminate the pagination altogether, that is the best option;
  2. Let's say we cannot, b/c that is costly to the API or fetching multiple pages requires complex code that should not be in pyobis; then we should warn the users, and the message must explain how they can request "the next page." This is the second-best option IMO;
  3. What you have now. Code-wise it is fine, it doesn't implement a breaking change, and it is well documented. However, users may still be taken by surprise if they don't read the docs.
  4. A mix of 1-2 where the response is a lazy object that doesn't fetch anything but the bare-bones metadata about the answer. Then the user can inspect it and decide whether they want to download everything or loop over it step by step to avoid stressing the server. (This one seems like a good design and compromise, but it is a horrible user experience IMO.) A rough sketch follows below.
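
For illustration, a bare-bones sketch of what option 4 could look like. The class and its methods are hypothetical, not pyobis API; it relies only on the OBIS v3 occurrence endpoint returning a total count alongside results:

```python
import requests

API = "https://api.obis.org/v3/occurrence"

class LazyQuery:
    """Builds a query but defers downloading; only metadata is fetched up front."""

    def __init__(self, **params):
        self.params = params
        # A size-1 request is cheap; the response carries the total match count.
        self.total = requests.get(API, params={**params, "size": 1}).json()["total"]

    def fetch(self, size=10_000):
        """Explicitly download up to `size` records, at the user's request."""
        return requests.get(API, params={**self.params, "size": size}).json()["results"]

q = LazyQuery(scientificname="Mola mola")
print(f"{q.total} matching records")  # inspect the size before committing
data = q.fetch()
```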

Does that make sense @ayushanand18?

ayushanand18 commented 1 year ago

That makes a lot of sense, thanks for the explanations!

The primary motivation behind separating query building from query execution was saving the user's time. In my opinion, a nice tradeoff could be fetching the top 1k records at first, and providing the ability for the user to enable full pagination by explicitly specifying a flag, say fast_fetch=False, while building the query. When this flag is True (the default), it just fetches the top 1k records, and we throw a warning about what is happening and that they should use either of the two methods to ensure the complete data is downloaded.

To allow power users to fetch every query completely without taking the extra pain of specifying an extra variable, we can let them set a global variable for the package, say pyobis.config['FAST_FETCH'] = False (see the sketch below). This config idea could even help us when we start caching or downloading data directly onto the disk for users who want to work offline or need fast access later.
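
A self-contained sketch of how the two proposed knobs could interact; both fast_fetch and the config dictionary are hypothetical here, not existing pyobis API:

```python
# Hypothetical: a package-level default plus a per-call override.
config = {"FAST_FETCH": True}  # default: quick, truncated fetch

def search(fast_fetch=None, **params):
    # The per-call flag wins; otherwise fall back to the global config.
    fast = config["FAST_FETCH"] if fast_fetch is None else fast_fetch
    if fast:
        print("warning: fetching only the top 1k records; pass "
              "fast_fetch=False or set config['FAST_FETCH'] = False "
              "to download the complete result")
    # ... build the request with params, paginating only when fast is False ...
```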

I keep leaning toward putting the pain of pagination onto pyobis, because that ensures users have to put in a lot less effort. I'm not sure if this is a good approach, but I'm open to more of your thoughts.

ocefpaf commented 1 year ago

> The primary motivation behind separating query building from query execution was saving the user's time. In my opinion, a nice tradeoff could be fetching the top 1k records at first, and providing the ability for the user to enable full pagination by explicitly specifying a flag, say fast_fetch=False, while building the query. When this flag is True (the default), it just fetches the top 1k records, and we throw a warning about what is happening and that they should use either of the two methods to ensure the complete data is downloaded.

This is too complex for the users I imagine will be using pyobis. Maybe we need to take a step back and re-think how users create their requests. For example:

  1. Is it easy to shoot yourself in the foot and create a large/slow request?
  2. Or is it obvious that your request will be a big one when you create it?

If it is 1, we need some sort of hand holding, but if it is 2, the user should just figure out that their request will be unwieldy, and the best we can do is document it.

Now, if it is 1... we should do the easiest thing possible. Maybe the lazy build/fetch approach is not too bad in that case. Like: I build my query, inspect its size, and then do a query.fetch() after assessing the size. This is quite common in oceanography b/c of the big datasets.

ayushanand18 commented 1 year ago

> Or is it obvious that your request will be a big one when you create it?

Users of pyobis will probably be existing users of the OBIS/GBIF ecosystem, so they would know that the datasets are large. But they wouldn't know whether the particular query they built is a large one or not.

> Now, if it is 1... we should do the easiest thing possible. Maybe the lazy build/fetch approach is not too bad in that case. Like: I build my query, inspect its size, and then do a query.fetch() after assessing the size. This is quite common in oceanography b/c of the big datasets.

I like your idea here, and maybe we can implement this. While building the query we can fetch the total number of records and report it for the user (or even provide an estimated request completion time with some calculations). Then, when we paginate, we can reuse that total to correctly run the iterations.
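
A sketch of that idea, assuming the OBIS v3 occurrence endpoint (which reports a total count with any response) and a purely illustrative per-request latency constant:

```python
import math
import requests

API = "https://api.obis.org/v3/occurrence"
PAGE = 10_000        # records fetched per request, per this PR
SECS_PER_REQ = 2.0   # assumed average round-trip time, illustration only

def report_query_size(params):
    # A size-1 request cheaply returns the total matching-record count.
    total = requests.get(API, params={**params, "size": 1}).json()["total"]
    n_requests = math.ceil(total / PAGE)
    print(f"~{total:,} records in ~{n_requests} requests "
          f"(~{n_requests * SECS_PER_REQ:.0f}s estimated)")
    return total

total = report_query_size({"scientificname": "Mola mola"})
```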

ocefpaf commented 1 year ago

> Then, when we paginate, we can reuse that total to correctly run the iterations.

Maybe fetch_next() and fetch_all() methods would transfer the management of the download to the user. (I'm sure 99.99% will just call fetch_all() though :-)
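
An interface sketch of that split (class and method names hypothetical), using the OBIS v3 after parameter for keyset pagination on the record id, which is why the 'id' field matters to this PR:

```python
import requests

API = "https://api.obis.org/v3/occurrence"

class OccQuery:
    """Hypothetical sketch of the fetch_next()/fetch_all() idea."""

    def __init__(self, **params):
        self.params = params
        self._after = None   # id of the last record seen so far
        self._done = False

    def fetch_next(self, size=10_000):
        """Download one page; the user controls the pace of the download."""
        if self._done:
            return []
        p = {**self.params, "size": size}
        if self._after is not None:
            p["after"] = self._after  # resume after the last seen id
        page = requests.get(API, params=p).json()["results"]
        if page:
            self._after = page[-1]["id"]
        else:
            self._done = True
        return page

    def fetch_all(self):
        """Drain fetch_next() in a loop; what most users will reach for."""
        out = []
        while (page := self.fetch_next()):
            out.extend(page)
        return out
```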

ayushanand18 commented 1 year ago

> Maybe fetch_next() and fetch_all() methods would transfer the management of the download to the user. (I'm sure 99.99% will just call fetch_all() though :-)

Sometimes I feel like we over-engineer things when we give this much power to the users (●'◡'●). I would prefer to go with fetch_all() by default, for the reason you stated.

7yl4r commented 1 year ago

Great convo here. I support the idea that we should build for the typical user and not worry about advanced features until a user asks for them. fetch_all() by default sounds like the right approach to me as well.

ayushanand18 commented 1 year ago

The new changes:

  1. The user now gets the total estimated number of records in their query (while building the query itself).
  2. Along with the estimated time to request completion. The estimate is nearly accurate below 100k records; beyond that, it can look about 10x the actual time, because small changes in the fraction used for the estimate are magnified at that scale. I did not want to put an elaborate algorithm behind the calculation, so I kept the math simple.
  3. There is no longer a need to enable or disable pagination to avoid errors; the package now handles it itself. If custom fields are used in the query, the user gets a warning when id is missing. If id is omitted deliberately, pagination does not occur and the top 10k records are returned, saving round-trip time. A sketch of this guard follows below.
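
A sketch of the guard described in item 3; the helper name is hypothetical:

```python
import warnings

def can_paginate(fields):
    """Warn when custom fields omit 'id'; without it, pagination by id
    is impossible, so only the first 10k records are returned."""
    if fields is not None and "id" not in fields:
        warnings.warn(
            "'id' is missing from fields, so pagination is disabled and "
            "only the top 10k records will be fetched; add 'id' to fields "
            "to download the complete result"
        )
        return False
    return True
```
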
ayushanand18 commented 1 year ago

Just adding a friendly reminder on this PR for your review :) @ocefpaf

ayushanand18 commented 1 year ago

> Apologies @ayushanand18. I assumed that with one approving review you did not want a second round of feedback. Everything looks great. I would go with f-strings everywhere instead of .format, but that is really minor and not necessary for this PR.

No issues, I'm always open to your feedback. I'll create another PR with the doc changes + changelog and add the f-strings over there.
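
For reference, the suggested style change amounts to, e.g.:

```python
total = 12345
msg = "Fetching {} records".format(total)  # current .format style
msg = f"Fetching {total} records"          # equivalent f-string
```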