Closed: ayushanand18 closed this 1 year ago
Waiting for #124
@ayushanand18 this is a problem similar to the one we have with ERDDAP clients and pretty much any other package that hits an API that was created to build HTML pages and, b/c of that, is paginated.
In a way, the non-HTML client always wants the full answer. If we cannot provide that by default due to API limitations, we should at least warn the user and point them to how to make the request that will return the full answer.
TL;DR adding the keyword, docs, and making the paginated one the default is a design that copies the HTML page return, but that is rarely what we want.
Does that make sense or am I way off here?
> In a way, the non-HTML client always wants the full answer. If we cannot provide that by default due to API limitations, we should at least warn the user and point them to how to make the request that will return the full answer.
>
> TL;DR adding the keyword, docs, and making the paginated one the default is a design that copies the HTML page return, but that is rarely what we want.
I don't think I got all of it. I assume you are of the opinion that pagination should be turned off by default and we should throw a warning about incomplete results, while also enabling users to get paginated results by specifying it explicitly or implementing it themselves.
With that assumption, I think I will not be the best person to have an opinion, because most of the users of this package may not know a lot of programming, and implementing pagination themselves would be difficult. Maybe @MathewBiddle has some insights, because he is a real user of this package.
Maybe I interpreted you wrong.
I agree with
> In a way, the non-HTML client always wants the full answer. If we cannot provide that by default due to API limitations, we should at least warn the user and point them to how to make the request that will return the full answer.
In my experience working with "data getters," I always want the full response. That's the beauty of using these tools: if you have some coding skill, you don't have to go page by page downloading the data you're after. One call with the appropriate package should get you everything you're looking for. IMHO.
> I don't think I got all of it. I assume you are of the opinion that pagination should be turned off by default and we should throw a warning about incomplete results, while also enabling users to get paginated results by specifying it explicitly or implementing it themselves.
Sorry, I was not clear. The tricky part is the balance: what can we do to provide a better experience to the user?
Does that make sense @ayushanand18?
That makes a lot of sense, thanks for the explanations!
The primary motivation behind separating query building from query execution was saving the user's time. In my opinion, a nice tradeoff could be fetching the top 1k records at first, and providing the ability for the user to enable pagination by explicitly specifying a flag, say `fast_fetch=False`, while query building. When this flag is `True`, it just fetches the top 1k records. When this flag is `False` (by default), we will throw a warning about what they are doing and that they should use either of the two methods to ensure complete data is downloaded.
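The flag semantics in the thread are a bit ambiguous, so here is one plausible reading as a sketch: the quick path returns only the first page and warns about incompleteness, while the slow path walks every page. `fetch_page`, the 1k page size, and the fake 2,500-record backend are all illustrative assumptions, not pyobis code.

```python
import warnings

PAGE_SIZE = 1000  # the "top 1k records" from the discussion


def fetch_page(query, offset, size):
    # Stand-in for the real HTTP call; serves a fake 2,500-record result set.
    total = 2500
    return list(range(offset, min(offset + size, total)))


def search(query, fast_fetch=True):
    """Return records for `query`, paginating only when asked to."""
    if fast_fetch:
        # Quick path: a single request for the first page only, plus a warning
        # pointing the user at the way to get the full answer.
        warnings.warn(
            f"fast_fetch=True returns only the first {PAGE_SIZE} records; "
            "pass fast_fetch=False to paginate through everything."
        )
        return fetch_page(query, offset=0, size=PAGE_SIZE)
    # Full path: keep requesting pages until the server runs dry.
    records, offset = [], 0
    while True:
        page = fetch_page(query, offset=offset, size=PAGE_SIZE)
        records.extend(page)
        if len(page) < PAGE_SIZE:
            return records
        offset += PAGE_SIZE
```

Warning on the *incomplete* path (rather than the paginated one) matches the earlier point that the non-HTML client should at least be told when it is not getting the full answer.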
To allow power users to fetch every query completely without the extra pain of specifying an extra variable, we can let them set a package-level global, say `pyobis.config['FAST_FETCH'] = False`. This `config` idea could even help us when we implement caching or downloading data directly onto the disk for users who want to work offline or need fast access later.
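A minimal sketch of such a package-level config, assuming a plain module-level dict; the key name, default value, and the resolver helper are all invented for illustration, not real pyobis settings:

```python
# Hypothetical package-level defaults; a per-call argument always wins.
config = {"FAST_FETCH": True}


def resolve_fast_fetch(fast_fetch=None):
    """Return the effective flag: explicit argument, else the global default."""
    return config["FAST_FETCH"] if fast_fetch is None else fast_fetch
```

A power user would then set `config["FAST_FETCH"] = False` once and never pass the keyword again, while everyone else keeps the out-of-the-box behaviour.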
I will always lean toward moving the pain of pagination onto `pyobis`, because that ensures users have to put in a lot less effort. I'm not sure if this is a good approach, but I'm open to more of your thoughts.
> The primary motivation behind separating query building from query execution was saving the user's time. In my opinion, a nice tradeoff could be fetching the top 1k records at first, and providing the ability for the user to enable pagination by explicitly specifying a flag, say `fast_fetch=False`, while query building. When this flag is `True`, it just fetches the top 1k records. When this flag is `False` (by default), we will throw a warning about what they are doing and that they should use either of the two methods to ensure complete data is downloaded.
This is too complex for the users I imagine will be using pyobis. Maybe we need to take a step back and re-think how users create their requests. For example:

1. Is it common to create a request without knowing it will be a big one?
2. Or is it obvious to know that your request will be a big one when you create it?

If 1, we need some sort of hand holding, but if it is 2, the user should just figure out that their request will be unwieldy. The best we can do is document it.
Now, if it is 1... we should do the easiest thing possible. Maybe the lazy build/fetch approach is not too bad in that case. Like, I build my query, inspect its size, and then do a `query.fetch()` after assessing the size. This is quite common in oceanography b/c of the big datasets.
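The build/inspect/fetch flow could look roughly like this; the `Query` class and both helpers are invented stand-ins, not pyobis internals:

```python
# Illustrative lazy query object: building is cheap, fetching is explicit.
class Query:
    def __init__(self, params):
        self.params = params
        self._total = None

    @property
    def total(self):
        # Inspecting the size costs only a lightweight count request,
        # cached after the first lookup.
        if self._total is None:
            self._total = count_records(self.params)
        return self._total

    def fetch(self):
        # The expensive download runs only when explicitly requested.
        return download_records(self.params, self.total)


def count_records(params):
    return 42  # stand-in for a cheap "how many records?" API call


def download_records(params, total):
    return list(range(total))  # stand-in for the real paginated download
```

Usage would be: build the query, glance at `query.total` (and maybe an estimated completion time), and only then decide whether `query.fetch()` is worth it.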
> Or is it obvious to know that your request will be a big one when you create it?
Maybe users who are using `pyobis` will be existing users in the OBIS/GBIF ecosystem, and they would know that datasets are large. But they wouldn't know whether the query they built would be large or not.
> Now, if it is 1... we should do the easiest thing possible. Maybe the lazy build/fetch approach is not too bad in that case. Like, I build my query, inspect its size, and then do a `query.fetch()` after assessing the size. This is quite common in oceanography b/c of the big datasets.
I like your idea here, and maybe we can implement this. While building the query we can just fetch the total number of records and show it to the user (or even provide an estimated request completion time with some calculations). And then, while we paginate, we already use the total figure to correctly run the iterations.
> And then, while we paginate, we already use the total figure to correctly run the iterations.
Maybe `fetch_next()` and `fetch_all()` methods would transfer the management of the download to the user. (I'm sure 99.99% will just call `fetch_all()` though :-)
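A sketch of how the two methods might share one cursor; the in-memory "pages" stand in for server responses, and all names here are illustrative rather than actual pyobis API:

```python
# Result object holding a pagination cursor: fetch_next() advances one page,
# fetch_all() drains whatever is left.
class PagedResult:
    def __init__(self, pages):
        self._pages = pages  # pretend each inner list is one server page
        self._cursor = 0

    def fetch_next(self):
        """Return the next page of records, or None when exhausted."""
        if self._cursor >= len(self._pages):
            return None
        page = self._pages[self._cursor]
        self._cursor += 1
        return page

    def fetch_all(self):
        """Drain every remaining page into one flat list."""
        records = []
        while (page := self.fetch_next()) is not None:
            records.extend(page)
        return records
```

Because both methods share the cursor, a user can peek at a page or two with `fetch_next()` and then hand the rest over to `fetch_all()` without re-downloading anything.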
> Maybe `fetch_next()` and `fetch_all()` methods would transfer the management of the download to the user. (I'm sure 99.99% will just call `fetch_all()` though :-)
Sometimes I feel like we overengineer things when giving this much power to the users (●'◡'●). I would prefer to go with `fetch_all` by default, for the reason you stated.
Great convo here. I support the idea that we should build for the typical user and not worry about advanced features until a user asks for them. `fetch_all` by default sounds like the right approach to me also.
The new changes add `id` to `fields` when it is missing. If the `id` is omitted deliberately, then pagination does not occur and the top 10k records are sent back, thus saving round-trip time.

Just adding a friendly reminder on this PR for your review :) @ocefpaf
Apologies @ayushanand18. I assumed that with one approving review you did not want a second round of feedback. Everything looks great. I would go all f-strings instead of `.format`, but that is really minor and not necessary for this PR.
No issues, I'll always be open to your feedback. I'll create another PR with the doc changes + changelog, and add f-strings over there.
Overview

Resolves #125 by adding an optional `paginate` parameter to the `occurrences.search()` function.

- If `fields` is not None and `paginate` is set to `True`, it adds 'id' to `fields`.
- If `paginate` is set to `False`, it fetches only the first 10k records.

Thanks!
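The 'id' handling described above can be sketched as a small helper; the function name is an assumption for illustration, not the actual pyobis implementation:

```python
# Pagination relies on the record 'id' as its cursor, so it must be part of
# any explicit field list whenever paginate=True.
def normalize_fields(fields, paginate):
    """Ensure 'id' is requested whenever pagination is on."""
    if fields is not None and paginate and "id" not in fields:
        return fields + ["id"]
    return fields
```

With `paginate=False` (or no explicit field list) the request passes through untouched, which is what makes the single-shot 10k fetch cheaper.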