nasa / python_cmr

Python library for querying the common metadata repository.
MIT License
24 stars 22 forks source link

Return query results as an iterator #37

Closed chuckwondo closed 1 week ago

chuckwondo commented 5 months ago

To give users better control over memory consumption, particularly when executing queries that produce large search results (hundreds of thousands, or even millions of granules), query results should not be obtained in their entirety and put into a list all at once before returning the results to the user. This can easily lead to "out of memory" errors for large results. Instead, an iterator (generator) should be returned so the user has the option to limit the number of results held in memory at any given moment.

Since this would be a breaking change to the existing get and get_all query methods, it might be worth defining a new method. However, given that this library has not reached a 1.0 release, making such a breaking change would not be unreasonable. Alternatively, a new method could be added while deprecating the existing ones. Either way, I recommend that returning an iterator being the only option, not an additional option, otherwise it can be confusing to users as to why there are multiple options. This also follows the Zen of Python's "There should be one-- and preferably only one --obvious way to do it." Further, returning an iterator does not prevent the user from populating a list themselves, if they wish.

There has been discussion amongst the users and maintainers of the earthaccess library about this very topic, and a number of us agreed that it makes more sense to implement this here rather than in earthaccess, which simply calls through to this library.

frankinspace commented 5 months ago

Agreed, generator return type would be preferred implementation.

I would concur with the plan to add new functions, mark the existing ones as deprecated indicating that the list-type functions will be removed in the 1.0.0 release. I know this library is used in a few different places but I don't have a good handle on everywhere it gets used; so I think it is important to have at least 1, maybe more, minor versions with the existing functions marked as deprecated before completely removing them.

briannapagan commented 5 months ago

We (GES DISC) are using this library for some operational scripts so would also vote for minor versions before completely removing. Was going to use today's community for earthaccess to discuss this.