fetch_all argument for dataset search methods

SpacemanPaul commented 4 months ago

Reason for this pull request

Add fetch_all: bool = False argument to all (non-deprecated) dataset search methods, as per EP-13.

Proposed changes

fetch_all: bool argument (defaulting to False) added to all non-deprecated dataset search methods. When true, all search results are gathered and returned in a fully instantiated list. When false (the default), the method acts as a generator and yields a result at a time.

Notes:

This typically requires wrapper methods as simply having a yield statement in a method/function makes it act as a generator - you can't decide dynamically whether to yield or return from a function. Basic form looks like:

def method(self, ..., fetch_all: bool = False, **query):
_results = self._method(...., **query)
if fetch_all:
    return list(_results)   # return as instantiated list
else:
    return _results         # return as generator.

The search_by_products() method has two layers of lazy evaluation (in generator mode, yields tuples containing a product and a nested generator that yields datasets. The fetch_all=True version fully instantiates both layers.
- [x] Tests added / passed
- [x] Fully documented, including docs/about/whats_new.rst for all changes

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (2ee5434) 85.65% compared to head (8ee99fb) 85.62%. Report is 2 commits behind head on develop-1.9.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## develop-1.9 #1553 +/- ## =============================================== - Coverage 85.65% 85.62% -0.04% =============================================== Files 140 140 Lines 15175 15279 +104 =============================================== + Hits 12998 13082 +84 - Misses 2177 2197 +20 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Ariana-B commented 4 months ago

It would be really neat if this behaviour could be implemented via a decorator rather than multiple wrapper functions that are all but identical

SpacemanPaul commented 4 months ago

It would be really neat if this behaviour could be implemented via a decorator rather than multiple wrapper functions that are all but identical

Yeah, I thought about that but I couldn't see how to do it without significantly complicating the contract between the abstract base class and the implementations. Definitely something that could be revisited though.

omad commented 4 months ago

I think we're better off without this. The added complexity both internally and in the API isn't worth it. The extra argument requires more explanation and make it less clear what the functions do.

Python is full of iterator functions (especially range() and enumerate()), that users need to call list() on if they want it fully loaded into memory. I think we should do the same with search results.

We could include examples in the documentation of using list() on the results, if we think that's a valuable thing for people to do. (I'm not sure if it is.)

omad commented 4 months ago

search_by_products() is quite a different method, so maybe it could be different. I'm not sure how and how much it is used.

SpacemanPaul commented 4 months ago

I think we're better off without this.

Yeah, I can see that. The main use case as far as I can see is internal - for tests. It's much easier to check test results if they are instantiated into a list.

The main justification is the deprecation/removal of dc.index.datasets.search_eager() - which is just an alias for list(dc.index.datasets.search() - I find the name "search_eager"`very unclear and ambiguous, but it has been heavily used historically in tests. I find this way of doing the same thing more intuitive, as well as being more generally applicable.

Having said that, I don't mind dropping it - except that I built the other more useful PR on top of this one, so dropping this one would cause me some short term pain.

omad commented 4 months ago

Yeah, I can see that. The main use case as far as I can see is internal - for tests. It's much easier to check test results if they are instantiated into a list.

The main justification is the deprecation/removal of dc.index.datasets.search_eager() - which is just an alias for list(dc.index.datasets.search() - I find the name "search_eager"`very unclear and ambiguous, but it has been heavily used historically in tests. I find this way of doing the same thing more intuitive, as well as being more generally applicable.

I completely support removing search_eager() for being confusing, and for simplifying the API.

If it's mostly for in tests, I even more think that we should be calling list() explicitly. It's such a common idiom and makes it clearer what is happening. It's also helpful for users of the API as documentation.

Having said that, I don't mind dropping it - except that I built the other more useful PR on top of this one, so dropping this one would cause me some short term pain.

Sorry!

I do think short term pain over long term pain and support though. It's hopefully not too hard to remove from the next PR.

opendatacube / datacube-core