scottstanie / sentineleof

Download Sentinel 1 precise orbit files
MIT License
97 stars 18 forks source link

Factorize client interface and provide "download all in range" #64

Open LucHermitte opened 2 months ago

LucHermitte commented 2 months ago

This PR is quite big, I had to make some choices while trying to leave the current API as it is. More on the topic below.

The main changes I propose are:

A few smaller changes are included:

Finally, there are things that are half-way done as I didn't want to completely modify the current API in case you were explicitly relying on it.

LucHermitte commented 1 month ago

Also. I've curated the cassettes from all authentication related secrets.

The strange design in test_eof around cdse_access_token is to support the recording and the usage of cassettes with or without 2FA activated on CDSE account.

scottstanie commented 1 month ago

This is quite a PR @LucHermitte ! it might take a a little awhile to get through it all, but it looks like a lot of improvements. my only initial questions are

do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?

LucHermitte commented 1 month ago

it might take a a little awhile to get through it all

Of course. I know. :(

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used one in the code (and once in a test)

In the tests definitively. I guess set should be enough. In the code however it's quite practical as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also may be there is a way to sort on one criteria and apply a kind of uniq function on another criteria (given a priority order). If you know/see of a standard way of doing it, we could easily remove the dependency here. Any idea?

I've been reminded by ASF that they now have an easier, password-free storage of orbits on S3 [...] do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?

Quite certainly. The idea behind the changes was to keep a single and unique interface which is quite simple:

  1. search according to one criteria or another
  2. download the references found

It should even be possible to add a new S3ASFClient that has the same interface and which doesn't uses a cache. This could be a wrapper around boto3.

Also, IIRC, I've seen another REST based interface to request references to filenames on earthdata server. As I didn't want to bring more changes, I didn't try to replace the current approach. I didn't have much time to investigate if there is indeed another way to search for files according to whatever time and mission based criteria.

scottstanie commented 1 month ago

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used one in the code (and once in a test)

In the tests definitively. I guess set should be enough. In the code however it's quite practical as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also may be there is a way to sort on one criteria and apply a kind of uniq function on another criteria (given a priority order). If you know/see of a standard way of doing it, we could easily remove the dependency here. Any idea?

what about something like this using groupby (not fully tested)

    candidates = [
        item
        for item in data
        if item.start_time <= (t0 - margin0) and item.stop_time >= (t1 + margin1)
    ]

    if not candidates:
        raise ValidityError(
            "none of the input products completely covers the requested "
            "time interval: [t0={}, t1={}]".format(t0, t1)
        )

    # Sort candidates by all attributes except created_time
    candidates.sort(key=lambda x: (x.start_time, x.stop_time, x.id))

    # Group by everything except created_time and select the one with the latest created_time
    result = []
    for _, group in groupby(candidates, key=lambda x: (x.start_time, x.stop_time, x.id)):
        result.append(max(group, key=lambda x: x.created_time))

    return result

(unsure if that's the exact key we want to group on)

LucHermitte commented 1 month ago

I've removed the dependency to sortedcontainers. Using sorted() instead of SortedList is a bit slower on my machine : 40ms. This is negligible.

Regarding my manual implementation of uniq, it does it's job is even less time (2ms). As I'm not used to group_by and as there is no id field in SentinelOrbit, I let it as it is for now.

(I can join my flawed benchmarking test if you're interested -- I don't think it makes sense in sentineleof code base)