Open LucHermitte opened 2 months ago
Also, I've scrubbed all authentication-related secrets from the cassettes.
The strange design in test_eof around cdse_access_token is there to support recording and replaying cassettes with or without 2FA activated on the CDSE account.
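This is quite a PR @LucHermitte! It might take a little while to get through it all, but it looks like a lot of improvements. My only initial questions are:
- I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used once in the code (and once in a test)
- I've been reminded by ASF that they now have an easier, password-free storage of orbits on S3:
aws --no-sign-request s3 ls s3://s1-orbits/AUX_POEORB/ | head -3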
2024-07-26 22:04:55 4409614 S1A_OPER_AUX_POEORB_OPOD_20210203T122423_V20210113T225942_20210115T005942.EOF
2024-07-26 22:05:06 4409601 S1A_OPER_AUX_POEORB_OPOD_20210204T122413_V20210114T225942_20210116T005942.EOF
2024-07-26 22:06:36 4409874 S1A_OPER_AUX_POEORB_OPOD_20210205T122416_V20210115T225942_20210117T005942.EOF
do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?
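For reasoning about which of those S3 objects covers a given acquisition, the listing can be parsed locally. The following is only a sketch: OrbitName and parse_listing_line are hypothetical names, not part of sentineleof, and it assumes the filename layout visible in the three lines above (creation date first, then the V-prefixed validity start and the validity stop).

```python
from datetime import datetime
from typing import NamedTuple

_FMT = "%Y%m%dT%H%M%S"


class OrbitName(NamedTuple):
    """Fields recovered from an orbit filename (hypothetical helper type)."""
    mission: str
    orbit_type: str
    created: datetime
    start: datetime
    stop: datetime


def parse_listing_line(line: str) -> OrbitName:
    """Parse one line of `aws s3 ls` output (date, time, size, filename)."""
    filename = line.split()[-1]
    stem = filename.rsplit(".", 1)[0]  # drop the .EOF extension
    parts = stem.split("_")
    # Assumed layout: S1A, OPER, AUX, POEORB, OPOD, <created>, V<start>, <stop>
    return OrbitName(
        mission=parts[0],
        orbit_type=parts[3],
        created=datetime.strptime(parts[5], _FMT),
        start=datetime.strptime(parts[6][1:], _FMT),  # strip the leading "V"
        stop=datetime.strptime(parts[7], _FMT),
    )
```

Fetching the objects themselves (for instance through boto3 with unsigned requests) is not shown here.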
it might take a little while to get through it all
Of course. I know. :(
I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used once in the code (and once in a test)
In the tests, definitely. I guess set should be enough. In the code, however, it's quite practical, as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also, maybe there is a way to sort on one criterion and apply a kind of uniq function on another criterion (given a priority order). If you know of a standard way of doing it, we could easily remove the dependency here. Any idea?
I've been reminded by ASF that they now have an easier, password-free storage of orbits on S3 [...] do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?
Quite certainly. The idea behind the changes was to keep a single and unique interface which is quite simple:
It should even be possible to add a new S3ASFClient that has the same interface and doesn't use a cache. This could be a wrapper around boto3.
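To make the idea concrete, here is a rough sketch of how such a shared interface could look. Everything in it is an assumption for illustration: the query(prefix) signature, the in-memory backend, and the download_one hook are hypothetical and are not the actual sentineleof classes. The only parts taken from this PR are the names Client, authenticate() and download_all.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List


class Session(ABC):
    """Authenticated handle returned by Client.authenticate() (hypothetical shape)."""

    @abstractmethod
    def download_one(self, name: str, output_dir: Path) -> Path:
        """Fetch a single orbit file; each backend specializes this."""

    def download_all(self, names: List[str], output_dir: Path) -> List[Path]:
        """Shared loop: create the directory, then delegate file by file."""
        output_dir.mkdir(parents=True, exist_ok=True)
        return [self.download_one(name, output_dir) for name in names]


class Client(ABC):
    """Common interface for querying and authenticating."""

    @abstractmethod
    def query(self, prefix: str) -> List[str]:
        """Return the names of matching orbit files (signature is an assumption)."""

    @abstractmethod
    def authenticate(self) -> Session:
        """Return a session able to download files."""


class InMemorySession(Session):
    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def download_one(self, name: str, output_dir: Path) -> Path:
        target = output_dir / name
        target.write_bytes(self._store[name])
        return target


class InMemoryClient(Client):
    """Stand-in for a cache-free, password-free backend such as an S3 bucket."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def query(self, prefix: str) -> List[str]:
        return sorted(n for n in self._store if n.startswith(prefix))

    def authenticate(self) -> Session:
        # Nothing to authenticate against; a real client would log in here.
        return InMemorySession(self._store)


if __name__ == "__main__":
    client = InMemoryClient({"S1A_demo.EOF": b"<orbit/>"})
    with tempfile.TemporaryDirectory() as tmp:
        paths = client.authenticate().download_all(client.query("S1A"), Path(tmp))
        print([p.name for p in paths])
```

A real S3-backed client would only need to replace the in-memory store with bucket calls; the calling code would stay the same.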
Also, IIRC, I've seen another REST-based interface for requesting references to filenames on the earthdata server. As I didn't want to bring in more changes, I didn't try to replace the current approach, and I didn't have much time to investigate whether there is indeed another way to search for files by time and mission criteria.
I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used once in the code (and once in a test)
In the tests, definitely. I guess set should be enough. In the code, however, it's quite practical, as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also, maybe there is a way to sort on one criterion and apply a kind of uniq function on another criterion (given a priority order). If you know of a standard way of doing it, we could easily remove the dependency here. Any idea?
What about something like this using groupby (not fully tested)?
    from itertools import groupby

    # Keep only the products that fully cover the requested interval
    candidates = [
        item
        for item in data
        if item.start_time <= (t0 - margin0) and item.stop_time >= (t1 + margin1)
    ]
    if not candidates:
        raise ValidityError(
            "none of the input products completely covers the requested "
            "time interval: [t0={}, t1={}]".format(t0, t1)
        )
    # Sort candidates by all attributes except created_time
    # (groupby only merges adjacent items, so the sort key must match the group key)
    candidates.sort(key=lambda x: (x.start_time, x.stop_time, x.id))
    # Group by everything except created_time and select the one with the latest created_time
    result = []
    for _, group in groupby(candidates, key=lambda x: (x.start_time, x.stop_time, x.id)):
        result.append(max(group, key=lambda x: x.created_time))
    return result
(unsure if that's the exact key we want to group on)
I've removed the dependency on sortedcontainers. Using sorted() instead of SortedList is a bit slower on my machine: 40ms. This is negligible.
Regarding my manual implementation of uniq, it does its job in even less time (2ms). As I'm not used to groupby and as there is no id field in SentinelOrbit, I've left it as it is for now.
(I can attach my flawed benchmarking test if you're interested -- I don't think it makes sense in the sentineleof code base)
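For reference, since SentinelOrbit has no id field, a groupby-based selection would need a different key. Below is a sketch under assumptions: Orbit is a hypothetical stand-in for SentinelOrbit, and grouping on (mission, start_time, stop_time) is a guess at the right key, not something settled in this thread. It sorts with sorted() and keeps the most recently created item of each group.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Orbit:
    """Hypothetical stand-in for SentinelOrbit, reduced to the fields used here."""
    mission: str
    start_time: datetime
    stop_time: datetime
    created_time: datetime


def _validity_key(orbit: Orbit) -> Tuple[str, datetime, datetime]:
    # Assumed grouping key: everything except created_time
    return (orbit.mission, orbit.start_time, orbit.stop_time)


def latest_per_validity(orbits: Iterable[Orbit]) -> List[Orbit]:
    """Keep, for each validity interval, the orbit with the latest created_time."""
    # groupby only merges adjacent items, so sort on the same key first
    ordered = sorted(orbits, key=_validity_key)
    return [
        max(group, key=lambda o: o.created_time)
        for _, group in groupby(ordered, key=_validity_key)
    ]
```

The result comes back ordered by the grouping key, which may or may not match the priority order the current uniq implementation relies on.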
This PR is quite big, and I had to make some choices while trying to leave the current API as it is. More on the topic below.
The main changes I propose are:
- [...] ASFClient and DataspaceClient by:
  - a Client that factors out a common interface for querying and authenticating
  - with the authenticate() method on each client, we obtain a session object that provides the single download_all method

A few smaller changes are included:
- The .netrc file is now either explicitly given as a parameter, or obtained from the $NETRC environment variable (a name already used by other tools), or defaulted to ~/.netrc
- Tests of the ASFClient API avoid querying the list of available files again and again. Instead, a cached baseline is used, and only one test queries the list of available files and makes sure the new list contains the cached baseline. EDIT: at this point I had missed the usage of VCR. Yet I think we don't need multiple, redundant slow requests when recording the cassettes.
- The cached EOF lists in ASFClient are no longer static and global, but local to each client instance -- this way we know exactly what can be expected from each test, and we can compare the EOF list in the cache with the list of remote EOFs available on the ASF server.
- Use of OrbitType to avoid typo errors, and assertions to silence pylint.

Finally, there are things that are half-way done, as I didn't want to completely modify the current API in case you were explicitly relying on it.
- The return type is List[str] in the ASF case and List[dict] in the Copernicus Dataspace case. Ideally, we would return a List[SentinelOrbit], which all download*() methods would know how to convert into something that makes sense at their level
- [...] Client level, but I'd prefer to know exactly how you want it done
- download_all should be at Session level, and call download_one, which would be specialized for each client.
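Coming back to the .netrc lookup order listed among the smaller changes (explicit parameter, then $NETRC, then ~/.netrc), a minimal sketch of that resolution could look like this. resolve_netrc is a hypothetical name chosen for illustration and is not necessarily how the PR implements it.

```python
import os
from pathlib import Path
from typing import Optional, Union


def resolve_netrc(netrc_file: Optional[Union[str, os.PathLike]] = None) -> Path:
    """Pick the .netrc location: explicit argument, then $NETRC, then ~/.netrc."""
    if netrc_file is not None:
        return Path(netrc_file).expanduser()
    env_value = os.environ.get("NETRC")
    if env_value:  # ignore an unset or empty variable
        return Path(env_value).expanduser()
    return Path.home() / ".netrc"
```

The resulting path can then be handed to the standard library parser, for example netrc.netrc(str(resolve_netrc())).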