Open andypbarrett opened 5 months ago
This has a smell to me. Why are we doing this post-process step to calculate if all granules are cloud-hosted based on the first granule, instead of having the DataGranule class set cloud_hosted for each granule at init-time? It would be great to put that rationale in a comment for future us.
If there's good rationale, or if we're looking for a short-term fix, I think we can add a quick:
if not results:
return []
to resolve the bug.
This has a smell to me. Why are we doing this post-process step to calculate if all granules are cloud-hosted based on the first granule, instead of having the DataGranule class set cloud_hosted for each granule at init-time? It would be great to put that rationale in a comment for future us.
If there's good rationale, or if we're looking for a short-term fix, I think we can add a quick:
if not results: return []
to resolve the bug.
I would suggest simply changing the assignment to cloud
to this to fix the IndexError
:
cloud = len(response) > 0 and self._is_cloud_hosted(response[0])
However, I think this reveals another bug. It is erroneous to assume that if response[0]
is (or is not) cloud hosted, then the rest of the granules in the response are (or are not) cloud hosted as well. If the query does not specify the cloud_hosted
parameter, then the results could possibly be a mix of cloud hosted and non-cloud hosted granules.
For example, it is valid to specify multiple collection concept IDs in a CMR granule search query. If one collection is cloud hosted and the other is not, then the granules will be a mix of cloud hosted and non-cloud hosted.
Therefore, I believe the proper fix to address both bugs (IndexError and wrong assumption) is to completely remove the line that assigns a value to cloud
and then change the return
statement to this:
return [DataGranule(granule, cloud_hosted=self._is_cloud_hosted(granule)) for granule in response]
However, I think this reveals another bug. It is erroneous to assume that if
response[0]
is (or is not) cloud hosted, then the rest of the granules in the response are (or are not) cloud hosted as well. If the query does not specify thecloud_hosted
parameter, then the results could possibly be a mix of cloud hosted and non-cloud hosted granules.
:100: This is the smell I was trying to communicate, thanks for explaining it better than me!
Therefore, I believe the proper fix to address both bugs (IndexError and wrong assumption) is to completely remove the line that assigns a value to
cloud
and then change thereturn
statement to this:return [DataGranule(granule, cloud_hosted=self._is_cloud_hosted(granule)) for granule in response]
Since DataGranule
constructor and _is_cloud_hosted
both depend on this granule
object (loaded JSON), what do you think of encapsulating that logic within DataGranule
and moving the _is_cloud_hosted
method from DataGranules
to DataGranule
?
Therefore, I believe the proper fix to address both bugs (IndexError and wrong assumption) is to completely remove the line that assigns a value to
cloud
and then change thereturn
statement to this:return [DataGranule(granule, cloud_hosted=self._is_cloud_hosted(granule)) for granule in response]
Since
DataGranule
constructor and_is_cloud_hosted
both depend on thisgranule
object (loaded JSON), what do you think of encapsulating that logic withinDataGranule
and moving the_is_cloud_hosted
method fromDataGranules
toDataGranule
?
Agreed that _is_cloud_hosted
makes sense to be moved to DataGranule
. I would say there's also no need to make it "private" either (i.e., drop the underscore prefix), and should perhaps even be a cached property (from functools
) on DataGranule
, similar to this (I also suggest dropping the is
prefix, but not utterly opposed to keeping it):
@cached_property
def cloud_hosted(self) -> bool:
...
I might also suggest dropping the cloud_hosted
parameter of DataGranule
, but I don't know if there's a specific reason that's there. Perhaps there are cases where we want to force cloud_hosted
to be True
even though the logic currently in _is_cloud_hosted
would evaluate to False
, so it serves as an override in the "I know what I'm doing" sense.
cc: @betolink
The cloud_hosted
parameter enables searching for only cloud-hosted data when cloud_hosted=True
and only DAAC-hosted (a.k.a on-prem) when cloud_hosted=False
.
The
cloud_hosted
parameter enables searching for only cloud-hosted data whencloud_hosted=True
and only DAAC-hosted (a.k.a on-prem) whencloud_hosted=False
.
I'm referring to the cloud_hosted
param to DataGranule
(singular, not plural).
Agreed that
_is_cloud_hosted
makes sense to be moved toDataGranule
. I would say there's also no need to make it "private" either (i.e., drop the underscore prefix), and should perhaps even be a cached property (fromfunctools
) onDataGranule
:100:
I might also suggest dropping the
cloud_hosted
parameter ofDataGranule
, but I don't know if there's a specific reason that's there. Perhaps there are cases where we want to forcecloud_hosted
to beTrue
even though the logic currently in_is_cloud_hosted
would evaluate toFalse
, so it serves as an override in the "I know what I'm doing" sense.
Anybody know why DataGranule
takes a cloud_hosted
parameter?
With my current understanding, it doesn't seem to make sense to me to explicitly specify such a parameter, particularly if we move the logic in DataGranules._is_cloud_hosted
to a new attribute DataGranule.cloud_hosted
. Note that we're moving logic from DataGranules
(plural) to DataGranule
(singular).
I would think that with such a move, we would want to deprecate the cloud_hosted
parameter to DataGranule.__init__
.
I might also suggest dropping the
cloud_hosted
parameter ofDataGranule
, but I don't know if there's a specific reason that's there. Perhaps there are cases where we want to forcecloud_hosted
to beTrue
even though the logic currently in_is_cloud_hosted
would evaluate toFalse
, so it serves as an override in the "I know what I'm doing" sense.Anybody know why
DataGranule
takes acloud_hosted
parameter?With my current understanding, it doesn't seem to make sense to me to explicitly specify such a parameter, particularly if we move the logic in
DataGranules._is_cloud_hosted
to a new attributeDataGranule.cloud_hosted
. Note that we're moving logic fromDataGranules
(plural) toDataGranule
(singular).I would think that with such a move, we would want to deprecate the
cloud_hosted
parameter toDataGranule.__init__
.
@betolink, do you have any background knowledge on why this is implemented this way?
Sorry for the delayed response, there are collections that exist both in the DAAC and AWS with the same short_name
. cloud_hosted
is a flag to indicate that we can use the S3 links. The ultimate reason is to simplify open()
and download()
so it defaults to S3 if the code is running in us-west-2
. We could check for the S3 links in "RelatedUrls" (umm) but injecting a flag from the search results was faster if I recall. @chuckwondo
Sorry for the delayed response, there are collections that exist both in the DAAC and AWS with the same
short_name
.cloud_hosted
is a flag to indicate that we can use the S3 links. The ultimate reason is to simplifyopen()
anddownload()
so it defaults to S3 if the code is running inus-west-2
. We could check for the S3 links in "RelatedUrls" (umm) but injecting a flag from the search results was faster if I recall. @chuckwondo
Sure, injecting a flag would be faster (although I suspect this would be negligible, thus a questionable "optimization"), but more importantly, it's possibly simply wrong, per an earlier comment I made:
However, I think this reveals another bug. It is erroneous to assume that if response[0] is (or is not) cloud hosted, then the rest of the granules in the response are (or are not) cloud hosted as well. If the query does not specify the cloud_hosted parameter, then the results could possibly be a mix of cloud hosted and non-cloud hosted granules.
For example, it is valid to specify multiple collection concept IDs in a CMR granule search query. If one collection is cloud hosted and the other is not, then the granules will be a mix of cloud hosted and non-cloud hosted.
Observations added from duplicate #816:
we better also make sure that
earthaccess.download
doesn't choke when given [].perhaps when [
earthaccess.search_data
] returns an empty list we should log a warning?
As a workaround, I used the following:
try:
granules = earthaccess.search_data(concept_id = id, temporal = (start_date,end_date))
except: print("no granules found")
What Happens
If I do a simple search for granules that returns zero hits, I get an index error.
What I expect
I would expect this to return an empty list as print
Granules found: 0
to stdout.Proposed change
Check that response is not empty before checking
_is_cloud_hosted
. Return empty list if it is empty.