Open danielfromearth opened 1 month ago
👍 I think it's a good idea.
For a little additional historical context: there was some discussion of the output file naming conventions used for Harmony services back in 2020 (https://wiki.earthdata.nasa.gov/display/HARMONY/Output+File+Naming+Convention). At the time, the general consensus was that the original filename should be preserved as much as possible. With CONCISE, that is not exactly feasible, because the output is a combination of many input files (this is noted on the https://wiki.earthdata.nasa.gov/display/HARMONY/Transformation+service+availability+and+compliance#Transformationserviceavailabilityandcompliance-Servicecompliance page).
As such, as part of this ticket, that wiki page should be updated to specify what the output filename format is.
With respect to the start and end timestamps, rather than introduce a dependency on CMR, I wonder if it would be better to just inspect the data in the final output file and use the min/max timestamps found in the data itself for the filename.
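As a rough illustration of the min/max idea, here is a minimal sketch. It assumes the time values have already been read from the merged file as numeric offsets in seconds since a known epoch. The function name `time_bounds`, the `fill_value` handling, and the sample values are all hypothetical, and how the time variable is actually read out of the file is left open.

```python
import math
from datetime import datetime, timedelta, timezone


def time_bounds(time_values, epoch, fill_value=None):
    """Return the (earliest, latest) datetimes among offsets in seconds since `epoch`.

    Values equal to `fill_value` and NaN values are skipped.
    Raises ValueError if no usable values remain.
    """
    seconds = [
        float(value)
        for value in time_values
        if value != fill_value and not math.isnan(float(value))
    ]
    return (
        epoch + timedelta(seconds=min(seconds)),
        epoch + timedelta(seconds=max(seconds)),
    )


# Hypothetical values, standing in for a time variable read from the merged file.
epoch = datetime(2000, 1, 1, tzinfo=timezone.utc)
raw_times = [7200.0, 0.0, float("nan"), -9999.0, 3600.0]
datetimes = time_bounds(raw_times, epoch, fill_value=-9999.0)
```

The resulting pair could then be formatted into the filename, which would avoid the extra CMR dependency, provided the merged file carries a usable time variable.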
Looking at the relevant code for this a bit more, it now seems to me that the short name and version number aren't directly accessible in the harmony.message information or in the granules' data themselves. So, unlike the min/max timestamps (which, as @frankinspace mentioned, can be retrieved from the granules directly), the short name and version number would only be retrievable with a call to CMR during execution of CONCISE's service adapter. Thanks @frankinspace for referencing the Confluence page, because I see now that @bilts also recommended including these bits of information. To implement this cleanly (without a separate call to CMR), would the harmony.message need to be amended to include the short name and version information? And would that be a good change to make?
The only alternatives I can currently think of (i.e., ways to include useful information beyond just the ConceptID and timestamps) are to put the full name of the first granule along with the number of granules, or to put the names of the first and last granules. Having the full granule names would actually be more analogous to the output filenames from other services, such as l2ss, harmony-netcdf-to-zarr, or net2cog.
So, instead of the current naming, which is:
filename = f'{collection}_merged.nc4'
the new naming could look something like one of the following.
With min/max timestamps:
filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4'
With the number of granules and the first name (note, this is what stitchee is doing currently):
filename = f"{collection}-concatenated_{number_of_granules}_starting_from_{first_url_name}.nc4"
With the first and last names:
filename = f"{collection}_concatenated_granules_from_{first_url_name}_to_{last_url_name}.nc4"
Do any of these look like good approaches? Or are there other ideas?
Also tagging @ank1m, @chris-durbin, and @owenlittlejohns, since this is likely relevant for other and/or future "many-to-one" output services.
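To make the three candidates above easier to compare side by side, here is a small sketch that wraps each one in a function. The function names, the `url_stem` helper, and the granule URLs are hypothetical, and it assumes that `first_url_name` and `last_url_name` are the granule filenames with their extensions stripped, which may not match what the services actually do.

```python
from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_stem(url):
    """Return the final path component of a URL without its extension (assumed meaning of *_url_name)."""
    return PurePosixPath(urlparse(url).path).stem


def name_with_time_range(collection, start, end):
    """Candidate 1: collection plus min/max timestamps."""
    return f"{collection}_{start.isoformat()}-{end.isoformat()}_merged.nc4"


def name_with_count(collection, number_of_granules, first_url_name):
    """Candidate 2: collection, number of granules, and first granule name."""
    return f"{collection}-concatenated_{number_of_granules}_starting_from_{first_url_name}.nc4"


def name_with_first_and_last(collection, first_url_name, last_url_name):
    """Candidate 3: collection plus first and last granule names."""
    return f"{collection}_concatenated_granules_from_{first_url_name}_to_{last_url_name}.nc4"


# Made-up granule URLs, for illustration only.
urls = [
    "https://example.com/data/granule_A.nc",
    "https://example.com/data/granule_B.nc",
    "https://example.com/data/granule_C.nc",
]
collection = "C2930726639-LARC_CLOUD"
first, last = url_stem(urls[0]), url_stem(urls[-1])

by_time = name_with_time_range(collection, datetime(2024, 1, 1, 12), datetime(2024, 1, 1, 13))
by_count = name_with_count(collection, len(urls), first)
by_names = name_with_first_and_last(collection, first, last)
```

One thing the sketch brings out: `isoformat()` puts colons in the name (for example `2024-01-01T12:00:00`), and colons are not allowed in Windows filenames, so a compact form such as `strftime('%Y%m%dT%H%M%SZ')` might be worth considering for the timestamp option.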
I like this naming, filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4', more, but with one correction: currently collection means C2930726639-LARC_CLOUD, and a typical user will not remember a day later that this collection ID means TEMPO_O3TOT_L2_V03. So it would be great to replace the collection ID with the human-readable full name of the collection.
Maybe we can start with {first_url_name}, which seems to include short_name, version, and first-datetime already? Say f'{first_url_name}_{last-datetime.isoformat()}_{collection}_merged.nc4'?
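A quick sketch of what that suggestion might produce, with `last_datetime` spelled as a valid Python identifier. The function name `merged_name` and the granule name are made up for illustration, and the sketch assumes the first granule's name really does carry the short name, version, and first datetime.

```python
from datetime import datetime


def merged_name(first_url_name, last_datetime, collection):
    """Combine the first granule name, the last timestamp, and the collection ID."""
    return f"{first_url_name}_{last_datetime.isoformat()}_{collection}_merged.nc4"


# Made-up granule name, for illustration only.
example = merged_name(
    "TEMPO_O3TOT_L2_V03_20240101T120000Z_S001G01",
    datetime(2024, 1, 1, 13, 0, 0),
    "C2930726639-LARC_CLOUD",
)
```

With these inputs the name comes out fairly long, and the two timestamps end up in different formats unless the last datetime is formatted to match the granule's own convention.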
From discussions with @alexrad71...
Issue
Currently, CONCISE writes the output to a file named with the Collection ID and "_merged.nc4", as defined here. This name is not very useful to many users.
Proposed Change
Add both the Collection's short name and the version number to the output filename.
What would be even better :)
Include information about the start and end granules in the output filename. For instance, the start and end timestamps could be retrieved from CMR for the start and end granules, converted to str, and then added to the output filename.