podaac / concise

CONCISE (CONCatenatIon SErvice)
https://podaac.github.io/concise
Apache License 2.0

Improve name of output file #117

Open danielfromearth opened 1 month ago

danielfromearth commented 1 month ago

From discussions with @alexrad71...

Issue

Currently, CONCISE writes the output to a file named with the Collection ID and "_merged.nc4", as defined here. This name is not very useful to many users.

Proposed Change

Add both the Collection's short name and the version number to the output filename.

What would be even better :)

Include information about the start and end granules in the output filename. For instance, the start and end timestamps could be retrieved from CMR for the start and end granules, and then converted to str, and then added to the output filename.

frankinspace commented 1 month ago

👍 I think it's a good idea.

For a little additional historical context: there was some discussion back in 2020 around the output file naming conventions used for Harmony services (https://wiki.earthdata.nasa.gov/display/HARMONY/Output+File+Naming+Convention). At the time, the general consensus was that the original filename should be preserved as much as possible. Obviously, with CONCISE that is not exactly feasible, because the output is a combination of many input files (this is noted on the https://wiki.earthdata.nasa.gov/display/HARMONY/Transformation+service+availability+and+compliance#Transformationserviceavailabilityandcompliance-Servicecompliance page).

As such, as part of this ticket, that wiki page should be updated to specify what the output filename format is.

With respect to start and end timestamps, rather than introduce a dependency on CMR, I wonder if it would be better to just inspect the data in the final output file and use the min/max timestamps found in the data itself for the filename.
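As a rough illustration of the min/max idea, here is a minimal sketch. It assumes the time values have already been read from the merged file as offsets in seconds from a known epoch; the function name `time_bounds`, the epoch, and the sample values are all hypothetical, and reading the actual time variable from the NetCDF file is left out.

```python
from datetime import datetime, timedelta, timezone


def time_bounds(time_values, epoch):
    """Return (earliest, latest) datetimes for offsets given in seconds since `epoch`.

    Raises ValueError if `time_values` is empty.
    """
    earliest = epoch + timedelta(seconds=min(time_values))
    latest = epoch + timedelta(seconds=max(time_values))
    return earliest, latest


# Hypothetical epoch and offsets standing in for a time variable
# read from the merged output file.
epoch = datetime(1980, 1, 6, tzinfo=timezone.utc)
start, end = time_bounds([1386000000.0, 1386003600.0, 1385996400.0], epoch)
```

Because the bounds come from the data itself, this avoids a call to CMR, at the cost of having to know which variable holds time and what its units are for each collection.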

danielfromearth commented 1 month ago

Looking at the relevant code a bit more, it now seems to me that the short name and version number aren't directly accessible in the harmony.message information or in the granules' data themselves. So, unlike the min/max timestamps (which, as @frankinspace mentioned, can be retrieved from the granules directly), the short name and version number would only be retrievable with a call to CMR during execution of CONCISE's service adapter. Thanks @frankinspace for referencing the Confluence page, because I see now that @bilts also recommended including these bits of information. To implement this cleanly (without a separate call to CMR), would the harmony.message need to be amended to include the short name and version information? And would that be a good change to make?

The only alternatives I can currently think of for including useful information beyond (just) the ConceptID and timestamps are to put the full granule name of the first granule along with the number of granules, or to put the names of the first and last granules. Having the full granule names would actually be more analogous to the output filenames from other services, such as l2ss, harmony-netcdf-to-zarr, or net2cog.

So, instead of the current naming, which is:

filename = f'{collection}_merged.nc4'

, the new naming could look something like (with min/max time stamps):

filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4'

, or (with # granules and first name; note, this is what stitchee is doing currently):

filename = f"{collection}-concatenated_{number_of_granules}_starting_from_{first_url_name}.nc4"

, or (with first and last names):

filename = f"{collection}_concatenated_granules_from_{first_url_name}_to_{last_url_name}.nc4"

Do any of these look like good approaches? Or are there other ideas?
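One thing that may be worth weighing for the timestamp option: `datetime.isoformat()` output contains colons, which are not allowed in Windows filenames, and its hyphens make a `-` separator between the two timestamps harder to parse. A minimal sketch of a compact alternative follows; the helper name `timestamp_name` and the sample times are hypothetical, and the compact format is just one possible choice.

```python
from datetime import datetime, timezone


def timestamp_name(collection, start, end):
    """Build a merged filename using compact UTC timestamps with no colons."""
    fmt = "%Y%m%dT%H%M%SZ"
    start_utc = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    return f"{collection}_{start_utc.strftime(fmt)}_{end_utc.strftime(fmt)}_merged.nc4"


# Hypothetical start/end times for illustration.
start = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 5, 1, 13, 30, 0, tzinfo=timezone.utc)

# The isoformat-based pattern from above, for comparison.
iso_name = f"C2930726639-LARC_CLOUD_{start.isoformat()}-{end.isoformat()}_merged.nc4"
compact_name = timestamp_name("C2930726639-LARC_CLOUD", start, end)
```

Here `iso_name` contains `2024-05-01T12:00:00+00:00`, while `compact_name` uses `20240501T120000Z`, which stays filesystem-safe and splits cleanly on underscores.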

Also tagging @ank1m, @chris-durbin, and @owenlittlejohns, since this is likely relevant for other and/or future "many-to-one" output services.

alexrad71 commented 1 month ago

I like this naming more: filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4', but with one correction. Currently, collection means C2930726639-LARC_CLOUD, and a typical user will not remember a day later that this collection ID means TEMPO_O3TOT_L2_V03. So it would be great to replace the collection ID with the human-readable full collection name.
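A minimal sketch of how the prefix could be chosen, assuming the short name and version become available to the service somehow (via an amended harmony.message or a CMR lookup, neither of which exists in this form today). The function `readable_prefix` is hypothetical, as is the split of TEMPO_O3TOT_L2_V03 into a short name and a version.

```python
def readable_prefix(collection_id, short_name=None, version=None):
    """Prefer 'short name + version' for the filename; fall back to the collection ID."""
    if short_name and version:
        return f"{short_name}_{version}"
    return collection_id


# With metadata available (hypothetical split of the name quoted above).
with_metadata = readable_prefix("C2930726639-LARC_CLOUD", "TEMPO_O3TOT_L2", "V03")

# Without metadata, the current behaviour is preserved.
without_metadata = readable_prefix("C2930726639-LARC_CLOUD")
```

The fallback keeps existing behaviour whenever the readable name is missing, so the change would not break requests for which the metadata cannot be obtained.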

ank1m commented 1 month ago

Maybe we can start with {first_url_name}, which seems to include the short_name, version and first-datetime already? Say f'{first_url_name}_{last_datetime.isoformat()}_{collection}_merged.nc4'?
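A minimal sketch of that pattern, under stated assumptions: the granule URL and its filename layout are made up for illustration, `merged_name` is a hypothetical helper, and a compact timestamp is swapped in for `isoformat()` to keep colons out of the filename.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def merged_name(first_url, last_datetime, collection):
    """Combine the first granule's base name, the last timestamp, and the collection ID."""
    # Take the last path component of the URL and drop its final extension.
    first_url_name = PurePosixPath(urlparse(first_url).path).stem
    last_stamp = last_datetime.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{first_url_name}_{last_stamp}_{collection}_merged.nc4"


# Hypothetical granule URL; real granule names may be laid out differently.
name = merged_name(
    "https://example.com/data/TEMPO_O3TOT_L2_V03_20240501T120000Z_S001G01.nc",
    datetime(2024, 5, 1, 13, 30, 0, tzinfo=timezone.utc),
    "C2930726639-LARC_CLOUD",
)
```

This only yields a readable short name and version when the granule naming convention happens to embed them, which would need to be checked per collection.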