granules table schema - phase II

scisco commented 7 years ago

GranulesTable

fieldName	Type	Description
granuleId	String (range)	Granule's unique ID in cumulus
collectionName	String (hash)	Granule's collection
granuleShortName	String	Granule's unique ID in CMR
files	Object	List of files and their locations
recipe	Object	The recipe the granule is processed with
readyForProcess	Number	The time the granule becomes ready for processing (when it has the files needed for processing)
processedAt	Number	The time the recipe processing is completed (remains null if the processing fails)
pushedToCMRAt	Number	The time the metadata is pushed to the CMR
archivedAt	Number	The time the archiving is complete and the granule is ready for distribution
failedAt	Number	The time the processing of the granule failed
createdAt	Number	The time the record is created
updatedAt	Number	The time the record is updated
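The (hash) and (range) markers in the table above can be sketched as the key-related part of a DynamoDB CreateTable request. This is only an illustration, assuming the table is literally named GranulesTable and that both key attributes are strings as listed; throughput and other settings are left out, and the `primary_key` helper is hypothetical.

```python
# Key-related fields of a DynamoDB CreateTable request (sketch only).
# Only key attributes are declared; the remaining fields are schemaless.
granules_table_keys = {
    "TableName": "GranulesTable",  # assumed name, taken from the heading
    "AttributeDefinitions": [
        {"AttributeName": "collectionName", "AttributeType": "S"},
        {"AttributeName": "granuleId", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "collectionName", "KeyType": "HASH"},  # hash key
        {"AttributeName": "granuleId", "KeyType": "RANGE"},      # range key
    ],
}


def primary_key(record: dict) -> tuple:
    """Hypothetical helper: the (hash, range) pair that identifies a record."""
    return (record["collectionName"], record["granuleId"])
```

With this layout a record is addressed by the pair of collectionName and granuleId rather than by granuleId alone.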

Example Granule Record

{
  "granuleId": "D20120907_022056_P",
  "collectionName": "AVAPS",
  "granuleShortName": "D20120907_022056_P",
  "files": {
    "D20120907_022056_P.QC.PresCorrQC.nc": {
      "originalFile": "ftp://hs3.nsstc.nasa.gov/pub/hs3/AVAPS/data/2012/txt/0907/D20120907_022056_P.PresCorrQC.nc",
      "stagingFile": "s3://cumulus-staging/ghrc/avaps/D20120907_022056_P.PresCorrQC.nc",
      "archivedFile": "s3://cumulus-protected/ghrc/avaps/D20120907_022056_P.PresCorrQC.nc"
    },
    "D20120907_022056_P.QC.eol": {
      "stagingFile": "s3://cumulus-staging/ghrc/avaps/D20120907_022056_P.QC.nc",
      "archivedFile": "s3://cumulus-protected/ghrc/avaps/D20120907_022056_P.QC.nc"
    },
    "D20120907_022056_P.meta": {
      "stagingFile": "s3://cumulus-staging/ghrc/avaps/D20120907_022056_P.meta",
      "archivedFile": "s3://cumulus-private/ghrc/avaps/D20120907_022056_P.meta"
    }

  },
  "recipe": [
    {
      "name": "process",
      "config": {
        "steps": [{
          "description": "Generate NetCDF",
          "image": "985962406024.dkr.ecr.us-east-1.amazonaws.com/cumulus-data-acquisition:latest",
          "cmd": "start.sh"
        }]
      }
    },
    {
      "name": "pushToCMR",
      "config": {}
    },
    {
      "name": "archive",
      "config": {}
    }
  ],
  "readyForProcess": 1480962569,
  "processedAt": 1480962569,
  "pushedToCMRAt": 1480962569,
  "archivedAt": 1480962569,
  "createdAt": 1480962569,
  "updatedAt": 1480962569
}

scisco commented 7 years ago

@rhartran for your review

rhartran commented 7 years ago

Is the granuleId assigned by Cumulus? Or, does it come from the data provider?

rhartran commented 7 years ago

What time format is this using? How hard is it to convert to a local time when viewing? Would it be easier to use "date" data type?

rhartran commented 7 years ago

Would it help to introduce a "file type" in the files section? This could match the file type in the PDR and make it easier for jobs like: tell me the URL for the metadata file, or browse file.

scisco commented 7 years ago

@rhartran my responses below:

Is the granuleId assigned by Cumulus? Or, does it come from the data provider?

It is extracted from the filename using a Regex set by the operator. Obviously there is an assumption here that all files of a collection follow a specific naming convention.
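A minimal sketch of that extraction, assuming the operator's regex exposes the granuleId as its first capture group. The pattern and the function name are hypothetical and only fit the AVAPS filenames in the example record.

```python
import re
from typing import Optional

# Hypothetical operator-supplied pattern for the AVAPS collection.
# The first capture group is taken as the granuleId.
AVAPS_GRANULE_ID_REGEX = r"^(D\d{8}_\d{6}_P)\."


def extract_granule_id(filename: str, pattern: str) -> Optional[str]:
    """Return the granuleId captured from the filename, or None if the
    filename does not follow the collection's naming convention."""
    match = re.match(pattern, filename)
    return match.group(1) if match else None
```

Files such as D20120907_022056_P.PresCorrQC.nc and D20120907_022056_P.meta map to the same granuleId, while a filename that breaks the convention yields None and would need separate handling.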

What time format is this using? How hard is it to convert to a local time when viewing? Would it be easier to use "date" data type?

We use Unix epoch time (seconds since 1970-01-01 UTC), because DynamoDB doesn't have a date type, so a timestamp has to be stored as a string or a number. The conversion to local time shouldn't be much of a problem. Unix time handles it quite well.
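For illustration, a short sketch of the conversion at display time using the timestamp from the example record. The function names and the UTC-6 offset are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def epoch_to_utc(seconds: int) -> datetime:
    """Interpret an epoch value (in seconds) as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def epoch_to_offset(seconds: int, offset_hours: int) -> datetime:
    """Same instant, shown at a fixed UTC offset (e.g. -6 for UTC-6)."""
    return datetime.fromtimestamp(seconds, tz=timezone(timedelta(hours=offset_hours)))


processed_at = 1480962569  # value from the example granule record
print(epoch_to_utc(processed_at).isoformat())         # 2016-12-05T18:29:29+00:00
print(epoch_to_offset(processed_at, -6).isoformat())  # 2016-12-05T12:29:29-06:00
```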

Would it help to introduce a "file type" in the files section? This could match the file type in the PDR and make it easier for jobs like: tell me the URL for the metadata file, or browse file.

It does help. Although I think the right place for the fileType would be in the collectionTable not in the granuleTable. What do you think?

rhartran commented 7 years ago

Comments preceded by RMH:

Bob

It is extracted from the filename using a Regex set by the operator. Obviously there is an assumption here that all files of a collection follow a specific naming convention.

RMH: We should check with the DAACs to make sure that is OK. I think that algorithm would not support granule replacement scenarios where Cumulus needs to keep the old instance of the granule and also add the new instance. For this use case, SDPS offers two options: automatically replace the granule in the public/protected directory and move the old instance to a hidden/private directory, or keep the new instance in hidden and let the DAAC manually choose to replace the old instance. Another, less likely, case is that files from different collections map to the same granuleId given the regex. I think many systems assign their own "unintelligent" key to each granule as it is received.
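A rough sketch of the "unintelligent" key idea, assuming a random UUID is acceptable as the system-assigned key. The `instanceKey` field and the function name are hypothetical and not part of the proposed schema.

```python
import uuid


def new_granule_instance(collection_name: str, granule_id: str) -> dict:
    """Create a record whose identity is a system-assigned key, so two
    instances with the same regex-derived granuleId can coexist."""
    return {
        "instanceKey": uuid.uuid4().hex,  # hypothetical field, carries no meaning
        "collectionName": collection_name,
        "granuleId": granule_id,
    }


old_instance = new_granule_instance("AVAPS", "D20120907_022056_P")
replacement = new_granule_instance("AVAPS", "D20120907_022056_P")
```

Both records keep the same granuleId for lookup by name, but each has its own key, so a replacement does not have to overwrite the old instance.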

It does help. Although I think the right place for the fileType would be in the collectionTable not in the granuleTable. What do you think?

RMH: I think some of it depends on how you model a granule. If you model a granule as a set of files (science, browse, metadata, QA, PH), then having a file type associated with each file in the granule seems like an important thing to have. It will allow Cumulus to offer more services when distributing files. If you model a granule as just the science file and have other entities such as browse, metadata, QA, and PH that are associated with the science granule, then storing a file type is not necessary.
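To make the first model concrete, here is a sketch of a lookup by file type over a `files` object shaped like the example record. The `fileType` field, its values, and the function name are hypothetical; the entries are adapted from the example record.

```python
from typing import Optional

# Two entries adapted from the example record, with a hypothetical
# "fileType" field added to each file.
files = {
    "D20120907_022056_P.QC.PresCorrQC.nc": {
        "fileType": "science",
        "archivedFile": "s3://cumulus-protected/ghrc/avaps/D20120907_022056_P.PresCorrQC.nc",
    },
    "D20120907_022056_P.meta": {
        "fileType": "metadata",
        "archivedFile": "s3://cumulus-private/ghrc/avaps/D20120907_022056_P.meta",
    },
}


def archived_url(granule_files: dict, file_type: str) -> Optional[str]:
    """Return the archived location of the first file with the given type."""
    for entry in granule_files.values():
        if entry.get("fileType") == file_type:
            return entry.get("archivedFile")
    return None
```

With the type stored per file, "tell me the URL for the metadata file" becomes `archived_url(files, "metadata")`, with no need to infer the type from the filename.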

nasa / cumulus-api

granules table schema - phase II #18
