nsidc / .github


Explore data warehousing tools for NSIDC data #8

Open MattF-NSIDC opened 1 year ago

MattF-NSIDC commented 1 year ago

For a typical business, a data warehouse might help business analysts answer questions like "How many sales did we make last week?" via a query to an API or database. Our business is science data, so we may want to ask "How many MB of data did we ingest last week?", "What files are available for dataset X between dates Y and Z?", or "Where was dataset X migrated to/from, and when?" A data warehouse would be the source of truth for information about our dataset inventory.
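To make those questions concrete, here is a minimal sketch of what such queries could look like against a file inventory. Everything in it is an assumption for illustration: the `files` table, its columns, the example paths, and the function names are hypothetical, and SQLite only stands in for whatever warehouse tool we end up evaluating.

```python
import sqlite3

# Hypothetical inventory schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE files (
    dataset     TEXT NOT NULL,
    path        TEXT NOT NULL,
    size_bytes  INTEGER NOT NULL,
    data_date   TEXT NOT NULL,   -- ISO date the file's contents cover
    ingested_at TEXT NOT NULL    -- ISO date the file was ingested
);
"""


def build_inventory(rows):
    """Create an in-memory inventory and load (dataset, path, size, data_date, ingested_at) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def megabytes_ingested(conn, start, end):
    """Total MB (10^6 bytes) ingested with start <= ingested_at < end."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(size_bytes), 0) FROM files "
        "WHERE ingested_at >= ? AND ingested_at < ?",
        (start, end),
    ).fetchone()
    return total / 1_000_000


def files_for_dataset(conn, dataset, start, end):
    """Paths for one dataset whose data_date falls in [start, end], oldest first."""
    cursor = conn.execute(
        "SELECT path FROM files "
        "WHERE dataset = ? AND data_date BETWEEN ? AND ? "
        "ORDER BY data_date",
        (dataset, start, end),
    )
    return [row[0] for row in cursor]


# Made-up example rows; these are not real NSIDC paths or datasets.
ROWS = [
    ("dataset_x", "/pool_a/dataset_x/x_20230101.nc", 2_000_000, "2023-01-01", "2023-03-06"),
    ("dataset_x", "/pool_a/dataset_x/x_20230102.nc", 3_000_000, "2023-01-02", "2023-03-07"),
    ("dataset_y", "/pool_b/dataset_y/y_2023001.nc", 5_000_000, "2023-01-01", "2023-02-01"),
]
conn = build_inventory(ROWS)
```

With an inventory like this, "how many MB did we ingest last week?" becomes `megabytes_ingested(conn, "2023-03-06", "2023-03-13")`, and "what files are available for dataset X between Y and Z?" becomes a call to `files_for_dataset`.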

We have a "data warehouse" system for ECS data in the form of CMR, but the rest of our data is managed solely as files on disk, and it's often hard to predict where on disk the data will be or how to find the data you're interested in. Was it migrated to a new datapool recently? How do we determine which date a particular file corresponds to? Currently we have to know or discover for ourselves where in the filename the time is, what format it uses (e.g. YYYYMMDD, YYYYDOY, or something else), and where the gaps in coverage are.
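As a rough illustration of the per-dataset knowledge a warehouse would have to capture, the sketch below pulls a date out of a filename under two assumed conventions (YYYYMMDD and YYYYDOY) and reports gaps in daily coverage. The `PATTERNS` table, the function names, and the assumption of daily cadence are all hypothetical; real datasets would likely need their own rules.

```python
import re
from datetime import date, datetime, timedelta

# Assumed filename conventions: name -> (regex for the date token, strptime format).
# The lookarounds keep us from matching part of a longer run of digits.
PATTERNS = {
    "YYYYMMDD": (re.compile(r"(?<!\d)(\d{8})(?!\d)"), "%Y%m%d"),
    "YYYYDOY": (re.compile(r"(?<!\d)(\d{7})(?!\d)"), "%Y%j"),
}


def file_date(filename, convention):
    """Return the date embedded in filename, or None if no valid date is found."""
    pattern, fmt = PATTERNS[convention]
    match = pattern.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), fmt).date()
    except ValueError:
        # Digits matched the pattern but are not a real date (e.g. month 13).
        return None


def coverage_gaps(dates):
    """Return (first_missing, last_missing) pairs, assuming one file per day."""
    ordered = sorted(set(dates))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if (current - previous).days > 1:
            gaps.append((previous + timedelta(days=1), current - timedelta(days=1)))
    return gaps
```

The point of the sketch is that the convention has to be recorded somewhere per dataset; today that knowledge lives in people's heads rather than in a queryable inventory.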

It would be useful to provide a service that enables users to:

If we ran a tool like MinIO in front of all of our datapools, a data warehouse could return S3 URLs for datasets, and data could be accessed over the S3 protocol instead of requiring disk mounts. This would make the transition to the cloud more transparent for our apps.
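One way to picture this: the warehouse stores (or derives) an S3 URL for each file instead of a mount path. The sketch below assumes a hypothetical one-bucket-per-datapool layout; the mount points, bucket names, and `s3_url_for` function are made up for illustration and say nothing about how MinIO would actually be configured here.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from datapool mount point to the bucket fronting it.
DATAPOOL_BUCKETS = {
    "/pool_a": "pool-a",
    "/pool_b": "pool-b",
}


def s3_url_for(path):
    """Translate an on-disk datapool path into an s3:// URL.

    Raises ValueError if the path is not under a known datapool mount.
    """
    posix_path = PurePosixPath(path)
    for mount, bucket in DATAPOOL_BUCKETS.items():
        try:
            key = posix_path.relative_to(mount)
        except ValueError:
            # Path is not under this mount; try the next datapool.
            continue
        return f"s3://{bucket}/{key}"
    raise ValueError(f"{path} is not under a known datapool")
```

If a dataset later moves from a datapool to a cloud bucket, only the mapping changes; apps that consume the URLs would not need to know about the migration.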