torimcd opened this issue 8 months ago
My take would be to match as closely as we can the plan for data curation. If I recall correctly, for SWOT reprocessing the understanding is that podaac will only have a single collection version available at any given time. I think hydrocron should match that... whatever collection version is available is the version of the data served by hydrocron.
That said, I think we should also try to minimize downtime of the API itself. So I think we should think of it as more of a blue/green style of update, where blue = data connected to the collection version that is being sunsetted and green = data connected to the new collection version.
The API would be connected to blue tables by default. When a new collection version is being planned, we should create an empty green copy of the tables and connect the green tables to ingest events of the new collection version. So as reprocessed data is delivered to the green collection, it would flow into the green db tables. We could also implement a strategy for manually controlling which tables the API queries against (the easiest thing that comes to mind would simply be a request parameter... maybe undocumented so as not to add confusion?). Once the new collection is in operations and public, we make the green tables the new blue and drop the old blue tables.
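As a rough sketch of that idea (the table names and the override parameter below are hypothetical, not the actual hydrocron configuration), the API could resolve blue vs. green tables from an optional request parameter:

```python
# Hypothetical table-set selection for the blue/green approach described above.
TABLE_SETS = {
    "blue": {   # tables fed by the collection version being sunsetted
        "reach": "hydrocron-swot-reach-blue",
        "node": "hydrocron-swot-node-blue",
    },
    "green": {  # tables fed by the new (reprocessed) collection version
        "reach": "hydrocron-swot-reach-green",
        "node": "hydrocron-swot-node-green",
    },
}

def resolve_tables(params: dict) -> dict:
    """Pick the table set for a request; default to blue unless the
    (possibly undocumented) override parameter says otherwise."""
    return TABLE_SETS.get(params.get("table_set", "blue"), TABLE_SETS["blue"])
```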
Along the same line of thought... we could also say that new collection versions always result in an increment to the hydrocron API version (other changes could also increment the API version, not just a new collection; but a new collection would always increment the API version).
That way there could be an overlap period where both versions of the data are available via the API:

/hydrocron/v1/timeseries -> blue data
/hydrocron/v2/timeseries -> green data
This could be considered too burdensome for our users though as the URL would change. But depending on the nature of the data change it might actually be warranted?
However, that could still be mitigated further by having an additional endpoint that simply points to the 'latest' API version:

/hydrocron/LATEST/timeseries -> blue data
I agree with matching the plan for data curation as closely as we can! And also with minimizing downtime of the API!
Adding additional versions of hydrocron and changing the URL feels like it could get really complicated, and mistakes could be made... people thinking they're getting the latest when they aren't. Links and workflows are shared quickly among scientists, and whoever made the initial decision on which version to use would be making that decision for everyone they share their link with. I'd prefer to keep it simple and just have one deployed version, but could be convinced otherwise :)
NASD ticket requesting the ability to serve multiple stages simultaneously from CloudFront: https://bugs.earthdata.nasa.gov/browse/NASD-4190

These are the actual work tickets:
https://bugs.earthdata.nasa.gov/browse/NGAP-10685
https://bugs.earthdata.nasa.gov/browse/NGAP-10841
Checked with @jjmcnelis; the next SWOT reprocessing campaign is currently expected to occur in 25.1, so that should be considered the deadline for this feature to be implemented.
I have informed platform of that date in https://bugs.earthdata.nasa.gov/browse/NGAP-10841
Adding in some info picked up at the SWOT STM this week:
The consensus among the scientists/users of hydrocron I've spoken with so far is a preference for exactly the incremented API version, with a period where both versions are available, as @frankinspace described above:
That way there could be an overlap period where both data is available via API
/hydrocron/v1/timeseries -> blue data
/hydrocron/v2/timeseries -> green data
This way they can still get a full time series by making calls to both tables. Since reprocessing is expected to take months, we don't want to remove the previous version too early. However, when the SWORD version increments, the reach and node ids often change, and could be recycled, so the same reach_id in v1 and v2 could actually be pointing to different features. This table/URL design forces users to think about that and handle it in their applications.
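As an illustration of what that means for users during the overlap period (the v2 path is hypothetical at this point, and the request below assumes the existing /timeseries query parameters), a client could query both versions and keep the results separated by API version so recycled reach ids are never mixed:

```python
# Sketch only: pull the same reach from both data versions and keep them
# keyed by API version, since a reach_id may refer to different features
# under different SWORD versions.
import requests

BASE = "https://soto.podaac.earthdatacloud.nasa.gov/hydrocron"  # illustrative base URL
params = {
    "feature": "Reach",
    "feature_id": "71224100223",          # example id only
    "start_time": "2023-07-01T00:00:00Z",
    "end_time": "2024-07-01T00:00:00Z",
    "output": "csv",
    "fields": "reach_id,time_str,wse",
}

results = {}
for api_version in ("v1", "v2"):
    resp = requests.get(f"{BASE}/{api_version}/timeseries", params=params, timeout=60)
    resp.raise_for_status()
    results[api_version] = resp.text      # keep the two data versions separate
```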
Thanks for the update @torimcd
Will the forward stream be made public immediately or will there be a delay?
It sounds like the consensus is that we need to provide for some period of overlap where users can query both versions of the data. This makes the back-end decision pretty straightforward (IMO): we should maintain tables-per-collection. Theoretically we could add another global secondary index on the 'collection_shortname' / 'collection_version' columns in the existing tables, but I don't think that would provide any benefit over tables-per-collection at this point (@torimcd do you agree?)
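For comparison only, the GSI alternative being set aside would look roughly like the following (table, index, and attribute names are made up for illustration; this is not the real schema):

```python
# Hypothetical sketch of adding a GSI keyed on the collection so one table
# could hold multiple collection versions -- shown only to contrast with the
# tables-per-collection approach favored above.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="hydrocron-swot-reach-table",
    AttributeDefinitions=[
        {"AttributeName": "collection_shortname", "AttributeType": "S"},
        {"AttributeName": "reach_id", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "collection-reach-index",
                "KeySchema": [
                    {"AttributeName": "collection_shortname", "KeyType": "HASH"},
                    {"AttributeName": "reach_id", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }
    ],
)
```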
I do want to explore the new endpoint vs request parameter thought a little more. We don't necessarily need new endpoints to achieve the goal of serving multiple collection versions.
The multiple endpoint solution would look something like:
/hydrocron/v1/timeseries -> SWOT_L2_HR_RiverSP_2.0 data
/hydrocron/v2/timeseries -> SWOT_L2_HR_RiverSPTBD data
/hydrocron/LATEST/timeseries -> 2.0 or TBD data, controlled by podaac operations
However, the same functionality could be achieved by introducing a new request parameter (e.g. 'collection', which could be 'short_name' or 'data_version' or something else):
/hydrocron/v1/timeseries?collection=C2799438299-POCLOUD -> SWOT_L2_HR_RiverSP_2.0 data
/hydrocron/v1/timeseries?collection=CTBD-POCLOUD -> SWOT_L2_HR_RiverSPTBD data
/hydrocron/v1/timeseries?collection=latest -> 2.0 or TBD data, controlled by podaac operations
Here are a few thoughts on the pros/cons of each approach:
Solution 1 (new data = new endpoint):
Solution 2 (add request parameter to specify collection used):
Thinking through this, I'm leaning towards managing the data via a request parameter instead of new endpoints, but I welcome other contributions too.
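A minimal sketch of the request-parameter routing, assuming the collection-to-table mapping and a 'latest' alias controlled by operations (all names below are hypothetical):

```python
# Hypothetical mapping from the 'collection' request parameter to the backing
# DynamoDB table, so the same /v1/timeseries endpoint can serve multiple
# collection versions.
COLLECTION_TABLES = {
    "C2799438299-POCLOUD": "hydrocron-swot-reach-v2_0",   # SWOT_L2_HR_RiverSP_2.0
    "CTBD-POCLOUD": "hydrocron-swot-reach-vTBD",          # new collection, TBD
}
LATEST_COLLECTION = "C2799438299-POCLOUD"  # repointed by operations at switchover

def table_for_request(params: dict) -> str:
    """Resolve the optional 'collection' parameter to a table name."""
    collection = params.get("collection", "latest")
    if collection == "latest":
        collection = LATEST_COLLECTION
    if collection not in COLLECTION_TABLES:
        raise ValueError(f"Unknown collection: {collection}")
    return COLLECTION_TABLES[collection]
```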
Thanks @frankinspace:
There's always the chance of a delay, but I think we should plan assuming that the new data version will be public ~immediately (within Oct, but perhaps not the same day data starts ingesting).
I agree we should go with tables-per-collection. I think using another GSI to work around this only adds complexity, and it effectively duplicates the data anyway, so it doesn't save on storage.
In general, using a query parameter vs. a different endpoint has the same impact on the end user with respect to versioning: in both cases they have to make different API requests to get different versions of the data, so from that standpoint I'm fine with either approach. I like decoupling the API versioning from the data versioning, so I'm on board with going forward with the request parameter. I don't love using the shortname or concept ID as the parameter value, though, since that starts to require a bit more technical knowledge. We also need to think a little about backward compatibility here: multiple endpoints are already backward compatible since the API is already deployed as v1, but I don't know how we could implement this through request parameters without it being a required parameter (otherwise, at what point do we switch the 'default'?), so introducing it would be a breaking change.
Jack's opinion: the blue/green idea described in this comment gets my vote, if we can pull it off: https://github.com/podaac/hydrocron/issues/103#issuecomment-2181817817. If blue is version C and green is version D, then in October blue stops growing and green starts growing from zero, and green will not have the complete data record until mid-2025, optimistically.
Steps needed to implement table versioning based on request parameter:
I know that we don't need our API gateway to support multiple stages for this work but I followed up on the tickets listed in this comment and it looks like NGAP will be able to support this work in 24.4.3 starting in the SIT environment.
Tickets:
The collection name is currently hard-coded as a constant. This means a code change is required whenever there is a new data version release, which is perhaps good if we want to wipe the previous data and replace it with new data in the existing table. The API would always point to the latest data. There are alternate patterns we may want to consider:
1. Create a new db table for new data and keep older versions in their own table for a period of time
   - Would we then want to/be able to support multiple versions of the API, so a user could point to a different base URL if they want to use a previous collection version?
   - What time period would we want to keep older versions of the data for? Need to check with DPub to see what the retirement policy is for SWOT. Once the data in a previous version is no longer available through the archive, it shouldn't be available in hydrocron.
2. Allow loading new data alongside old data in the same table
   - This would involve allowing the operator to specify the collection name/version as an input to the lambda function, which would intentionally result in multiple data versions in the same table. The version/collection name would be returned in the API response, but it is not a field that can be queried on.
   - New data that has the same reach_id/timestamp as existing data would overwrite the existing version (e.g. in a reprocessing campaign), but it would be much harder to manually clear out older versions of the data if there is ever a need to remove them (e.g. if a collection is retired without a new version to replace it). Not sure if that's expected to ever occur.
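A rough sketch of what option 2 could look like in the load lambda (the event shape, table name, and keys here are illustrative only, not the actual hydrocron code):

```python
# Hypothetical: an operator-supplied collection name flows into the load lambda
# and is stored on each item; items sharing the existing primary key
# (reach_id + timestamp) are simply overwritten by the newer data version.
import boto3

def lambda_handler(event, context):
    table = boto3.resource("dynamodb").Table("hydrocron-swot-reach-table")
    collection = event["collection_shortname"]  # specified by the operator
    for feature in event["features"]:
        item = dict(feature)
        # Returned in API responses, but not a queryable key in this option.
        item["collection_shortname"] = collection
        table.put_item(Item=item)  # put_item replaces any item with the same key
    return {"count": len(event["features"])}
```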
@cassienickles @frankinspace @nikki-t would appreciate your thoughts