torimcd opened this issue 8 months ago
My take would be to match as closely as we can the plan for data curation. If I recall correctly, for SWOT reprocessing the understanding is that podaac will only have a single collection version available at any given time. I think hydrocron should match that... whatever collection version is available is the version of the data served by hydrocron.
That said, I think we should also try to minimize downtime of the API itself. So I think we should think of it as more of a blue/green style of update, where blue = data connected to the collection version that is being sunsetted and green = data connected to the new collection version.
The API would be connected to blue tables by default. When a new collection version is being planned, we should create an empty green copy of the tables and connect the green tables to ingest events of the new collection version. So as reprocessed data is delivered to the green collection, it would flow into the green db tables. We could also implement a strategy for manually controlling which tables the API queries against (the easiest thing that comes to mind would simply be a request parameter... maybe undocumented so as not to add confusion?). Once the new collection is in operations and public, we make the green tables the new blue and drop the old blue tables.
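As a rough sketch of that idea (the table names and the override parameter below are hypothetical, not the actual hydrocron configuration), the API could resolve blue vs. green tables from an optional request parameter:

```python
# Hypothetical table-set selection for the blue/green approach described above.
TABLE_SETS = {
    "blue": {   # tables fed by the collection version being sunsetted
        "reach": "hydrocron-swot-reach-blue",
        "node": "hydrocron-swot-node-blue",
    },
    "green": {  # tables fed by the new (reprocessed) collection version
        "reach": "hydrocron-swot-reach-green",
        "node": "hydrocron-swot-node-green",
    },
}

def resolve_tables(params: dict) -> dict:
    """Pick the table set for a request; default to blue unless the
    (possibly undocumented) override parameter says otherwise."""
    return TABLE_SETS.get(params.get("table_set", "blue"), TABLE_SETS["blue"])
```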
Along the same line of thought... we could also say that new collection versions always result in an increment to the hydrocron API version (other changes could also increment the API version, not just a new collection; but a new collection would always increment the API version).
That way there could be an overlap period where both versions of the data are available via the API:

/hydrocron/v1/timeseries -> blue data
/hydrocron/v2/timeseries -> green data
This could be considered too burdensome for our users though as the URL would change. But depending on the nature of the data change it might actually be warranted?
However, that could still be mitigated further by having an additional endpoint that simply points to the 'latest' API version:

/hydrocron/LATEST/timeseries -> blue data
I agree with matching the plan for data curation as closely as we can! And also with minimizing downtime of the API!
Adding additional versions of hydrocron and changing the URL feels like it could get really complicated, and mistakes could be made... people thinking they're getting the latest when they aren't. Links and workflows are shared quickly among scientists, and whoever made the initial decision on which version to use would be making that decision for everyone they share their link with. I'd prefer to keep it simple and just have one deployed version, but could be convinced otherwise :)
NASD ticket requesting the ability to serve multiple stages simultaneously from CloudFront: https://bugs.earthdata.nasa.gov/browse/NASD-4190

These are the actual work tickets:
https://bugs.earthdata.nasa.gov/browse/NGAP-10685
https://bugs.earthdata.nasa.gov/browse/NGAP-10841
Checked with @jjmcnelis; the next SWOT reprocessing campaign is currently expected to occur in 25.1, so that should be considered the deadline for this feature to be implemented.
I have informed platform of that date in https://bugs.earthdata.nasa.gov/browse/NGAP-10841
Adding in some info picked up at the SWOT STM this week:
The consensus among the scientists/users of hydrocron I've spoken with so far is a preference for exactly the incremented API version, with a period where both versions are available, as @frankinspace described above:
That way there could be an overlap period where both data is available via API
/hydrocron/v1/timeseries -> blue data
/hydrocron/v2/timeseries -> green data
This way they can still get a full time series by making calls to both tables. Since reprocessing is expected to take months, we don't want to remove the previous version too early. However, when the SWORD version increments, the reach and node ids often change, and could be recycled, so the same reach_id in v1 and v2 could actually be pointing to different features. This table/URL design forces users to think about that and handle it in their applications.
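As an illustration of what that means for users during the overlap period (the v2 path is hypothetical at this point, and the request below assumes the existing /timeseries query parameters), a client could query both versions and keep the results separated by API version so recycled reach ids are never mixed:

```python
# Sketch only: pull the same reach from both data versions and keep them
# keyed by API version, since a reach_id may refer to different features
# under different SWORD versions.
import requests

BASE = "https://soto.podaac.earthdatacloud.nasa.gov/hydrocron"  # illustrative base URL
params = {
    "feature": "Reach",
    "feature_id": "71224100223",          # example id only
    "start_time": "2023-07-01T00:00:00Z",
    "end_time": "2024-07-01T00:00:00Z",
    "output": "csv",
    "fields": "reach_id,time_str,wse",
}

results = {}
for api_version in ("v1", "v2"):
    resp = requests.get(f"{BASE}/{api_version}/timeseries", params=params, timeout=60)
    resp.raise_for_status()
    results[api_version] = resp.text      # keep the two data versions separate
```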
Thanks for the update @torimcd
Will the forward stream be made public immediately or will there be a delay?
It sounds like the consensus is that we need to provide for some period of overlap where users can query both versions of the data. This makes the back-end decision pretty straightforward (IMO): we should maintain tables-per-collection. Theoretically we could add another global secondary index on the 'collection_shortname' / 'collection_version' columns in the existing tables, but I don't think that would provide any benefit over tables-per-collection at this point (@torimcd do you agree?)
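For comparison only, the GSI alternative being set aside would look roughly like the following (table, index, and attribute names are made up for illustration; this is not the real schema):

```python
# Hypothetical sketch of adding a GSI keyed on the collection so one table
# could hold multiple collection versions -- shown only to contrast with the
# tables-per-collection approach favored above.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="hydrocron-swot-reach-table",
    AttributeDefinitions=[
        {"AttributeName": "collection_shortname", "AttributeType": "S"},
        {"AttributeName": "reach_id", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "collection-reach-index",
                "KeySchema": [
                    {"AttributeName": "collection_shortname", "KeyType": "HASH"},
                    {"AttributeName": "reach_id", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }
    ],
)
```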
I do want to explore the new endpoint vs request parameter thought a little more. We don't necessarily need new endpoints to achieve the goal of serving multiple collection versions.
The multiple endpoint solution would look something like:
/hydrocron/v1/timeseries -> SWOT_L2_HR_RiverSP_2.0 data
/hydrocron/v2/timeseries -> SWOT_L2_HR_RiverSPTBD data
/hydrocron/LATEST/timeseries -> 2.0 or TBD data, controlled by podaac operations
However, the same functionality could be achieved by introducing a new request parameter (e.g. 'collection', which could be 'short_name' or 'data_version' or something else):
/hydrocron/v1/timeseries?collection=C2799438299-POCLOUD -> SWOT_L2_HR_RiverSP_2.0 data
/hydrocron/v1/timeseries?collection=CTBD-POCLOUD -> SWOT_L2_HR_RiverSPTBD data
/hydrocron/v1/timeseries?collection=latest -> 2.0 or TBD data, controlled by podaac operations
Here are a few thoughts on the pros/cons of each approach:
Solution 1 (new data = new endpoint):
Solution 2 (add request parameter to specify collection used):
Thinking through this, I'm leaning towards managing the data via a request parameter instead of new endpoints, but I welcome other contributions too.
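A minimal sketch of the request-parameter routing, assuming the collection-to-table mapping and a 'latest' alias controlled by operations (all names below are hypothetical):

```python
# Hypothetical mapping from the 'collection' request parameter to the backing
# DynamoDB table, so the same /v1/timeseries endpoint can serve multiple
# collection versions.
COLLECTION_TABLES = {
    "C2799438299-POCLOUD": "hydrocron-swot-reach-v2_0",   # SWOT_L2_HR_RiverSP_2.0
    "CTBD-POCLOUD": "hydrocron-swot-reach-vTBD",          # new collection, TBD
}
LATEST_COLLECTION = "C2799438299-POCLOUD"  # repointed by operations at switchover

def table_for_request(params: dict) -> str:
    """Resolve the optional 'collection' parameter to a table name."""
    collection = params.get("collection", "latest")
    if collection == "latest":
        collection = LATEST_COLLECTION
    if collection not in COLLECTION_TABLES:
        raise ValueError(f"Unknown collection: {collection}")
    return COLLECTION_TABLES[collection]
```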
Thanks @frankinspace:
There's always the chance of a delay, but I think we should plan assuming that the new data version will be public ~immediately (within Oct, but perhaps not the same day data starts ingesting).
I agree we should go with tables-per-collection. I think using another GSI to work around this only adds complexity, and it effectively duplicates the data anyway, so it doesn't save on storage.
In general, using a query parameter vs. a different endpoint has the same impact on the end user with respect to versioning: in both cases they have to make different API requests to get different versions of the data, so from that standpoint I'm fine with either approach. I like decoupling the API versioning from the data versioning, so I'm on board with going forward with the request parameter. I don't love using the shortname or concept ID as the parameter value, though, since that starts to require a bit more technical knowledge. We also need to think a little about backward compatibility here: multiple endpoints are already backward compatible since the API is already deployed as v1, but I don't know how we could implement this through request parameters without it being a required parameter (otherwise, at what point do we switch the 'default'?), so introducing it would be a breaking change.
Jack's opinion: the blue/green idea described in this comment gets my vote, if we can pull it off: https://github.com/podaac/hydrocron/issues/103#issuecomment-2181817817. If blue is version C and green is version D, then in October blue stops growing and green starts growing from zero, and green will not have the complete data record until mid-2025, optimistically.
Steps needed to implement table versioning based on request parameter:
I know that we don't need our API gateway to support multiple stages for this work but I followed up on the tickets listed in this comment and it looks like NGAP will be able to support this work in 24.4.3 starting in the SIT environment.
Tickets:
The collection name is currently hard-coded as a constant. This means a code change is required whenever there is a new data version release, which is perhaps good if we want to wipe the previous data and replace it with new data in the existing table. The API would always point to the latest data. There are alternate patterns we may want to consider:
1. Create a new db table for new data and keep older versions in their own table for a period of time
   - Would we then want to/be able to support multiple versions of the API, so a user could point to a different base URL if they want to use a previous collection version?
   - What time period would we want to keep older versions of the data for? Need to check with DPub to see what the retirement policy is for SWOT. Once the data in a previous version is no longer available through the archive, it shouldn't be available in hydrocron.
2. Allow loading new data alongside old data in the same table
   - This would involve allowing the operator to specify the collection name/version as an input to the lambda function, which would intentionally result in multiple data versions in the same table. The version/collection name would be returned in the API response, but it is not a field that can be queried on.
   - New data that has the same reach_id/timestamp as existing data would overwrite the existing version (e.g. in a reprocessing campaign), but it would be much harder to manually clear out older versions of the data if there is ever a need to remove them (e.g. if a collection is retired without a new version to replace it). Not sure if that's expected to ever occur.
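A rough sketch of what option 2 could look like in the load lambda (the event shape, table name, and keys here are illustrative only, not the actual hydrocron code):

```python
# Hypothetical: an operator-supplied collection name flows into the load lambda
# and is stored on each item; items sharing the existing primary key
# (reach_id + timestamp) are simply overwritten by the newer data version.
import boto3

def lambda_handler(event, context):
    table = boto3.resource("dynamodb").Table("hydrocron-swot-reach-table")
    collection = event["collection_shortname"]  # specified by the operator
    for feature in event["features"]:
        item = dict(feature)
        # Returned in API responses, but not a queryable key in this option.
        item["collection_shortname"] = collection
        table.put_item(Item=item)  # put_item replaces any item with the same key
    return {"count": len(event["features"])}
```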
@cassienickles @frankinspace @nikki-t would appreciate your thoughts