Features with large geometries cannot be loaded in Hydrocron with current methods

SWOT Lake data has much larger and more complex geometries (polygons) than Rivers (lines) or Nodes (points). Anecdotally, a single lake polygon can have 50,000 or more vertices. DynamoDB limits each item to 400 KB, so these very large geometries can exceed the limit and cause the database load to fail.
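A rough way to see the scale of the problem is to serialize a synthetic polygon and compare its size to the limit. This is only a sketch: it assumes the geometry is stored as GeoJSON text and treats 400 KB as 400 * 1024 bytes, and the helper names are made up for illustration. How Hydrocron actually serializes geometries is not stated here.

```python
import json
import math

# Assumption: the 400 KB item limit is taken as 400 * 1024 bytes. The real
# limit also counts attribute names and every other attribute on the item.
DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024


def synthetic_polygon(n_vertices, center=(-120.0, 45.0), radius=0.5):
    """Build a closed ring of n_vertices points on a circle (illustrative only)."""
    cx, cy = center
    ring = [
        [
            round(cx + radius * math.cos(2 * math.pi * i / n_vertices), 6),
            round(cy + radius * math.sin(2 * math.pi * i / n_vertices), 6),
        ]
        for i in range(n_vertices)
    ]
    ring.append(ring[0])  # GeoJSON rings repeat the first vertex at the end
    return {"type": "Polygon", "coordinates": [ring]}


def geometry_size_bytes(geometry):
    """Size of the geometry when serialized as UTF-8 encoded GeoJSON text."""
    return len(json.dumps(geometry).encode("utf-8"))


def exceeds_item_limit(geometry, limit=DYNAMODB_ITEM_LIMIT_BYTES):
    """True when the serialized geometry alone is larger than the limit."""
    return geometry_size_bytes(geometry) > limit
```

Calling `exceeds_item_limit(synthetic_polygon(50_000))` shows that a polygon of that size cannot fit in a single item under these assumptions, before any other attributes are counted.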

We have a few options for how to handle this:

Store the full items as JSON objects in S3, with the DynamoDB entry containing just the lake_id, time, and a pointer to the S3 record: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-use-s3-too.html
Simplify the polygons, which will require deciding on and documenting the tolerance value to use, as well as any impact on topology: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.simplify.html
Store only the centerpoints to continue supporting geojson as a return type, but otherwise not promise to return the geometry field at all. Hydrocron is not designed to support direct visualization of the features (which will eventually be available through EGIS), so I'm not sure there is value in returning a simplified geometry over no geometry, for prior lakes specifically.
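To make the first option concrete, here is a minimal sketch of how one record could be split into a small DynamoDB item and an S3 payload. The bucket name, key layout, and attribute names (`time`, `s3_uri`) are hypothetical, and no AWS calls are made; the actual upload and table write would be done separately.

```python
import json


def split_for_storage(record, bucket="example-hydrocron-lakes"):
    """Split one lake record into a small table item and an S3 payload.

    Every name here (bucket, key layout, attribute names) is hypothetical.
    """
    lake_id = record["lake_id"]
    time = record["time"]
    s3_key = f"prior_lakes/{lake_id}/{time}.json"

    # The table item keeps only the keys and a pointer to the full record.
    ddb_item = {
        "lake_id": lake_id,
        "time": time,
        "s3_uri": f"s3://{bucket}/{s3_key}",
    }

    # The full record, including the large geometry, goes to S3 as JSON.
    s3_body = json.dumps(record)
    return ddb_item, s3_key, s3_body
```

The trade-off is an extra S3 read per returned feature whenever the API needs the geometry.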

A consideration is that ~geometries for prior lakes can be found in the PLD (SWORD equivalent)~ reference geometries are in the PLD, but the geometries in the prior shapefiles do not necessarily match these. Whatever we decide here will also affect how we implement support for observed and unassigned lakes in the future, where the geometries can only be found in the source shapefiles.

We will need to gather stakeholder input on specific use cases. I'm also curious to hear thoughts from @cassienickles @ScienceCat18 @frankinspace @nikki-t

My initial thoughts are that returning the full geometries of the lake feature would be important to end users, even if their end goal is not visualization. But I think the best thing to do is to ask our early adopters at the USGS/OSU/CUAHSI WG meeting, to get a better sense for the applications of the service (use cases) with lakes data. USGS mentioned they've been getting questions about lakes (even more so than rivers), so I am hoping they can provide some initial guidance on use cases. It could turn out that simplifying the geometry is sufficient for the API applications, and the WFS service (still to come) will facilitate the use cases where full geometries are needed.

We discussed two possibilities:

Delay deployment of lakes in favor of returning full geometry. Realistically this would push back any support for Lakes to FY25.
Continue with lake implementation and delay support for full lake geometry. We can open a feature analysis for supporting full lake geometry and discuss where to prioritize that against other features.

We decided that the second option was preferred so that we can get the lake data in front of users sooner.

On further discussion, option 2 itself has a few options for implementation right now:

Disable geojson output entirely because geometry is a required element of geojson. This would mean only csv output would be supported for prior lake data.
Leave geojson enabled but 'simplify' the geometry returned.

We agreed geojson output was one of the key features of the API and should be maintained. This left the decision of how to 'simplify' the geometry.

Calculate the center point and return just the center point
Calculate a 'simpler' geometry (fewer polygon edges) and return that

Trying to calculate a simple geometry consistently for every lake (and documenting how that simplification is done) is too large in scope for now.

The decision is instead to calculate the center point of the polygon geometry and return only that point.
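A minimal sketch of such a center point calculation, assuming "center point" means the area-weighted centroid of the polygon's exterior ring (the thread does not pin down the definition). It works in plain lon/lat coordinates, ignores holes and multi-part geometries, and the function names are made up for illustration.

```python
def ring_centroid(ring):
    """Area-weighted centroid of a closed ring, using the shoelace formula."""
    area2 = 0.0  # twice the signed area of the ring
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate ring with zero area")
    # Centroid = sum / (6 * area), and area = area2 / 2.
    return cx / (3 * area2), cy / (3 * area2)


def center_point_feature(polygon, properties):
    """Return a geojson Feature whose geometry is the polygon's center point.

    Assumption: only the exterior ring (the first ring) is considered.
    """
    x, y = ring_centroid(polygon["coordinates"][0])
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [x, y]},
        "properties": properties,
    }
```

Note that for strongly concave lakes a centroid can fall outside the polygon, which may matter depending on how users interpret the returned point.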

Once we do analysis for how to handle the full geometry, we can implement an API version increment so we can easily document that Version X of the API returns center points of lakes and Version X+1 returns full geometries.

Documenting some additional findings:

PLD geometries are not in the prior lakes shapefiles. The geometries are what SWOT observed, so they change every pass.
Prior lake geometries are determined by whether the observed geometry intersects the PLD geometry: https://github.com/CNES/swot-hydrology-toolbox/blob/9c72c3522f2d884c089e0576763652aa33425234/processing/src/cnes/common/lib_lake/proc_lake.py#L711
The center point that we calculate will then be the center point of the observed geometry, not of the PLD geometry.
We will eventually want to store both the full observed geometry from the obs_lake shapefiles and the observed geometries from the prior_lake shapefiles - they are related but different.
If a lake in the PLD was not observed for whatever reason, the prior_lakes shapefiles still contain a record for the lake id from the PLD, but with a null geometry assigned. This will break things when we try to return geojson. I think our options are to either a) not write these features to the hydrocron database, or b) use the centerpoint from the PLD geometries. Option a) will be much more straightforward to implement, and since we have already made design decisions that prioritize getting data out sooner I think it would be ok for this version, but it is a change in behavior from how we chose to handle the same issue for time with rivers. For rivers, no time is recorded when there wasn't an observation during a satellite pass, but we still include the record with the start time of the pass so that users can choose how to deal with the null observation. A null observation during an otherwise valid pass is a different situation than no pass during a time of interest, and users may want to handle those differently. Edit: I suppose we also have the option of c) writing the null geometries to the prior-lakes table now, and choosing to remove/ignore those values in the API code. We will want to decide in the future whether to look up the PLD geometries in the API code or before writing to the DB. My hunch is that writing the geometries we want to return to the DB would be more performant, but then we would have no way to indicate to the user whether a geometry is from the PLD or was observed by SWOT.
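Options a) and c) differ only in where the null-geometry records are dropped. A minimal sketch of both, assuming each record is a dict with `lake_id` and `geometry` keys (hypothetical names, not the actual Hydrocron schema):

```python
def has_geometry(record):
    """True when the record carries a usable (non-null, non-empty) geometry."""
    geometry = record.get("geometry")
    return geometry is not None and geometry != ""


def records_to_load(records):
    """Option a): drop null-geometry records before writing to the table."""
    return [r for r in records if has_geometry(r)]


def to_feature_collection(records):
    """Option c): keep everything in the table, filter when building geojson."""
    features = [
        {
            "type": "Feature",
            "geometry": r["geometry"],
            "properties": {"lake_id": r["lake_id"]},
        }
        for r in records
        if has_geometry(r)
    ]
    return {"type": "FeatureCollection", "features": features}
```

With option c) the unobserved records stay available for csv output or a later PLD lookup. As a side note, RFC 7946 does allow a Feature whose geometry member is null, so whether a null geometry breaks the response depends on the serialization code and on what clients expect.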

podaac / hydrocron

Features with large geometries cannot be loaded in Hydrocron with current methods #210