podaac / hydrocron

API for retrieving time series of SWOT data
https://podaac.github.io/hydrocron/
Apache License 2.0

Features with large geometries cannot be loaded in Hydrocron with current methods #210

Closed torimcd closed 3 months ago

torimcd commented 4 months ago

SWOT Lake data has much larger and more complex geometries (polygons) than Rivers (lines) or Nodes (points). Anecdotally, lake polygon geometries can have ~50,000+ vertices. DynamoDB has an item size limit of 400 KB. These very large geometries can exceed the item limit and cause the database load to fail.
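To get a rough sense of scale, the sketch below builds a synthetic 50,000-vertex polygon and measures its size as GeoJSON text. The vertex count comes from the estimate above; the circular shape, the six-decimal coordinate precision, and the helper names are assumptions for illustration only. The real stored size depends on how Hydrocron encodes the geometry, and the 400 KB limit is read here as 400 × 1024 bytes.

```python
import json
import math

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # 400 KB item limit, read as KiB (assumption)


def synthetic_polygon(n_vertices, lon0=-120.0, lat0=45.0, radius_deg=0.1):
    """Return a GeoJSON-style polygon approximating a circle (illustrative only)."""
    ring = []
    for i in range(n_vertices):
        angle = 2.0 * math.pi * i / n_vertices
        ring.append([
            round(lon0 + radius_deg * math.cos(angle), 6),
            round(lat0 + radius_deg * math.sin(angle), 6),
        ])
    ring.append(ring[0])  # GeoJSON rings repeat the first vertex at the end
    return {"type": "Polygon", "coordinates": [ring]}


def geojson_size_bytes(geometry):
    """Size of the geometry serialized as compact UTF-8 GeoJSON text."""
    return len(json.dumps(geometry, separators=(",", ":")).encode("utf-8"))


polygon = synthetic_polygon(50_000)
size = geojson_size_bytes(polygon)
print(f"{size} bytes, over limit: {size > DYNAMODB_ITEM_LIMIT_BYTES}")
```

Even before adding any other attributes, a text-encoded polygon of this size is well past the item limit, which is consistent with the load failures described above.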

We have a few options for how to handle this:

  1. Store the full items as JSON objects in S3, with the DynamoDB entry containing just the lake_id, time, and a pointer to the S3 record: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-use-s3-too.html

  2. Simplify the polygons, which will require deciding on and documenting the tolerance value to use as well as any impact to topology: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.simplify.html

  3. Store only the centerpoints to continue supporting geojson as a return type, but otherwise not promise to return the geometry field at all. My thinking here is that since Hydrocron is not designed to support direct visualization of the features (which will eventually be available through EGIS), I'm not sure if there's value in returning a simplified geometry over no geometry for prior lakes specifically.
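As a minimal sketch of option 1, the function below splits one lake record into a small index item (the part that would go in DynamoDB) and the full JSON payload (the part that would go in S3). It does not call AWS at all. The function name, the `lakes/<lake_id>/<time>.json` key layout, the `s3_uri` attribute, and the bucket name are all hypothetical, and the input is assumed to be a dict with `lake_id`, `time`, and `geometry` keys.

```python
import json


def split_for_storage(feature, bucket):
    """Split one lake record into a small index item and a full S3 payload.

    `feature` is assumed to be a dict with `lake_id`, `time`, and `geometry`
    keys; the key layout and attribute names here are hypothetical.
    """
    s3_key = f"lakes/{feature['lake_id']}/{feature['time']}.json"
    index_item = {
        "lake_id": feature["lake_id"],
        "time": feature["time"],
        "s3_uri": f"s3://{bucket}/{s3_key}",  # pointer to the full record
    }
    payload = json.dumps(feature).encode("utf-8")  # body that would be uploaded to S3
    return index_item, s3_key, payload


# Example with made-up values
feature = {
    "lake_id": "example_lake",
    "time": "2024-01-01T00:00:00Z",
    "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
}
index_item, s3_key, payload = split_for_storage(feature, "example-bucket")
```

The index item stays small regardless of how many vertices the geometry has, at the cost of a second fetch from S3 whenever the full record is needed.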

A consideration is that ~geometries for prior lakes can be found in the PLD (SWORD equivalent)~ reference geometries are in the PLD, but the geometries in the prior shapefiles do not necessarily match these; whatever we decide here will also affect how we implement support for observed and unassigned lakes in the future, where the geometries can only be found in the source shapefiles.

Will need to gather stakeholder input on specific use cases, also curious to hear thoughts from @cassienickles @ScienceCat18 @frankinspace @nikki-t

ScienceCat18 commented 4 months ago

My initial thoughts are that returning the full geometries of the lake feature would be important to end users, even if their end goal is not visualization. But I think the best thing to do is to ask our early adopters at the USGS/OSU/CUAHSI WG meeting, to get a better sense for the applications of the service (use cases) with lakes data. USGS mentioned they've been getting questions about lakes (even more so than rivers), so I am hoping they can provide some initial guidance on use cases. It could turn out that simplifying the geometry is sufficient for the API applications, and the WFS service (still to come) will facilitate the use cases where full geometries are needed.

frankinspace commented 4 months ago

We discussed two possibilities:

  1. Delay deployment of lakes in favor of returning full geometry. Realistically this would push back any support for Lakes to FY25.
  2. Continue with lake implementation and delay support for full lake geometry. We can open a feature analysis for supporting full lake geometry and discuss where to prioritize that against other features.

We decided that the second option was preferred so that we can get the lake data in front of users sooner.

On further discussion, option 2 itself had a few possible implementations right now:

We agreed GeoJSON output was one of the key features of the API and should be maintained. This left the decision of how to 'simplify' the geometry.

Trying to calculate a simplified geometry consistently for every lake (and documenting how that simplification is done) is too large in scope.

The decision is instead to calculate a center point from the polygon geometry and return only that center point.
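For illustration, one common way to compute a center point is the area-weighted centroid from the shoelace formula, sketched below in plain Python. The thread does not say which method Hydrocron uses, so this is an assumption, not the project's implementation. The function name is hypothetical, coordinates are treated as planar, and holes are ignored.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon ring given as (x, y) pairs.

    Treats coordinates as planar and ignores holes; both are simplifications.
    """
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring if needed
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0.0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)


# A 2 x 2 square has its centroid at (1, 1)
center = polygon_centroid([(0, 0), (2, 0), (2, 2), (0, 2)])
```

Note that for concave shapes, which are common for lakes, a centroid can fall outside the polygon, so the choice of method is worth documenting alongside the API version.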

Once we do analysis for how to handle the full geometry, we can implement an API version increment so we can easily document that Version X of the API returns center points of lakes and Version X+1 returns full geometries.

torimcd commented 3 months ago

Documenting some additional findings: