tagbase / tagbase-server

tagbase-server is a data management web service for working with eTUFF and nc-eTAG files.
https://oiip.jpl.nasa.gov/doc/OIIP_Deliverable7.4_TagbasePostgreSQLeTUFF_UserGuide.pdf
Apache License 2.0
7 stars 2 forks source link

Rearchitect sp_execute_data_migration stored procedure to run in memory #160

Open lewismc opened 1 year ago

lewismc commented 1 year ago

NOTE AS OF 2023-06-12 this issue scope changed but has been kept open and updated. See the bottom of the thread for more context.

We should consider areas for improvement regarding interactions with PostgreSQL. Specifically, the data migration which takes place in the TRIGGER. The goal is to optimize data movement upon ingestion.

lewismc commented 1 year ago

We should also investigate using pg_bulkload

lewismc commented 1 year ago

We also need to explore why data is being 'left behind' in proc_observations even after the TRIGGER has finished.

lewismc commented 1 year ago

@renato2099 suggested that we literally rearchitect the existing ingestion logic such that the sp_execute_data_migration stored procedure is simply implemented in-memory when we initially process the data. The ultimate goal is to only process each row of the eTUFF file once. I am going to rename this issue accordingly.

lewismc commented 1 year ago

Some considerations for the implementation include entirely removing the concept of the proc_observations table and then introducing a data_observations table which would be used for long-term persistence for all data which does not fit the following criteria

  1. data_time_series: (time); time here is usually at a high frequency, it can be in intervals of seconds or less
  2. data_position: (time, longitude, latitude)
  3. data_profile: (time, depth) where depth represents a standard depth in meters
  4. data_histogram_bin_data: (time, x, y) where dimension x comes from histogram_bin_info (i.e., how the histogram should be set up), and y comes from a groupby method on a measurement, e.g., count of occurrence or average of temperature. Naturally, the measurement specified by x and y should be meaningful, as there is no point to summarized maximum depth within the depth bin of 0-100 m.
  5. data_histogram_bin_info: containing definitions of binning schemes for summary tag data