tagbase / tagbase-server

tagbase-server is a data management web service for working with eTUFF and nc-eTAG files.
https://oiip.jpl.nasa.gov/doc/OIIP_Deliverable7.4_TagbasePostgreSQLeTUFF_UserGuide.pdf
Apache License 2.0

Move migration logic from SQL stored procedure to python #313

Open renato2099 opened 9 months ago

renato2099 commented 9 months ago

Right now, after we ingest the eTUFF file we write it to disk and then let Postgres ingest and process it via a stored procedure. Since we already have all of the data in memory as a dataframe, we should process it there and only write the final results to Postgres.
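A minimal sketch of the proposed flow, assuming the eTUFF parser already yields a long-format dataframe with `date_time`, `variable_name`, and `value` columns (the sample data and column names here are illustrative, not the real parser output):

```python
import io
import pandas as pd

# Hypothetical stand-in for the dataframe we already hold after ingestion;
# built inline so the sketch is self-contained.
raw = io.StringIO(
    "date_time,variable_name,value\n"
    "2021-01-01T00:00:00,depth,10.5\n"
    "2021-01-01T00:00:00,temperature,18.2\n"
    "2021-01-01T01:00:00,depth,12.0\n"
    "2021-01-01T01:00:00,temperature,17.9\n"
)
df = pd.read_csv(raw, parse_dates=["date_time"])

# Process in memory instead of round-tripping through disk + stored procedure:
# pivot the long-format observations into one row per timestamp.
time_series = df.pivot(index="date_time", columns="variable_name", values="value")

# Only the finished result would be written to Postgres, e.g. with SQLAlchemy:
# time_series.to_sql("data_time_series", engine, if_exists="append")
print(time_series)
```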

renato2099 commented 9 months ago

Hi @lewismc, these are the high-level steps we currently run in the SQL stored procedure:

1. populate `data_time_series` using data from `proc_observations` where observation_types.variable_name IN ('datetime', 'depth', 'temperature', 'light', 'internal temperature')

2. populate `data_position` using data from `proc_observations` where observation_types.variable_name IN ('longitude')
3. update `data_position` with latitudes
4. update `data_position` with lon_error and lat_error values
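Steps 1-4 map naturally onto dataframe filters and a merge. A rough sketch, assuming a dataframe that stands in for `proc_observations` joined with `observation_types` (the sample rows are fabricated; the variable names come from the procedure above):

```python
import pandas as pd

# Hypothetical stand-in for proc_observations joined with observation_types.
obs = pd.DataFrame({
    "date_time": pd.to_datetime(["2021-01-01"] * 4),
    "variable_name": ["depth", "temperature", "longitude", "latitude"],
    "value": [10.5, 18.2, -120.3, 34.1],
})

# Step 1: the time-series variables feed data_time_series.
TS_VARS = ["datetime", "depth", "temperature", "light", "internal temperature"]
data_time_series = obs[obs["variable_name"].isin(TS_VARS)]

# Steps 2-3: longitudes seed data_position, latitudes are merged in
# (lon_error / lat_error in step 4 would follow the same merge pattern).
lon = obs[obs["variable_name"] == "longitude"].rename(columns={"value": "lon"})
lat = obs[obs["variable_name"] == "latitude"].rename(columns={"value": "lat"})
data_position = lon[["date_time", "lon"]].merge(lat[["date_time", "lat"]], on="date_time")
print(data_position)
```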

5. populate `data_histogram_bin_info` where b.variable_name LIKE 'HistDepthBinMax%'
6. populate `data_histogram_bin_data` where b.variable_name LIKE 'TimeAt%'
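The SQL `LIKE 'prefix%'` filters in steps 5-6 become string-prefix filters in pandas. A small sketch with made-up values (only the variable-name patterns come from the procedure):

```python
import pandas as pd

# Hypothetical observations; the LIKE patterns are from the stored procedure.
obs = pd.DataFrame({
    "variable_name": ["HistDepthBinMax1", "HistDepthBinMax2",
                      "TimeAt1", "TimeAt2", "depth"],
    "value": [50.0, 100.0, 30.0, 70.0, 10.5],
})

# LIKE 'HistDepthBinMax%' -> data_histogram_bin_info
bin_info = obs[obs["variable_name"].str.startswith("HistDepthBinMax")]
# LIKE 'TimeAt%' -> data_histogram_bin_data
bin_data = obs[obs["variable_name"].str.startswith("TimeAt")]
print(len(bin_info), len(bin_data))
```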

7. populate `data_profile` where b.variable_name LIKE 'PdtDepth%'
8. update `data_profile` where b.variable_name LIKE 'PdtTempMin%'
9. update `data_profile` where b.variable_name LIKE 'PdtTempMax%'
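In a dataframe, the populate-then-update passes of steps 7-9 can collapse into one pivot, by splitting each variable name into a kind (`PdtDepth` / `PdtTempMin` / `PdtTempMax`) and a bin index. A sketch with fabricated values:

```python
import pandas as pd

# Hypothetical profile observations; the name patterns match steps 7-9.
obs = pd.DataFrame({
    "variable_name": ["PdtDepth1", "PdtDepth2", "PdtTempMin1", "PdtTempMin2",
                      "PdtTempMax1", "PdtTempMax2"],
    "value": [0.0, 50.0, 15.0, 12.0, 20.0, 16.0],
})

# Split the name into a prefix ("kind") and a bin index, then pivot so the
# three populate/update passes become a single wide table.
parts = obs["variable_name"].str.extract(r"^(PdtDepth|PdtTempMin|PdtTempMax)(\d+)$")
obs[["kind", "bin"]] = parts
data_profile = obs.pivot(index="bin", columns="kind", values="value")
print(data_profile)
```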

 -- SQL update statements to link measurement date time with position date time
10. update `data_time_series`.`position_date_time` using data from `date_time`
11. update `data_histogram_bin_data`.`position_date_time` using data from `date_time`
12. update `data_profile`.`position_date_time` using data from `date_time`
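The three linking updates in steps 10-12 could each be replaced by a nearest-timestamp join in memory, e.g. with `pandas.merge_asof`. A sketch with fabricated timestamps (whether "nearest" is the right matching rule is an assumption to verify against the procedure):

```python
import pandas as pd

# Hypothetical measurement and position timestamps; the stored procedure
# links these with correlated UPDATE statements.
measurements = pd.DataFrame({
    "date_time": pd.to_datetime(["2021-01-01 00:10", "2021-01-01 06:20"]),
    "depth": [10.5, 40.0],
})
positions = pd.DataFrame({
    "position_date_time": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 06:00"]),
    "lon": [-120.3, -120.1],
})

# merge_asof attaches the nearest position timestamp to each measurement row,
# which is what populating position_date_time amounts to.
linked = pd.merge_asof(
    measurements.sort_values("date_time"),
    positions.sort_values("position_date_time"),
    left_on="date_time",
    right_on="position_date_time",
    direction="nearest",
)
print(linked)
```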

Another advantage of moving this to Python is that we could more easily parallelize processing of the file, while also avoiding shuffling data across different tables.
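Since each derived table only reads from the shared in-memory dataframe, the builders are independent and could run concurrently. A toy sketch of that structure (the builder functions and their filters are illustrative, not the real implementation):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical shared observations dataframe.
obs = pd.DataFrame({
    "variable_name": ["depth", "longitude", "HistDepthBinMax1"],
    "value": [10.5, -120.3, 50.0],
})

# Each builder only reads obs, so they have no ordering dependency.
def build_time_series(df):
    return df[df["variable_name"].isin(["depth", "temperature"])]

def build_position(df):
    return df[df["variable_name"] == "longitude"]

def build_histograms(df):
    return df[df["variable_name"].str.startswith("HistDepthBinMax")]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fn, obs)
               for fn in (build_time_series, build_position, build_histograms)]
    results = [f.result() for f in futures]
print([len(r) for r in results])
```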