simpeg / aurora

software for processing natural source electromagnetic data
MIT License

FC management scheme for processing #319

Open kkappler opened 5 months ago


FCs (Fourier coefficients) have been stored according to their run_ids in aurora. This can cause some unwanted behaviour:

Consider a case where we are processing a remote reference dataset in which station runs do not overlap cleanly. Below is a screengrab from processing CAS04 with NVR08. The first table is a simple run summary of the available data; the second table is the kernel dataset dataframe. Note that in the first table, run b is unique at station CAS04 and is approximately 10 days long. However, in the kernel dataset, run b from CAS04 is referenced twice:

  1. Rows 0 and 1 pair run a from the reference station (NVR08) with run b from CAS04.
  2. Rows 2 and 3 pair run b from the reference station (NVR08) with run b from CAS04.

So when these data are processed with the current flow, we encounter the following logic. Iteration begins over kernel_dataset.df. On the first row, a 2860-second chunk of run b is extracted from the mth5, STFT-ed, and stored under /Experiment/Surveys/CONUS_South/Stations/NVR08/Fourier_Coefficients/b/. Then, on row 2, a 769090-second chunk of run b is extracted from the mth5, STFT-ed, and stored under the same level, overwriting the previous data. This might process correctly the first time, but it will likely fail the second time.

The second time we process the file, existing FCs will be detected, and on rows 0 and 1 of the df the STFT objects loaded will be the 2860 s chunk from NVR08 and the 769090 s chunk from CAS04.

[screengrab: run summary table and kernel dataset dataframe]

There are workarounds for this, but it is not clear which is best.

  1. Remove the save_fc option that saves on the fly, and replace it with a separate (optional) step to "build_fcs". FCs would be built for complete runs, and the spectrogram loader would then use indexing on the stored FCs to load the appropriate sub-run.

Perhaps the cleanest way to do this would be to process each station separately using "single station" processing with save_fc=True. Then all runs would get stored completely.

The only thing that then needs to be checked/fixed for RR is to use start and end times when loading FCs from the MTH5.
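The start/end-time loading could look roughly like the sketch below. Here `slice_fcs` is a hypothetical helper, and the dataframe stands in for a full-run FC table loaded from MTH5 (a time-indexed set of Fourier coefficients); the real loader would slice the HDF5-backed arrays the same way:

```python
import pandas as pd

def slice_fcs(fc_df: pd.DataFrame, start, end) -> pd.DataFrame:
    """Return the stored FCs whose time index falls within [start, end].

    Hypothetical helper: given FCs built for a complete run, extract
    only the window a kernel-dataset row actually needs.
    """
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    return fc_df.loc[(fc_df.index >= start) & (fc_df.index <= end)]

# A full run spans ~10 days, but a remote-reference row only needs the
# overlap window with the other station's run.
times = pd.date_range("2020-06-02", periods=10, freq="D")
fc_df = pd.DataFrame({"coeff": range(10)}, index=times)
subset = slice_fcs(fc_df, "2020-06-03", "2020-06-05")
```

With this in place, the same complete-run FC group can serve every row that references the run, regardless of how the pairings chop it up.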

  2. Each processing run (row of kernel_dataset.df) could be given an id, and the FCs could be saved under this "processing_run_id" rather than under the "acquisition_run_id".
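One way to derive such an id (a sketch only; the column names and the hashing scheme are assumptions, not an existing aurora API) is to hash the fields that make a row unique, so that two different time-chunks of the same acquisition run get distinct FC groups:

```python
import hashlib
import pandas as pd

def processing_run_id(row: pd.Series) -> str:
    """Deterministic id for one row of kernel_dataset.df.

    Hypothetical helper: unlike the acquisition run_id, two chunks of
    the same run with different start/end times hash to different ids,
    so their FCs would land in different HDF5 groups.
    """
    key = f"{row['station_id']}_{row['run_id']}_{row['start']}_{row['end']}"
    return hashlib.sha1(key.encode()).hexdigest()[:8]

df = pd.DataFrame(
    {
        "station_id": ["CAS04", "CAS04"],
        "run_id": ["b", "b"],
        "start": ["2020-06-02T00:00:00", "2020-06-03T00:00:00"],
        "end": ["2020-06-02T00:47:40", "2020-06-11T21:00:00"],
    }
)
df["processing_run_id"] = df.apply(processing_run_id, axis=1)
```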

  3. The hackaround: restrict the runs so that each shows up only once in the kernel_dataset dataframe.
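The hackaround amounts to deduplicating the kernel dataset on (station, run). A minimal sketch, again with assumed column names; note that this discards the other pairings (here, the row pairing NVR08 run b with CAS04 run b), which is exactly why it is only a workaround:

```python
import pandas as pd

# Stand-in for kernel_dataset.df with the pairings from the screengrab.
df = pd.DataFrame(
    {
        "station_id": ["NVR08", "CAS04", "NVR08", "CAS04"],
        "run_id": ["a", "b", "b", "b"],
    }
)

# Keep only the first row per (station, run) so each acquisition run
# is STFT-ed and stored exactly once.
deduped = df.drop_duplicates(subset=["station_id", "run_id"], keep="first")
```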