simpeg / aurora

software for processing natural source electromagnetic data
MIT License

Validation of Processing Summary Can Fail Due to Incorrect Uniqueness Assumption #260

Closed: kkappler closed this issue 1 year ago

kkappler commented 1 year ago

There is a new player on the stage inside of TFKernel called a "processing summary". This is a data frame derived from the RunSummary, i.e. it has one row per contiguous chunk of data to be processed, with the added twist that these rows are then replicated once per decimation level. So if the run summary has, say, 6 rows and there are 4 decimation levels, then the processing_summary will have 24 rows.
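For concreteness, here is a minimal sketch of that replication step in pandas. The column names, the toy data, and `num_decimation_levels` are illustrative assumptions, not aurora's actual schema or API:

```python
import pandas as pd

# Toy stand-in for a RunSummary: one row per contiguous chunk of data.
# Column names are illustrative, not aurora's actual schema.
run_summary = pd.DataFrame(
    {
        "survey": ["S1"] * 6,
        "station": ["P", "P", "P", "R", "R", "R"],
        "run_id": ["a", "b", "c", "a", "b", "c"],
    }
)

num_decimation_levels = 4  # illustrative value

# Replicate every run_summary row once per decimation level.
processing_summary = pd.concat(
    [run_summary.assign(decimation_level=level) for level in range(num_decimation_levels)],
    ignore_index=True,
)

assert len(processing_summary) == 6 * 4  # 24 rows, as described above
```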

The role of processing_summary is to provide the opportunity to evaluate each run-decimation_level pair before processing gets underway, and to flag conditions that would make a pair invalid.

I had made the assumption that the tuple (survey, station, run_id, decimation_level) would be unique, and so I had put a sanity-check assert statement in to that effect. This is not a valid assumption. Consider the case where the primary station (P) has one run (a) lasting, say, 2 weeks, but the reference station (R) has two runs (a, b), each 1 week long. The run summary will then contain the run pairings ((P, a), (R, a)) and ((P, a), (R, b)), which is to say that (P, a) will occur more than once.
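A minimal sketch of that failure mode, assuming a pandas data frame with illustrative column names (not aurora's actual schema):

```python
import pandas as pd

# The two run pairings described above, flattened to one row per station-run,
# at a single decimation level. (P, a) appears once per pairing.
paired_rows = pd.DataFrame(
    {
        "survey": ["S1", "S1", "S1", "S1"],
        "station": ["P", "R", "P", "R"],
        "run_id": ["a", "a", "a", "b"],
        "decimation_level": [0, 0, 0, 0],
    }
)

key = ["survey", "station", "run_id", "decimation_level"]
print(paired_rows.duplicated(subset=key).any())  # True: (S1, P, a, 0) occurs twice
```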

This assumption needs to be removed. Moreover, a bookkeeping scheme needs to be built to handle these cases, so that valid/invalid is applied to each individual row of the processing summary.

kkappler commented 1 year ago

@laura-iris : I debugged the problem with the script you committed. It is a bug, and I will work on fixing it this month. In the meantime, working with another station pair may avoid this issue - it is a special case of overlapping runs.

kkappler commented 1 year ago

This problem is observed in process_mth5.process_mth5 during the STFT loop. A single row of the tfk.dataset_df is checked for validity, but when is_valid is run, it checks more rows than it should, due to the assumption that (survey, station, run, decimation_level) is unique. In reality, run can be duplicated, so we need more specificity ... using the time interval should solve this.
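A hedged sketch of that idea; `dataset_df`, the column names, and the `start`/`end` fields are assumptions for illustration, not aurora's actual implementation:

```python
import pandas as pd

def select_row(dataset_df: pd.DataFrame, row) -> pd.DataFrame:
    """Select the dataset_df row(s) matching `row`, using the run's time
    interval in addition to (survey, station, run_id, decimation_level)
    to disambiguate duplicated runs."""
    cond = (
        (dataset_df["survey"] == row.survey)
        & (dataset_df["station"] == row.station)
        & (dataset_df["run_id"] == row.run_id)
        & (dataset_df["decimation_level"] == row.decimation_level)
        & (dataset_df["start"] == row.start)  # added specificity:
        & (dataset_df["end"] == row.end)      # the run's time interval
    )
    return dataset_df[cond]
```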

N.B. update_dataset_df() also calls is_valid, and will need the same treatment.

kkappler commented 1 year ago

@laura-iris : I added another condition to the validation test that solves this problem. Please update the branch earthscope_tests and let me know if you still encounter it.