quince-science / QuinCe

QuinCe is an online tool for processing and quality control of data from scientific instruments, with a primary focus on oceanic data.
https://quince.science
GNU General Public License v3.0

Pass data between jobs #2675

Open squaregoldfish opened 1 year ago

squaregoldfish commented 1 year ago

Since the jobs are typically I/O bound at this point, can we pass data from one job to the next in the chain? For example, AutoQCJob and DataReductionJob both need SensorValues. (It's 2:30am and I can't remember the exact requirements, but you get the idea.)

This might mean that a dataset's processing needs to be finished in one go instead of the scheduler jumping between datasets, to prevent large amounts of RAM being held while switching between datasets when lots are queued at once.

squaregoldfish commented 1 year ago

Possibly implement some concept of a "Job Sequence" for data processing jobs to run in one go.

The main sequence will start at Sensor QC and run through to Data Reduction QC. We pass a map of Objects between the jobs; each following job knows what must be in the map for it to do its work, and fails if anything is missing.
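One possible shape for this, sketched with hypothetical `JobSequence`/`SequencedJob` types (none of these are existing QuinCe classes): each job declares the transfer-map keys it requires, and the sequence fails fast if one is missing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: the interface/class names and the String-keyed map are
// assumptions about how the transfer data could be structured.
interface SequencedJob {
  /** Keys that must already be in the transfer map before this job runs. */
  Set<String> requiredKeys();

  /** Run the job, reading from and adding to the shared transfer map. */
  void run(Map<String, Object> transferData) throws Exception;
}

class JobSequence {
  private final List<SequencedJob> jobs;

  JobSequence(List<SequencedJob> jobs) {
    this.jobs = jobs;
  }

  void run() throws Exception {
    Map<String, Object> transferData = new HashMap<>();

    for (SequencedJob job : jobs) {
      for (String key : job.requiredKeys()) {
        if (!transferData.containsKey(key)) {
          throw new IllegalStateException("Missing transfer data '" + key
            + "' for " + job.getClass().getSimpleName());
        }
      }
      job.run(transferData);
    }
  }
}
```

The first job in the sequence would have an empty `requiredKeys()` and load everything from the database itself, which also fits the note below about requeuing.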

Note that jobs that are not the first entry in a sequence cannot be requeued.

squaregoldfish commented 1 year ago

Note that applying sensor offsets restarts the calculations at Data Reduction, which is in the middle of the standard job flow.
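If something like the sketch above were used, the sequence would also need an entry point partway through the flow, e.g. a hypothetical `runFrom` method added to the `JobSequence` sketch. The caller (or the starting job itself) would have to provide whatever the skipped jobs would normally have put in the transfer map.

```java
// Hypothetical addition to the JobSequence sketch above: start the sequence
// at a named job, e.g. Data Reduction after sensor offsets change.
void runFrom(Class<? extends SequencedJob> startJob,
  Map<String, Object> transferData) throws Exception {

  boolean started = false;

  for (SequencedJob job : jobs) {
    if (!started && !startJob.isInstance(job)) {
      continue; // Skip jobs before the requested starting point
    }
    started = true;

    for (String key : job.requiredKeys()) {
      if (!transferData.containsKey(key)) {
        throw new IllegalStateException("Missing transfer data '" + key + "'");
      }
    }
    job.run(transferData);
  }
}
```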

squaregoldfish commented 1 year ago

Be careful with SearchableSensorValuesList: if it has built groups for grouped measurements, they will become invalid if the QC changes.
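Purely as an illustration of guarding against that (these names are not SearchableSensorValuesList's real API), the groups could be treated as a cache that is dropped whenever a QC flag changes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: measurement groups rebuilt lazily from current QC flags,
// and invalidated whenever a flag changes so stale groups can't be used.
class GroupedValuesCache {

  private List<List<SensorValue>> groups = null;

  List<List<SensorValue>> getGroups(List<SensorValue> values) {
    if (null == groups) {
      groups = buildGroups(values);
    }
    return groups;
  }

  /** Call whenever a value's QC flag is updated. */
  void invalidate() {
    groups = null;
  }

  private List<List<SensorValue>> buildGroups(List<SensorValue> values) {
    // Grouping logic omitted; depends on the measurement grouping rules.
    return new ArrayList<>();
  }
}
```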

squaregoldfish commented 3 weeks ago

Testing stats:

Original: Recalculate dataset: 2m 50s, 10.8 GB

squaregoldfish commented 3 weeks ago

DatasetSensorValues operates in different modes (All, Ignore Flushing, Ignore Internal Calibrations). This will need to be built into the class, because different jobs need different modes.

1am thought: Maybe we just pass a list of all SensorValues between jobs, and build a new DatasetSensorValues object from it at the start of each job, filtering as needed? Not sure how that will affect memory usage for all the data structures inside a DatasetSensorValues object, but it might not be too bad since we won't be duplicating SensorValue objects. Try it and see.
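A rough sketch of that filtering step, assuming a hypothetical `Mode` enum and placeholder predicates (the real DatasetSensorValues API will differ). Only the collections are new; the `SensorValue` objects themselves are shared, so they are not duplicated in memory.

```java
import java.util.Collection;
import java.util.stream.Collectors;

// Sketch only: filter the shared SensorValues for the mode a job needs before
// building its DatasetSensorValues-style view. Mode and the predicates below
// are assumptions, not the existing QuinCe API.
class SensorValueFilter {

  enum Mode { ALL, IGNORE_FLUSHING, IGNORE_INTERNAL_CALIBRATIONS }

  static Collection<SensorValue> filterForMode(Collection<SensorValue> allValues, Mode mode) {
    return allValues.stream()
      .filter(v -> switch (mode) {
        case ALL -> true;
        case IGNORE_FLUSHING -> !isFlushing(v);
        case IGNORE_INTERNAL_CALIBRATIONS -> !isInternalCalibration(v);
      })
      .collect(Collectors.toList());
  }

  // Placeholder predicates standing in for however the real code identifies
  // flushing periods and internal calibration values.
  private static boolean isFlushing(SensorValue value) { return false; }
  private static boolean isInternalCalibration(SensorValue value) { return false; }
}
```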

At the top of each job, check whether the TreeSet<SensorValue> exists in the transfer data, and if not, load it. We may need a new method in DataSetDataDB for this.
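Something like this "get or load" step could sit in a shared job base class; the transfer-map key and the `DataSetDataDB.loadSensorValues` call are assumptions, since (as noted above) a new method in DataSetDataDB may be needed.

```java
// Sketch only: fetch the SensorValues handed over by the previous job, or
// fall back to loading from the database if this job was run on its own.
@SuppressWarnings("unchecked")
private TreeSet<SensorValue> getSensorValues(Map<String, Object> transferData,
  Connection conn, long datasetId) throws Exception {

  TreeSet<SensorValue> sensorValues =
    (TreeSet<SensorValue>) transferData.get("sensorValues");

  if (null == sensorValues) {
    sensorValues = DataSetDataDB.loadSensorValues(conn, datasetId); // hypothetical method
    transferData.put("sensorValues", sensorValues);
  }

  return sensorValues;
}
```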