spacepy / dbprocessing

Automated processing controller for heliophysics data

Process by instrument #59

Status: Open · dnadeau-lanl opened this issue 3 years ago

dnadeau-lanl commented 3 years ago

My suggested enhancement is ...

Relation to an issue

One mission/satellite can have many different instruments, which need to be processed separately.

Proposed enhancement

Create an instrument column in the mission table.

Call ProcessQueue.py with a new "instrument" flag -t.

./ProcessQueue.py -i -m sndd -t sage

This flag changes all run/unit tests. Maybe we could have a default instrument for compatibility.
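A minimal sketch of what the new option could look like, assuming argparse and mirroring the flags in the example command above; the actual ProcessQueue.py option handling may differ.

```python
# Hypothetical sketch of a -t/--instrument option; the real ProcessQueue.py
# option handling may differ. The meanings of -i and -m are assumed from
# the example command above.
import argparse

parser = argparse.ArgumentParser(description="dbprocessing queue runner (sketch)")
parser.add_argument("-i", action="store_true", dest="ingest",
                    help="ingest mode (assumed meaning)")
parser.add_argument("-m", dest="mission", required=True,
                    help="mission database to operate on")
parser.add_argument("-t", dest="instrument", default=None,
                    help="restrict processing to this instrument; "
                         "default None keeps current all-instrument behavior")

args = parser.parse_args(["-i", "-m", "sndd", "-t", "sage"])
print(args.mission, args.instrument)  # -> sndd sage
```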

Version of dbprocessing

LANL version

Closure condition

Allow different process scripts for different instruments in a mission.

jtniehof commented 3 years ago

Existing master supports instruments (the hierarchy is that every product has an instrument, every instrument has a satellite, and every satellite has a mission). There are a few places where lookup can be done by instrument; these can definitely be expanded.
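For reference, a rough sketch of that hierarchy; these are illustrative dataclasses, not the actual dbprocessing ORM classes or column names.

```python
# Illustrative sketch of the product -> instrument -> satellite -> mission
# hierarchy described above; names are simplified, not the real schema.
from dataclasses import dataclass

@dataclass
class Mission:
    mission_id: int
    mission_name: str

@dataclass
class Satellite:
    satellite_id: int
    mission_id: int      # every satellite belongs to a mission

@dataclass
class Instrument:
    instrument_id: int
    satellite_id: int    # every instrument belongs to a satellite

@dataclass
class Product:
    product_id: int
    instrument_id: int   # every product belongs to an instrument
```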

What can't happen on master is running two simultaneous instances of ProcessQueue, each of which only works on a single instrument. This is because there is no guarantee or requirement that a given instrument's products depend only on that instrument. Say instrument A's product 1 combines with instrument B's product 1 to make product 2 (which would have to have an instrument associated with it, but which one doesn't matter much). Then if there are two separate instances of ProcessQueue, there's a synchronization problem: A1 and B1 both show up, both get processed, and product 2 gets made twice.
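A contrived sketch of that duplicate-build hazard; the names and structure are invented for illustration, not how ProcessQueue is actually written.

```python
# Two per-instrument queue runners both notice that product 2's inputs
# (A1 from instrument A, B1 from instrument B) are satisfied, and both
# schedule the same build. Everything here is hypothetical.
available = {"A1", "B1"}   # files that have arrived
built = []                 # builds that were launched

def runner(instrument, triggering_file):
    # Each runner only reacts to its own instrument's file, but the
    # combined product's inputs span both instruments.
    if {"A1", "B1"} <= available:
        built.append(f"product2 (triggered by {instrument}:{triggering_file})")

runner("A", "A1")   # instrument A's queue sees A1 arrive
runner("B", "B1")   # instrument B's queue sees B1 arrive
print(built)        # product2 scheduled twice without cross-queue coordination
```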

There's also a single processqueue table that just has the file_id to process and any version increments requested. I think the sndd branch adds an instrument ID to essentially split the process queue by instrument.
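An approximate sketch of that table in SQLAlchemy terms; the column names are guesses rather than the exact dbprocessing schema, and the instrument_id column represents the reported sndd-branch addition.

```python
# Approximate sketch, not the exact dbprocessing schema.
from sqlalchemy import Column, Integer, MetaData, Table

metadata = MetaData()

processqueue = Table(
    "processqueue", metadata,
    Column("file_id", Integer, primary_key=True),    # file waiting to be processed
    Column("version_bump", Integer, nullable=True),  # requested version increment
    # Hypothetical addition to split the queue per instrument:
    Column("instrument_id", Integer, nullable=True),
)
```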

In the past we've set up any completely noninteracting chains of dependencies simply as separate databases. I know there's some overhead to that once you're getting into setting up a Postgres database instead of just an SQLite file, particularly if you don't have control of the database server. I think the better place to do separation within a single database might be at the mission level. That's going to take some work to really define those guarantees and restrictions (e.g. refusal to create a new productprocesslink that would cross missions), since much of the work to date has been one mission per db as well. I know what's going on in the sndd branch is working for that use case, but it's a fairly substantial change to the design philosophy, so it would break other use cases. We did design this with the idea that there's a single ProcessQueue at a time that has the whole state in its mind, and with access to all the data.
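A hypothetical guard illustrating the kind of restriction mentioned above (refusing a cross-mission productprocesslink); all names here are invented for the sketch and are not existing dbprocessing calls.

```python
# Hypothetical check, not an existing dbprocessing function. The mission of
# each side would be found by walking product -> instrument -> satellite ->
# mission before inserting the productprocesslink row.
def check_product_process_link(input_mission_id, output_mission_id):
    """Raise if linking these products would cross mission boundaries."""
    if input_mission_id != output_mission_id:
        raise ValueError(
            "productprocesslink would cross missions: "
            f"{input_mission_id} != {output_mission_id}")

check_product_process_link(1, 1)    # same mission: allowed
# check_product_process_link(1, 2)  # would raise ValueError
```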

I can think of two reasons for supporting this sort of thing. One is just to speed up the ProcessQueue work itself by multiprocessing (as opposed to the time spent in processing the data). I think there's a lot of scope to just plain make ProcessQueue's operations faster. There may also be good opportunities for parallelizing within a single ProcessQueue instantiation (e.g. by much more clever graph-based analysis of non-interacting decisions).
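One sketch of what that graph-based analysis might look like: build a dependency graph of products and split it into connected components, since components share no products and could in principle be handled in parallel. The product IDs and edges below are made up for illustration.

```python
# Sketch only: group products into non-interacting sets via connected
# components of a dependency graph. Not existing dbprocessing code.
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into components; an edge means two products interact."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        components.append(comp)
    return components

# A1 and B1 both feed P2, so they form one component; C1 -> C2 is an
# independent chain that could be worked on concurrently.
nodes = {"A1", "B1", "P2", "C1", "C2"}
edges = [("A1", "P2"), ("B1", "P2"), ("C1", "C2")]
print(connected_components(nodes, edges))
```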

The other is that you might have data that are only available on one machine and other data only available on a different machine, so a single ProcessQueue isn't even practicable. That does get back to maybe it makes more sense to have different databases, or at least different missions, if the data never interact. Again we were sort of thinking in terms of a network drive setup where the ProcessQueue has access to everything.

More implications of this are discussed in #58.

dnadeau-lanl commented 3 years ago

Usually each satellite and instrument are independent. I don't see a case where there is a cross-mission or cross-instrument dependency. It's really raw->level0->level1.

As well, I usually do the parallelism at the script level; I don't rely on dbp for that. I rely on dbp to launch different processing for different instruments/missions. There might be some load-balancing issues when a machine is overwhelmed, but right now we handle that manually with different instances of dbp on different machines, sharing the same database.

ProcessQueue should have access to everything through a network drive, etc.

balarsen commented 3 years ago

> Usually each satellite and instrument are independent. I don't see a case where there is a cross-mission or cross-instrument dependency. It's really raw->level0->level1.

Typically, but not always. One SABRS example would do this, if we ever get that far.

This mixing is done in ECT for the magephem files.