spacepy / dbprocessing

Automated processing controller for heliophysics data

Blocking process while one is running. #58

Open dnadeau-lanl opened 3 years ago

dnadeau-lanl commented 3 years ago

My suggested enhancement is ...

While raw data files are being processed, other files could be synced into the incoming directory. Ingesting and processing those new files concurrently could corrupt the results.

This is related to #57 for ubet processing all files in a directory. We cannot launch a new ubet process until the first one has finished; a second ubet process would step on the first, creating useless, messed-up HDF5 level_0 files.

Proposed enhancement

Block concurrency for certain processes.

One option is a flag in the database that blocks concurrency. At LANL we created a new table called processpidlink to check whether a process is already running before launching a new one.

One issue is that the table is sometimes not reset properly and the flag remains set to true, for example if the machine is rebooted. (This makes it hard to figure out why a new process is not starting when new files are synced.)
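The stale-flag problem can be avoided by recording the PID along with the flag and checking whether that PID is still alive before refusing to start. A minimal sketch (the lock-file path and function names are hypothetical, not actual dbprocessing or processpidlink code; a file stands in for the database table):

```python
import os

LOCKFILE = "/tmp/ubet_ingest.pid"  # hypothetical stand-in for the DB flag


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without killing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, just owned by another user
    return True


def acquire_lock():
    """Return True if this instance may start; clear stale locks.

    A flag left over from before a reboot points at a dead PID,
    so it is removed instead of blocking forever.
    """
    if os.path.exists(LOCKFILE):
        with open(LOCKFILE) as f:
            old_pid = int(f.read().strip())
        if pid_is_alive(old_pid):
            return False  # a real instance is still running
        os.remove(LOCKFILE)  # stale entry, e.g. after a reboot
    with open(LOCKFILE, "w") as f:
        f.write(str(os.getpid()))
    return True
```

The same liveness check works against a processpidlink-style table: store the PID, and treat a row whose PID no longer exists as stale rather than as "running".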

Alternatives

Right now I use "flock" to prevent a process from starting while another is running:

/usr/bin/flock -n $DIR/sage_senser_proc_ingest300.lck -c '. ~/.crontab.sh; python $SCRIPT/ProcessQueue_new.py -i -m sndd -t sage'

This ensures that no second ingestion can start for instrument "sage" while one is running.
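The same non-blocking lock can be taken from inside a Python script with fcntl.flock, which removes the need for the wrapper command (the lock-file path here is illustrative; flock releases automatically when the process exits, so a crash never leaves a stale lock):

```python
import fcntl
import sys

# Open (or create) the lock file; the handle must stay open for the
# lifetime of the process so the kernel keeps the lock held.
lockfile = open("/tmp/sage_ingest.lck", "w")
try:
    # LOCK_NB makes this fail immediately instead of waiting,
    # mirroring flock's -n switch.
    fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit("another ingest is already running")

# ... run the ingest here; the lock is released when the process exits
```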

Version of dbprocessing

LANL version

Closure condition

This issue should be closed when: a process cannot start until the previous one has finished.

balarsen commented 3 years ago

@dnadeau-lanl, this is doable now with -P 1 in the ProcessQueue.py call, but I can certainly see broader utility. It's semi-related to the cpu and ram fields in the code table: those were supposed to allow dynamic decisions about how many processes to run. Never implemented, but that was the intent.

jtniehof commented 3 years ago

On master, only one instance of ProcessQueue can run per database. There's a "logging" table that sets a currently_processing flag, which is checked at DButils init. There's a clearProcessingFlag script that clears the flag in case ProcessQueue exits without resetting it. I think this got axed on the sndd branch; I'm not sure there's any locking of ProcessQueue there (maybe by instrument?).

@balarsen is referring to the -n switch, which restricts the total number of external processing processes that ProcessQueue (really runMe) will spawn at once. Normally each process counts as 1 (and the default is to have 2 running at once).

But the code table has RAM and CPU columns to mark "heavy" processes. So if you specify -n 8 and have three codes of CPU 2 and two codes of CPU 1 running, the idea is that they count as the full 8 and ProcessQueue wouldn't spawn any more. The RAM column would operate similarly: the total of the CPU column across all running codes, and the total of the RAM column across all running codes, would each have to stay within the number given to -n. Again, not implemented right now, but it wouldn't be too hard to do.
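The weighted-count idea above reduces to a simple budget check before each spawn. A minimal sketch (function and variable names are illustrative, not the actual ProcessQueue/runMe implementation; the same check would be run once with CPU weights and once with RAM weights):

```python
def can_launch(running_weights, new_weight, limit):
    """Return True if launching a code with this weight keeps the
    summed weight of all running codes within the -n limit."""
    return sum(running_weights) + new_weight <= limit


# Three codes of CPU 2 and two codes of CPU 1 exactly fill an -n 8
# budget, so nothing more may spawn:
running = [2, 2, 2, 1, 1]
print(can_launch(running, 1, 8))  # -> False: budget is already full
print(can_launch([2, 2], 2, 8))  # -> True: 6 of 8 used after launch
```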

I think you're talking about a third restriction: making sure any particular process runs only once at any given time, because the code conflicts with itself. Enforcing that globally would ruin current ECT and PSP processing, which are often chewing through years' worth of data of the same type and need multiple instances of the same code running at once. The design was really based around multiple input files feeding a single output file. Marking individual processes as singletons is probably a possibility, but it might not be necessary. I think this is an outgrowth of #57, and there may be other ways to address that use case; will continue there.

Incidentally, I keep referring to all these design decisions and assumptions; getting them documented, rather than kept just in my head, is definitely on the agenda. There's not been much of a place to go and find them.

balarsen commented 3 years ago

I do handle this "conflicts with itself" issue on Van Allen Probes. IDL codes are notorious for it: MagEIS reads its inputs from a fixed filename. To make this work, I create the fixed filenames in temp directories and point the IDL code at them, and then it all works. In fact some of the codes also require the input data files to be in that directory, so I copy them into the temp dir as well. This is all done in the code called from inside the chain.
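The fixed-filename workaround amounts to: copy each input into a fresh temp directory under the name the code expects, then run the code with that directory as its working directory, so concurrent instances never collide. A minimal sketch (the function and its arguments are illustrative, not the actual Van Allen Probes wrapper):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_isolated(inputs, fixed_names, command):
    """Run `command` in a private temp dir containing the inputs.

    Each input file is copied in under the fixed filename the code
    expects, so two instances of the same code never share files.
    Returns the command's stdout; the temp dir is removed afterwards.
    """
    with tempfile.TemporaryDirectory() as tmp:
        for src, name in zip(inputs, fixed_names):
            shutil.copy(src, Path(tmp) / name)
        result = subprocess.run(command, cwd=tmp, check=True,
                                capture_output=True, text=True)
        return result.stdout
```

Because each invocation gets its own directory, the "fixed" filename is fixed only within that instance's sandbox, which is exactly why the isolated-temp-directory execution model helps here.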

jtniehof commented 3 years ago

Yes, the fact that the code is run in an isolated temp directory helps a lot with codes that might otherwise not play well.