spacepy / dbprocessing

Automated processing controller for heliophysics data

Processing all files in an input directory #57

Open dnadeau-lanl opened 3 years ago

dnadeau-lanl commented 3 years ago

My suggested enhancement is ...

Some scripts perform better when processing all files found in an input directory. For example, to process raw data to level_0, I usually specify --input-dir and --output-dir. The script being called processes all input files found in --input-dir and creates daily files in --output-dir. (Often the output directory is the mission's incoming directory, so the files can then be processed to level_1.)
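
A minimal sketch of that per-directory behavior (the filename convention and directory path are illustrative only, not the actual ubet code):

```python
import glob
import os
from collections import defaultdict

def group_by_day(input_dir, pattern='*.h5'):
    """Group all files in input_dir by a YYYYMMDD field assumed to be
    embedded in the filename (the naming convention is an assumption)."""
    daily = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
        name = os.path.basename(path)
        # e.g. sabrs_raw_20230115_v1.0.0.h5 -> pick out the date field
        date = next((f for f in name.split('_') if len(f) == 8 and f.isdigit()),
                    'unknown')
        daily[date].append(path)
    return daily

if __name__ == '__main__':
    for day, files in group_by_day('./dbincoming/sabrs/ingested').items():
        print(day, len(files), 'files')
```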

Proposed enhancement

ubet_wrapper_sage.py  ....  -o ./level_1/  -i ./dbincoming/sabrs/ingested -r 1 in.h5 out.h5

Right now I just discard the in.h5 and out.h5 files that dbprocessing passes. Since this script processes all files in the directory, the ProcessQueue table is cleaned up for that product/process after success.

Alternatives

ubet can process one file at a time and aggregate to a specific day. We need to make sure that files are sorted in time of arrival in the HDF5 (so the time variable needs to be sorted properly).
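
For illustration, a sketch of what that sorting could look like with h5py, assuming the time variable is a 1-D dataset at the file root (the name Epoch is an assumption):

```python
import h5py
import numpy as np

def sort_by_time(filename, time_var='Epoch'):
    """Reorder every root-level dataset whose leading dimension matches the
    time variable so records are ascending in time (dataset name is an
    assumption for this sketch)."""
    with h5py.File(filename, 'r+') as f:
        order = np.argsort(f[time_var][:])
        for name, dset in f.items():
            if (isinstance(dset, h5py.Dataset) and dset.shape
                    and dset.shape[0] == order.size):
                dset[...] = dset[:][order]
```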

Version of dbprocessing

LANL version

Discussion

Maybe we can add a flag for directory processing in the database.

jtniehof commented 3 years ago

Usually I treat anything that's not about single files as something that happens before the dbp ingest. We had that on both ECT and PSP, where there's some messiness in the outside world that gets handled by special mission-specific scripts to get things into a nice dbp-friendly organization. In particular, if dbp throws a whole directory at something without knowing the files in that directory, there are implications for provenance: there is no record of all the input files.

I can walk you through how those scripts work and maybe it will suggest some ideas. I definitely want to genericize those and pull them into the dbprocessing project, even if they don't run as part of ProcessQueue.

Your last note:

We need to make sure that files are sorted in time of arrival in the HDF5

does suggest a fairly straightforward enhancement. We've so far required that codes accept their input arguments in arbitrary order. We could work on defining an order and guaranteeing it (e.g. by utc_start_time?)
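
As a sketch of what that guarantee could look like (the start_times mapping is a stand-in; in dbp this information would come from the file table, which is an assumption here):

```python
import datetime

def order_inputs(files, start_times):
    """Order input filenames by their utc_start_time before building the
    command line; start_times maps filename -> datetime."""
    return sorted(files, key=lambda f: start_times[f])

files = ['b.h5', 'a.h5']
start_times = {'a.h5': datetime.datetime(2020, 1, 1, 0, 0),
               'b.h5': datetime.datetime(2020, 1, 1, 12, 0)}
print(order_inputs(files, start_times))  # ['a.h5', 'b.h5']
```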

jtniehof commented 3 years ago

Now that I'm thinking about it (and we're discussing), I think the input side might work pretty well in dbp's current file-based approach. A process could be defined as taking all of the available input files of a particular product (although some way of limiting this in time would be interesting, maybe defining it as "all for a month" or something). dbp could then make a temporary directory, either copy or symlink the files into that directory, and pass the directory to the code. For products where it takes multiple files to span a day, this still runs into the issue that dbp assumes the tuple of (utc_file_date, product, version) is unique.
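
A rough sketch of that staging step (the prefix and the copy/symlink choice are illustrative; nothing here is existing dbp code):

```python
import os
import shutil
import tempfile

def stage_inputs(input_files, use_symlinks=True):
    """Create a temporary directory holding the selected input files, as
    symlinks or copies, and return its path so it can be handed to the
    code as a single directory argument.  The caller would remove the
    directory (shutil.rmtree) after the code finishes."""
    tmpdir = tempfile.mkdtemp(prefix='dbp_inputs_')
    for path in input_files:
        dest = os.path.join(tmpdir, os.path.basename(path))
        if use_symlinks:
            os.symlink(os.path.abspath(path), dest)
        else:
            shutil.copy2(path, dest)
    return tmpdir
```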

Something similar is probably possible on the output side: give the code an output directory and rely more on the inspector to figure out all the details of the files that were made. It would be tricky to make sure output files are distinguished from temporary files, and also to get the versioning passed to the code (since dbp determines the version and that goes in with the filename information).
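
For the output side, a sketch of what "let the inspector sort it out" might involve, assuming temporary files can be recognized by naming convention (both patterns below are assumptions; the real check would be the product's inspector):

```python
import fnmatch
import os

def collect_outputs(output_dir, product_pattern='*_level_1_*.h5',
                    temp_patterns=('*.tmp', '.*')):
    """Return the files in output_dir that look like real products,
    skipping anything matching a temporary-file pattern."""
    outputs = []
    for name in sorted(os.listdir(output_dir)):
        if any(fnmatch.fnmatch(name, pat) for pat in temp_patterns):
            continue
        if fnmatch.fnmatch(name, product_pattern):
            outputs.append(os.path.join(output_dir, name))
    return outputs
```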

I think the directory-based input and directory-based output should be separate options.

jtniehof commented 3 years ago

One other option in the short term: if the input directory files are all at the "beginning" of the chain (i.e. not created by dbp), there is support now for processes with no inputs. Then dbp simply would not know about the input files; the code is run (by DBRunner), and then the output files appear and get ingested into dbp.