spacepy / dbprocessing

Automated processing controller for heliophysics data

New linkUningested script: find files not in database and symlink for import #54

Closed jtniehof closed 3 years ago

jtniehof commented 3 years ago

This PR adds a new script which searches for files that are in their final location and match the filename format of a valid product, but have no records in the database. It then makes a symbolic link from that file into the incoming directory.

It's worth explaining the dbp feature that's behind this. Normally files are put into the incoming directory and dbprocessing puts them in their final location. But it's handy to be able to synchronize an entire tree (e.g. via rsync) and set up dbprocessing so that the "final location" of a file is exactly the place where it gets synced. Then dbp doesn't need to move it, but definitely needs to ingest it so it's aware of the file. The way to do this is to make a symbolic link in incoming that points to the file; dbprocessing then does the ingest and removes the symlink, leaving the file in place. Right now the scripts that do this are mission-specific, but I hope to have a generic one at some point.

If, however, one has messed up the normal sync and needs to figure out what's not been ingested, the script in this PR comes into play.
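To make the idea concrete, here is a minimal sketch of the search-and-link step described above. It is not the script from this PR and uses no dbprocessing API: the function names `find_uningested` and `link_to_incoming`, the `known_files` set (a stand-in for whatever database query reports already-ingested filenames), and the `dry_run` flag are all assumptions made for illustration.

```python
import os
import re


def find_uningested(final_root, filename_regex, known_files):
    """Return sorted paths under final_root whose names match the product
    pattern but are absent from known_files.

    known_files is a hypothetical stand-in for a database lookup.
    """
    pattern = re.compile(filename_regex)
    missing = []
    for dirpath, _dirnames, filenames in os.walk(final_root):
        for name in filenames:
            # Only consider files that look like a valid product.
            if pattern.fullmatch(name) and name not in known_files:
                missing.append(os.path.join(dirpath, name))
    return sorted(missing)


def link_to_incoming(paths, incoming_dir, dry_run=False):
    """Symlink each path into incoming_dir and return the links created.

    With dry_run=True nothing is created; the paths are only reported.
    """
    linked = []
    for path in paths:
        if dry_run:
            print("not in database:", path)
            continue
        link = os.path.join(incoming_dir, os.path.basename(path))
        # Skip names already present in incoming, including dangling links.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(path), link)
            linked.append(link)
    return linked
```

Splitting the search from the linking keeps the report-only case trivial: call `find_uningested` and pass the result to `link_to_incoming` with `dry_run=True`.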

PR Checklist

jtniehof commented 3 years ago

I just used this to dig out a bit of a mess on PSP, so it's been used in production now.

dnadeau-lanl commented 3 years ago

Normally I sync in the incoming dir and db moves it to the final location. Why do you sync in the final location?

balarsen commented 3 years ago

> Normally I sync in the incoming dir and db moves it to the final location. Why do you sync in the final location?

I end up doing this for the interaction of multiple chains. Chain 1 creates a file that chain 2 needs.

jtniehof commented 3 years ago

Anything where you don't control the input, so in the particle processing world it's magnetic fields for pitch angles most commonly.

If I run rsync, my local directory has exactly the same files (and directory structure) as the remote. If I sync to incoming and there are 100 files, 100 files go into incoming. Then they get ingested and moved. Somebody on the other side makes 100 new files. Now I sync to incoming and there are 200 files, 100 of them duplicates (which I had to pull over the network even though I already had them in the final location). Bad stuff happens.

If I sync to final location and just symlink the new stuff, then in the case above I ingest 100 files the first time, and just the 100 new ones the second time.
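The counts in this example can be checked with a toy model. The sketch below only does set arithmetic on filenames; it does not call rsync or dbprocessing, and the `simulate` function and its `strategy` argument are hypothetical names. It assumes incoming is empty before each sync, because ingested files have been moved out of it.

```python
def simulate(strategy, rounds):
    """Return how many files are transferred (and then ingested) per sync.

    strategy: "incoming" syncs the remote tree into incoming;
              "final" syncs it into the final location.
    rounds:   list of sets, the new files appearing on the remote each round.
    """
    remote, final, counts = set(), set(), []
    for new_files in rounds:
        remote |= new_files
        if strategy == "incoming":
            # Incoming was emptied by the previous ingest, so the sync sees
            # nothing local and pulls the whole remote tree again.
            transferred = set(remote)
        else:
            # The sync compares against the final location, so only files
            # not already there are pulled and symlinked for ingest.
            transferred = remote - final
        final |= transferred
        counts.append(len(transferred))
    return counts


first = {f"file{i:03d}" for i in range(100)}
second = {f"file{i:03d}" for i in range(100, 200)}
print(simulate("incoming", [first, second]))  # [100, 200]
print(simulate("final", [first, second]))     # [100, 100]
```

The second number in each list is the point of the comment: syncing to incoming re-pulls the 100 files that were already ingested, while syncing to the final location pulls only the 100 new ones.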

jtniehof commented 3 years ago

This can also be useful even if you don't use symlink-to-incoming, just as a check that files haven't sprouted in your managed tree without being made by dbp (the symlink step is optional; the script can also just report what it finds).