Open BrendenBarbour opened 4 years ago
(ingest_raw_data.py) Write out a metadata file for each zip inside the processed folder, only if everything finishes (one metadata file for raw and one for labels). Note this new output on the workflow step in the wiki @CarolinaFurtado
RAK: "Currently we're only checking whether a corresponding directory exists in processed data.
One step beyond that would be counting the files in raw (may need to unzip) and comparing to the number in processed.
Two steps beyond would be file name checks. Then basically write the check out as metadata in processed (what was checked and the results).
If this metadata exists in processed, then skip the check.
Basically, if we have to unzip the raw each time to compare contents, then we definitely want to skip if it's been done before.
The metadata will have a timestamp, and we can always manually ensure that the processed folder hasn't been modified after the metadata check confirmation was created, if we're unsure whether somebody accidentally deleted something.
The contents comparison is primarily to ensure nothing was half ingested."
Currently, no check verifies that every file in raw-data makes it into processed data. Adding one would make the system more robust.
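
A rough sketch of the check RAK describes, not the actual `ingest_raw_data.py` code. It makes several assumptions that may not hold for this repo: each raw zip maps to one processed directory, processed files keep the same base names as the files inside the zip, and base names are unique within a zip. The function `check_ingest`, the file name `ingest_check.json`, and the metadata fields are all hypothetical. The zip's file list is read from its index, so nothing is actually extracted.

```python
import json
import zipfile
from datetime import datetime, timezone
from pathlib import Path

META_NAME = "ingest_check.json"  # hypothetical metadata file name


def check_ingest(raw_zip: Path, processed_dir: Path) -> dict:
    """Compare a raw zip's file list against a processed directory.

    Assumes processed files keep the base names of the raw files.
    """
    meta_path = processed_dir / META_NAME
    if meta_path.exists():
        # A previous check already passed, so skip reopening the zip.
        return json.loads(meta_path.read_text())

    with zipfile.ZipFile(raw_zip) as zf:
        # Names come from the zip index; no extraction is needed.
        raw_names = {
            Path(info.filename).name
            for info in zf.infolist()
            if not info.is_dir()
        }

    processed_names = {
        p.name
        for p in processed_dir.rglob("*")
        if p.is_file() and p.name != META_NAME
    }

    missing = sorted(raw_names - processed_names)
    result = {
        "raw_zip": raw_zip.name,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "checks": ["file_count", "file_names"],
        "raw_count": len(raw_names),
        "processed_count": len(processed_names),
        "missing": missing,
        "passed": (
            processed_dir.is_dir()
            and not missing
            and len(raw_names) == len(processed_names)
        ),
    }
    if result["passed"]:
        # Only write metadata when everything finished, so a failed or
        # half-ingested zip is re-checked on the next run.
        meta_path.write_text(json.dumps(result, indent=2))
    return result
```

Writing the metadata only on success means its presence doubles as the "already verified" flag, and the `checked_at` timestamp can be compared by hand against the processed folder's modification times if there is doubt about later deletions.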