mit-quest / necstlab-damage-segmentation

MIT License

Compare Raw Data Files to Processed Data Check via Metadata #68

Open BrendenBarbour opened 4 years ago

BrendenBarbour commented 4 years ago

Currently, no check verifies that every file in raw-data makes it into processed data. Adding one would make the system more robust.

BrendenBarbour commented 4 years ago

Useful StackOverflow article. https://stackoverflow.com/questions/31124670/how-to-programmatically-count-the-number-of-files-in-an-archive-using-python
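For reference, a minimal sketch of counting the files in a zip without extracting it, using only the standard-library `zipfile` module. The function name `count_files_in_zip` is hypothetical and not part of this repo; directory entries are excluded on the assumption that only actual data files should be counted.

```python
import zipfile


def count_files_in_zip(zip_path):
    """Count regular files in a zip archive without extracting it.

    Directory entries (names ending in "/") are skipped so that only
    actual files contribute to the count.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())
```

This reads only the archive's central directory, so it should stay cheap even for large stacks.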

rak5216 commented 3 years ago

(ingest_raw_data.py) Write out a metadata file for each zip inside the processed folder, only if everything finishes (one metadata file for raw and one for labels). @CarolinaFurtado, please note this new output in the workflow step on the wiki.
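A rough sketch of what that metadata output could look like, assuming a JSON file per zip. Everything here is illustrative: the function name `write_ingest_metadata`, the `ingest_metadata_<kind>.json` file name, and the recorded fields are assumptions, not what `ingest_raw_data.py` currently does. The file is written to a temporary name and then renamed, so a metadata file only appears once it is complete.

```python
import json
import os
from datetime import datetime, timezone


def write_ingest_metadata(processed_dir, source_zip, kind, file_names):
    """Record a finished ingest; call only after every file was processed.

    kind is "raw" or "labels" (one metadata file for each).
    """
    metadata = {
        "source_zip": os.path.basename(source_zip),
        "kind": kind,
        "file_count": len(file_names),
        "file_names": sorted(file_names),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    path = os.path.join(processed_dir, f"ingest_metadata_{kind}.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(metadata, f, indent=2)
    # Rename last, so a half-written file is never mistaken for a finished one.
    os.replace(tmp_path, path)
    return path
```

Calling it as the very last step of the ingest means an interrupted run leaves no metadata behind.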

CarolinaFurtado commented 3 years ago

RAK: "Currently we're only checking whether a corresponding directory exists in processed data.

One step beyond would be counting the files in raw (may need to unzip) and comparing that to the number in processed.

Two steps beyond would be file name checks. Then, basically, write a metadata file to processed that records the check (what was checked and the results).

If this metadata exists in processed, then skip the check.

Basically, if we have to unzip the raw each time to compare contents, then we definitely want to skip if it's been done before.

The metadata will have a timestamp, and we can always manually ensure that the processed folder hasn't been modified after the metadata check confirmation was created, if we're unsure whether somebody accidentally deleted something.

The contents comparison is primarily to ensure nothing was half ingested."
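The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `check_ingest` and the `ingest_check.json` file name are hypothetical, and it assumes processed files sit directly in the processed folder under the same base names they have inside the zip, which may not match the real layout.

```python
import json
import os
import posixpath
import zipfile
from datetime import datetime, timezone


def check_ingest(raw_zip, processed_dir, metadata_name="ingest_check.json"):
    """Compare file names in raw_zip against processed_dir.

    If a check record already exists in processed_dir, return it and
    skip re-reading the zip.
    """
    meta_path = os.path.join(processed_dir, metadata_name)
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            return json.load(f)  # checked before, skip

    # Zip member names always use "/" separators, hence posixpath.
    with zipfile.ZipFile(raw_zip) as archive:
        raw_names = {posixpath.basename(info.filename)
                     for info in archive.infolist() if not info.is_dir()}
    processed_names = {name for name in os.listdir(processed_dir)
                       if os.path.isfile(os.path.join(processed_dir, name))}

    result = {
        "raw_count": len(raw_names),
        "processed_count": len(processed_names),
        "missing": sorted(raw_names - processed_names),
        "passed": raw_names <= processed_names,
        "checked_utc": datetime.now(timezone.utc).isoformat(),
    }
    # Only a passing check is recorded, so a half-ingested folder gets re-checked.
    if result["passed"]:
        with open(meta_path, "w") as f:
            json.dump(result, f, indent=2)
    return result
```

Comparing `checked_utc` against the folder's modification time stays a manual step here, as in the discussion above.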