spikelynch / CoataGlue

ANDS DC18 common codebase for data crosswalk and publication

Data ingest testing and IDs #15

Open spikelynch opened 11 years ago

spikelynch commented 11 years ago

Testing is a pain because the relationship between data sources, dataset IDs and redbox ingest files is not well thought out.

The coataglue script can be made to re-harvest datasets it has already seen, but the redbox XML files it produces reuse the existing IDs, so they won't be ingested: once redbox has seen a filename and harvested it, it will not revisit it.

What the system needs to do is increment the ID in the history file every time, even for a dataset it has seen before, so that every redbox output file has a unique filename even when the source dataset is the same.
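
A rough sketch of the always-increment idea, in Python rather than the project's own code. Everything here is an assumption for illustration: the function name `next_output_name`, the JSON history file and the `<source>_<counter>.xml` filename pattern are all hypothetical and are not how coataglue actually stores its history.

```python
import json
import tempfile
from pathlib import Path


def next_output_name(history_path: Path, source_key: str) -> str:
    """Return a fresh output filename for source_key (hypothetical helper)."""
    # Load the per-source counters, or start empty on the first run.
    history = json.loads(history_path.read_text()) if history_path.exists() else {}
    # Increment unconditionally, even when this source has been seen before.
    count = history.get(source_key, 0) + 1
    history[source_key] = count
    history_path.write_text(json.dumps(history))
    # Assumed filename pattern: source key plus a zero-padded counter.
    return f"{source_key}_{count:04d}.xml"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        history_file = Path(tmp) / "history.json"
        first = next_output_name(history_file, "dataset_a")
        second = next_output_name(history_file, "dataset_a")
        # Same source harvested twice, two different filenames.
        print(first, second)
```

Because the counter is bumped on every call, re-harvesting the same source in a test run yields a filename redbox has not seen, so it should pick the file up again.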

spikelynch commented 10 years ago

Added a prefix setting so that I can ensure datasets are unique. This is a hack.
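
For illustration only, a prefix workaround of this kind might look like the Python sketch below. The function `make_dataset_id` and the `prefix` + number layout are assumptions, not the actual setting or ID format used in coataglue.

```python
def make_dataset_id(prefix: str, number: int) -> str:
    """Build a dataset ID from a configurable prefix (hypothetical format)."""
    # Changing the prefix between test runs makes every ID differ from
    # the IDs produced in earlier runs, even if the numbers repeat.
    return f"{prefix}{number}"


if __name__ == "__main__":
    print(make_dataset_id("test1_", 1))
    print(make_dataset_id("test2_", 1))
```

This only sidesteps the problem: uniqueness depends on whoever runs the tests remembering to change the prefix.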