terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data

Build lists of dataset IDs to be submitted for extraction #212

Closed by max-zilla 7 years ago

max-zilla commented 7 years ago

To complete the bulk extraction of older datasets, we need to build, for each extractor, a list of the dataset UUIDs it has not yet processed. These lists are built directly from the Mongo database and written to files.

Example Mongo query (demosaic extractor): find all datasets with no metadata created by the demosaic extractor.

// Find all stereoTop datasets with no metadata created by the demosaic extractor.
db.datasets.find({"name": /stereoTop.*/}).forEach( function(doc) {
    var found = false;
    var extractorname = "terra.demosaic";
    // Look for any metadata record attached to this dataset by that extractor.
    db.metadata.find({"attachedTo._id": doc._id, "creator.name": extractorname}).forEach( function(subdoc) {
        found = true;
    });
    if (!found) {
        // Print the bare hex ID (.str) so the shell output can be captured into a list file.
        print(doc._id.str);
    }
});
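
To turn this query into the per-extractor list files described above, one option (not stated in the issue) is to save it as a standalone .js script and run it non-interactively against the Clowder database, redirecting the printed IDs into a file, e.g. `mongo clowder list_missing_demosaic.js > stereoTop_demosaic_missing.txt` (the script and output file names here are hypothetical).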

Extractors with lists to be generated:

Once these lists are generated, we will iterate through each one and submit each dataset to Clowder for extraction with a call like:

POST http://terraref.ncsa.illinois.edu/clowder/api/datasets/<UUID>/extractions?key=<SECRET>
data='{"extractor":"terra.environmentlogger"}'
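
A minimal sketch of that submission loop, assuming one dataset UUID per line in a generated list file; the file name, API key handling, and extractor name below are placeholders, while the endpoint and JSON payload are the ones shown above:

import requests

# Assumed inputs: a list file produced by the Mongo query (one UUID per line)
# and a Clowder API key; both values are placeholders for illustration.
CLOWDER_API = "http://terraref.ncsa.illinois.edu/clowder/api"
API_KEY = "SECRET"
LIST_FILE = "stereoTop_demosaic_missing.txt"
EXTRACTOR = "terra.demosaic"

with open(LIST_FILE) as f:
    dataset_ids = [line.strip() for line in f if line.strip()]

for ds_id in dataset_ids:
    # POST /api/datasets/<UUID>/extractions?key=<SECRET> with the extractor
    # name as the JSON body, mirroring the call shown above.
    resp = requests.post(
        "%s/datasets/%s/extractions" % (CLOWDER_API, ds_id),
        params={"key": API_KEY},
        json={"extractor": EXTRACTOR},
    )
    print(ds_id, resp.status_code)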

I'm generating these into /home/mburnet2/extractor_batch on the terra production VM and moving them locally for now.

max-zilla commented 7 years ago

Generated these into files.