terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data

Build lists of dataset IDs to be submitted for extraction #212

Closed by max-zilla 7 years ago

max-zilla commented 7 years ago

To complete the bulk extraction of older datasets, we need to build, for each extractor, a list of the dataset UUIDs it has not yet processed. These lists are built directly from the Mongo database and written to files.

Example Mongo query (demosaic extractor): find all datasets with no metadata created by the demosaic extractor.

// Find all stereoTop datasets with no metadata created by the demosaic extractor.
db.datasets.find({"name": /stereoTop.*/}).forEach( function(doc) {
    var found = false;
    var extractorname = "terra.demosaic";
    // Look for any metadata record attached to this dataset by that extractor.
    db.metadata.find({"attachedTo._id": doc._id, "creator.name": extractorname}).forEach( function(subdoc) {
        found = true;
    });
    if (!found) {
        // Print the bare hex ID (.str) so the shell output can be captured into a list file.
        print(doc._id.str);
    }
});
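
To turn this query into the per-extractor list files described above, one option (not stated in the issue) is to save it as a standalone .js script and run it non-interactively against the Clowder database, redirecting the printed IDs into a file, e.g. `mongo clowder list_missing_demosaic.js > stereoTop_demosaic_missing.txt` (the script and output file names here are hypothetical).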

Extractors with lists to be generated:

Once these lists are generated, we will iterate through each one and submit each dataset to Clowder for extraction with a call like:

POST http://terraref.ncsa.illinois.edu/clowder/api/datasets/<UUID>/extractions?key=<SECRET>
data='{"extractor":"terra.environmentlogger"}'
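
A minimal sketch of that submission loop, assuming one dataset UUID per line in a generated list file; the file name, API key handling, and extractor name below are placeholders, while the endpoint and JSON payload are the ones shown above:

import requests

# Assumed inputs: a list file produced by the Mongo query (one UUID per line)
# and a Clowder API key; both values are placeholders for illustration.
CLOWDER_API = "http://terraref.ncsa.illinois.edu/clowder/api"
API_KEY = "SECRET"
LIST_FILE = "stereoTop_demosaic_missing.txt"
EXTRACTOR = "terra.demosaic"

with open(LIST_FILE) as f:
    dataset_ids = [line.strip() for line in f if line.strip()]

for ds_id in dataset_ids:
    # POST /api/datasets/<UUID>/extractions?key=<SECRET> with the extractor
    # name as the JSON body, mirroring the call shown above.
    resp = requests.post(
        "%s/datasets/%s/extractions" % (CLOWDER_API, ds_id),
        params={"key": API_KEY},
        json={"extractor": EXTRACTOR},
    )
    print(ds_id, resp.status_code)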

I'm generating these into /home/mburnet2/extractor_batch on the terra production VM and moving them locally for now.

max-zilla commented 7 years ago

Generated these into files.