terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Update remaining extractors to use pyClowder 2 #213

Closed max-zilla closed 7 years ago

max-zilla commented 7 years ago

To perform bulk processing efficiently and safely, and to make extractors more robust and fail-resistant, we want to migrate them from the original pyClowder to the new pyClowder 2 Python library. The original has several 'hacks' to support our pipeline; pyClowder 2 makes these formal and complete.

This is not terribly difficult, but I am testing as I go along to make sure behavior does not change.

repos to update

Testing with clowder-dev: deploy the updated repo code, then install pyClowder 2 on the extractor host (including enum and pyyaml if they are not already installed via pip).
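One way to check the extractor host before installing anything is to see which modules are already importable. This is only a minimal sketch and not part of pyClowder 2: `missing_modules` is a hypothetical helper, and it assumes pyyaml is imported under the name `yaml` and that `enum` is either the standard-library module (Python 3.4+) or a pip-installed backport.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found.

    Hypothetical helper for illustration; it only checks importability,
    it does not install anything.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pyyaml is imported as "yaml"; "enum" is stdlib on Python 3.4+.
    needed = ["enum", "yaml"]
    missing = missing_modules(needed)
    if missing:
        print("Still need to install modules for:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Anything the helper reports as missing would then be installed with pip before running the extractor.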

Once all are migrated, our tutorial materials and templates will refer to PyClowder 2 and the corresponding improvements in code structure and ease of use.

max-zilla commented 7 years ago

@jterstriep I just created a pull request for my 3D scanner extractor repo. https://github.com/terraref/extractors-3dscanner/pull/1/files

I intended for this to be the best of both codebases here. Not going to merge quite yet.

jterstriep commented 7 years ago

Part of the configuration is designed around code sharing (eliminating duplication), ease of development, manual and automated testing, automatic builds, and continuous deployments. To meet the needs of all those systems, setup.py and requirements.txt definitely need to be at the root. Of course, this assumes the additional extractors are closely related. If not, then the extractors should be in their own repository.

The Dockerfile is another matter. It might be better to have a separate directory for each extractor, each with its own Dockerfile. We might even want a base Dockerfile that builds the common elements and then customize each extractor's Dockerfile, which might consist only of overriding the CMD line. Alternatively, we could have a general Dockerfile that builds a common image (i.e. a 3dscanner image) that allows any of the extractors to be run from the command line. Either way, I think it's better not to use a custom filename like Docker..

The problem with terra.ply2las.py is that when I try to import from it the extra dot causes confusion. My solution doesn't change the name for users and fixes the import problem. Is there any reason we can't change the filename?
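To illustrate why the extra dot is a problem: Python treats `terra.ply2las` as "submodule `ply2las` of package `terra`", so a plain import never looks for a file named `terra.ply2las.py`. The sketch below (Python 3) reproduces this in a temporary directory and shows one possible workaround, loading the file by path. It is not necessarily the solution referred to above; `demo_dotted_import`, the alias `terra_ply2las`, and the file contents are made up for the demonstration.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def demo_dotted_import():
    """Show that a file with a dot in its name cannot be imported normally."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the real script; the content is purely illustrative.
        path = Path(tmp) / "terra.ply2las.py"
        path.write_text("EXTRACTOR_NAME = 'ply2las'\n")

        sys.path.insert(0, tmp)
        try:
            try:
                # Interpreted as package "terra", submodule "ply2las".
                importlib.import_module("terra.ply2las")
                plain_import_ok = True
            except ModuleNotFoundError:
                plain_import_ok = False
        finally:
            sys.path.remove(tmp)

        # Workaround: load the file by path under a dot-free module name.
        spec = importlib.util.spec_from_file_location("terra_ply2las", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return plain_import_ok, module.EXTRACTOR_NAME


if __name__ == "__main__":
    print(demo_dotted_import())
```

Renaming the file so it has no extra dot avoids the need for any such workaround.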

Are we transitioning to pyClowder2 at this point?

max-zilla commented 7 years ago

@jterstriep understood, I'll move setup.py and requirements.txt to the root. The extractors in this repo will theoretically all operate on the same type of sensor data, so it makes sense to keep them together in that regard, even if their purposes differ.

Will rename the file as well, just couldn't remember what the reason was.

We're transitioning to pyClowder 2, yeah. The primary reason is that @robkooper and I cleaned up a lot of the rough edges of pyClowder 1 that could cause issues in HPC environments. One big new feature is an error queue: if the extractor tries and fails to process large numbers of messages, we can route them into an error queue in RabbitMQ (RMQ) for examination and resubmit them to the processing queue later without losing any information. This was an important new addition before we crunched all these older datasets.
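As a rough illustration of the error-queue idea, here is a minimal in-memory sketch in Python. It is not pyClowder 2's or RabbitMQ's actual implementation: a `deque` stands in for the RabbitMQ error queue, and `process_all`, `resubmit`, and the message layout are all hypothetical names chosen for the example.

```python
from collections import deque


def process_all(messages, handler, error_queue):
    """Run handler on each message; park failures instead of dropping them."""
    processed = []
    for msg in messages:
        try:
            processed.append(handler(msg))
        except Exception as exc:
            # Keep the original message plus the failure reason for later review.
            error_queue.append({"message": msg, "error": str(exc)})
    return processed


def resubmit(error_queue):
    """Drain the error queue and return the original messages for reprocessing."""
    messages = []
    while error_queue:
        messages.append(error_queue.popleft()["message"])
    return messages


if __name__ == "__main__":
    errors = deque()
    incoming = [{"dataset": "a"}, {"dataset": "b"}, {"dataset": "c"}]

    def flaky_handler(msg):
        # Simulated failure for one dataset.
        if msg["dataset"] == "b":
            raise RuntimeError("could not process dataset")
        return msg["dataset"]

    done = process_all(incoming, flaky_handler, errors)
    print("processed:", done)
    print("parked for examination:", list(errors))

    # After the cause is fixed, the parked messages go back through processing.
    retried = process_all(resubmit(errors), lambda msg: msg["dataset"], errors)
    print("processed on retry:", retried)
```

The point of the design is that a failed message is stored whole alongside its error, so nothing needs to be reconstructed when it is resubmitted.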

max-zilla commented 7 years ago

Finished initial updates of all extractors; now starting to test updates against clowder-dev.

max-zilla commented 7 years ago

Mostly finished testing. I might need another tweak to the hyperspectral extractor logic that moves files into the same directory, but Jeff is refactoring some of that code right now.