Optimize ingest pipeline

cdbethune commented 4 years ago

The ingest pipeline takes a long time to run, especially on larger datasets like H1B and on remote sensing datasets. Now that we have a much better idea of how we will be ingesting data, we should do a complete review of the pipeline's execution to see what can be removed or optimized.

A few observations:

Instead of using a single ingest pipeline, we run a series of very small pipelines, incurring the overhead of execution by the TA2 multiple times.
Each stage makes a complete copy of the dataset, including any media, which is very costly when dealing with large image datasets.
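To illustrate the second observation, here is a minimal sketch of how a stage could write its output folder without duplicating media. It is not the actual Distil implementation: the `stage_output` function, the `media` folder name, and the flat dataset layout are all assumptions made for the example. Small files such as the CSV and schema are copied, while the media folder is symlinked back to the original.

```python
import os
import shutil
from pathlib import Path


def stage_output(src: Path, dst: Path, media_dir: str = "media") -> None:
    """Build a stage's output folder from src without copying media files.

    Hypothetical helper: `media_dir` is an assumed folder name, not a
    documented part of the dataset layout.
    """
    dst.mkdir(parents=True, exist_ok=True)
    for entry in src.iterdir():
        target = dst / entry.name
        if entry.is_dir() and entry.name == media_dir:
            # Link to the original media folder instead of duplicating it.
            os.symlink(entry.resolve(), target, target_is_directory=True)
        elif entry.is_dir():
            # Other folders are assumed small enough to copy.
            shutil.copytree(entry, target)
        else:
            # Tabular data and schema files are copied so the stage can edit them.
            shutil.copy2(entry, target)
```

With this approach the cost of each stage is driven by the size of the tabular data rather than the media, at the price of every stage sharing (and so not being free to modify) the same media files.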

cdbethune commented 4 years ago

I think I managed to cut down the run time of this pretty significantly with https://github.com/uncharted-distil/distil-compute/commit/dcf8a6bd94da2daa06b35d3dc71aa62580d0716d.

phorne-uncharted commented 3 years ago

Media data is no longer copied multiple times. Some steps that are no longer relevant have been removed.

For remote sensing datasets, the featurization step, which outputs a prefeaturized version of the dataset, is by far the longest step. Perhaps there is value in running it as a background step that has to complete before the user can run models, clustering, etc., but does not hold up the ingest.

uncharted-distil / distil

Optimize ingest pipeline #1945