uncharted-distil / distil

An analytic workbench for user-guided development of model pipelines
Apache License 2.0

Optimize ingest pipeline #1945

Open cdbethune opened 4 years ago

cdbethune commented 4 years ago

The ingest pipeline takes a long time to run, especially on larger datasets such as H1B and on remote sensing datasets. Now that we have a much better idea of how we will be ingesting data, we should do a complete review of its execution to see what can be removed or optimized.

A few observations:

  1. Instead of using a single ingest pipeline, we run a series of very small pipelines, incurring the TA2's execution overhead multiple times.
  2. Each stage makes a complete copy of the dataset, including any media, which is very costly when dealing with large image datasets.

cdbethune commented 4 years ago

I think I managed to cut down the run time of this pretty significantly with https://github.com/uncharted-distil/distil-compute/commit/dcf8a6bd94da2daa06b35d3dc71aa62580d0716d.

phorne-uncharted commented 3 years ago

Media data is no longer copied multiple times. Some steps that are no longer relevant have been removed.
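
For anyone hitting the same cost elsewhere, one generic way to avoid repeated media copies between stages is to hard-link files into the next stage's folder and copy only when linking fails. This is a minimal sketch, not the change that was made in Distil; `link_media` and its arguments are hypothetical names.

```python
import os
import shutil
from pathlib import Path


def link_media(src_dir: Path, dst_dir: Path) -> int:
    """Mirror src_dir into dst_dir with hard links, copying only as a fallback.

    Returns the number of files that had to be copied.
    Hypothetical helper for illustration; not part of Distil.
    """
    copied = 0
    for src in src_dir.rglob("*"):
        dst = dst_dir / src.relative_to(src_dir)
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        if dst.exists():
            # Already linked or copied by an earlier stage.
            continue
        try:
            # A hard link shares the file's data blocks, so no bytes are duplicated.
            os.link(src, dst)
        except OSError:
            # Linking can fail across filesystems or where links are unsupported.
            shutil.copy2(src, dst)
            copied += 1
    return copied
```

Hard links only work within a single filesystem, which is why the copy fallback is kept.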

For remote sensing datasets, the featurization step, which outputs a prefeaturized version of the dataset, is by far the longest step. Perhaps there is value in running it as a background step that must complete before the user can run models, clustering, etc., but that does not hold up the ingest.