Open cdbethune opened 4 years ago
I think I managed to cut down the run time of this pretty significantly with https://github.com/uncharted-distil/distil-compute/commit/dcf8a6bd94da2daa06b35d3dc71aa62580d0716d.
Media data is no longer copied multiple times. Some steps that are no longer relevant have been removed.
For remote sensing datasets, the featurization step, which outputs a prefeaturized version of the dataset, is by far the longest step. Perhaps there is value in running it as a background step that has to complete before the user can run models, clustering, etc., but that does not hold up the ingest.
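To make the background-step idea concrete, here is a rough sketch in Python. It is not how distil-compute works today, and every name in it (`featurize`, `ingest`, `run_model`, the dataset id) is hypothetical. It assumes ingest can hand back a handle to the pending featurization, which the model and clustering paths then wait on.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical names throughout; none of these exist in distil-compute.
_executor = ThreadPoolExecutor(max_workers=1)


def featurize(dataset_id: str) -> str:
    """Stand-in for the slow prefeaturization step."""
    time.sleep(0.1)  # simulate long-running work
    return f"{dataset_id}-prefeaturized"


def ingest(dataset_id: str) -> Future:
    """Run the fast ingest steps, then queue featurization without waiting."""
    # ... the fast ingest steps would run here ...
    return _executor.submit(featurize, dataset_id)


def run_model(featurized: Future) -> str:
    """Model and clustering calls block until featurization has finished."""
    return f"model trained on {featurized.result()}"
```

With this shape, `ingest("some_dataset")` returns right away, and only a later `run_model(...)` call pays for whatever featurization time is still outstanding.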
The ingest pipeline takes a long time to run, especially with larger datasets like H1B and with remote sensing datasets. Now that we have a much better idea of how we will be ingesting data, we should do a complete review of its execution to see what can be removed or optimized.
A few observations: