We need to drastically change how datasets are projected. The main issue is loading the entire dataframe into memory. How do we get around that?
Enable chunked processing of columns to determine normalization parameters (e.g. min/max, categories, ...). These are captured as closure variables in the featurizers (see the sketch below).
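A minimal sketch of how such a chunked pass could look. The helper name `make_numeric_featurizer`, the CSV-based chunking, and the chunk size are illustrative assumptions, not the actual implementation:

```python
import numpy as np
import pandas as pd

def make_numeric_featurizer(path: str, column: str, chunksize: int = 50_000):
    """Scan the file in chunks to find min/max, then return a normalizer
    that closes over those values (hypothetical helper)."""
    col_min, col_max = np.inf, -np.inf
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        col_min = min(col_min, chunk[column].min())
        col_max = max(col_max, chunk[column].max())

    span = (col_max - col_min) or 1.0  # guard against constant columns

    def featurize(values: pd.Series) -> np.ndarray:
        # col_min/span are closure variables; no chunk stays alive here
        return ((values - col_min) / span).to_numpy()

    return featurize
```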
If we determine that the dataset is too large, we precompute an incremental PCA over chunked parts of the data.
If that already yields only 2 components, we return them directly as our coordinates.
Otherwise, we compute the actual projection on the PCA components/features (see the sketch below).
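A sketch of the two-pass incremental PCA, assuming a purely numeric file; sklearn's `IncrementalPCA` is a real API, while the paths and chunk size are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA

def incremental_pca_coordinates(path: str, n_components: int = 2,
                                chunksize: int = 50_000) -> np.ndarray:
    """Two-pass incremental PCA: fit on chunks, then transform chunks.
    Only one chunk is ever resident in memory."""
    ipca = IncrementalPCA(n_components=n_components)
    # Pass 1: fit the PCA incrementally (each chunk must contain
    # at least n_components rows).
    for chunk in pd.read_csv(path, chunksize=chunksize):
        ipca.partial_fit(chunk.to_numpy())
    # Pass 2: project each chunk onto the fitted components.
    parts = [ipca.transform(chunk.to_numpy())
             for chunk in pd.read_csv(path, chunksize=chunksize)]
    return np.vstack(parts)
```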
If Gower's distance is used, we only support datasets with at most 10k rows, as the distance matrix grows quadratically.
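The quadratic blow-up is concrete: a full pairwise matrix for n rows holds n² float64 entries, so the 10k cap already means 10,000² × 8 bytes ≈ 800 MB. A hypothetical guard:

```python
MAX_GOWER_ROWS = 10_000  # n^2 float64 matrix: 10k rows -> ~800 MB

def check_gower_supported(n_rows: int) -> None:
    """Reject datasets whose Gower distance matrix would not fit in memory."""
    if n_rows > MAX_GOWER_ROWS:
        raise ValueError(
            f"Gower's distance needs an {n_rows}x{n_rows} matrix "
            f"(~{n_rows**2 * 8 / 1e9:.1f} GB); "
            f"max supported is {MAX_GOWER_ROWS} rows."
        )
```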
If the dataset fits in memory, we run the actual projection on the full data (see the dispatch sketch below).
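A hypothetical dispatch tying the two paths together, building on `incremental_pca_coordinates` from the sketch above; umap-learn is assumed as the projection backend:

```python
import numpy as np
import pandas as pd
import umap  # umap-learn; assumed projection backend

def compute_coordinates(path: str, n_features: int,
                        fits_in_memory: bool) -> np.ndarray:
    """Dispatch between the in-memory and chunked paths (hypothetical)."""
    if fits_in_memory:
        # Small dataset: load everything and project directly.
        features = pd.read_csv(path).to_numpy()
    else:
        # Large dataset: reduce with incremental PCA first (sketch above).
        n_components = min(10, n_features)
        features = incremental_pca_coordinates(path, n_components)
        if n_components == 2:
            # PCA already yielded 2 components: these are the coordinates.
            return features
    return umap.UMAP(n_components=2).fit_transform(features)
```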
This reduces memory usage by orders of magnitude for large datasets (e.g. from 10GB+ down to 1-2GB for 300k rows).
Screenshots
Result of UMAP on a dataset with 315k rows for the sulfonyl feature (157 features). Took around 15 min.