sdv-dev / RDT

A library of Reversible Data Transforms

Improve `ClusterBasedNormalizer` performance #336

Open fealho opened 2 years ago

fealho commented 2 years ago

The BayesGMMTransformer's current parameters (the weight_threshold and the default values passed to the BayesianGaussianMixture) should be experimented with to improve performance, and new default values should be chosen.

The code can also be sped up. The reverse_transform is already much faster than the other two methods, and fit spends almost all of its time fitting the BayesianGaussianMixture, which is unavoidable. The biggest gains can therefore be achieved by improving the transform method, specifically the following lines:

https://github.com/sdv-dev/RDT/blob/6b07fee5f88d278c667ba0fef2d3729bb2d4195c/rdt/transformers/numerical.py#L625-L632

These lines account for the majority of the transform runtime, so any improvement would significantly speed up the whole process.
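
The linked lines are not reproduced in this thread, so the following is only a sketch of one possible direction. It assumes the hot spot is a per-row Python loop that draws one mixture component for each row from that row's probability vector (for example, the output of predict_proba). The function names, the 1e-6 smoothing term, and the array shapes are all hypothetical and not taken from the RDT code. Under that assumption, the loop can be replaced by inverse-CDF sampling over the whole array at once:

```python
import numpy as np


def sample_components_loop(component_probs, rng):
    """Hypothetical baseline: draw one component per row in a Python loop."""
    n_rows, n_components = component_probs.shape
    selected = np.zeros(n_rows, dtype=int)
    for i in range(n_rows):
        probs = component_probs[i] + 1e-6  # assumed smoothing term
        probs = probs / probs.sum()
        selected[i] = rng.choice(n_components, p=probs)
    return selected


def sample_components_vectorized(component_probs, rng):
    """Draw one component per row for all rows at once (inverse-CDF sampling)."""
    probs = component_probs + 1e-6  # same assumed smoothing term
    probs = probs / probs.sum(axis=1, keepdims=True)
    cdf = np.cumsum(probs, axis=1)
    cdf[:, -1] = 1.0  # guard against floating-point rounding in the last column
    draws = rng.random((len(probs), 1))  # one uniform draw in [0, 1) per row
    # Index of the first column whose cumulative probability exceeds the draw.
    return (draws < cdf).argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Made-up probabilities standing in for the mixture model's output.
    fake_probs = rng.dirichlet(np.ones(5), size=10_000)
    print(sample_components_vectorized(fake_probs, rng)[:10])
```

Both functions sample from the same per-row distribution, but they consume random numbers differently, so their outputs will not match element by element even with the same seed. Whether this yields a meaningful speedup in RDT would need to be confirmed by profiling the actual transform method.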

npatki commented 2 years ago

The old BayesGMMTransformer has now been renamed to ClusterBasedNormalizer in RDT 1.0. Changing the title to reflect this.