octopize / saiph

A projection package
https://saiph.readthedocs.io
Apache License 2.0

feat(saiph): implement a low-rank SVD with random methods for faster computations #91

Closed · albanfelix closed this 1 year ago

albanfelix commented 1 year ago

Link to the notion page for details: https://www.notion.so/octopize/Randomized-SVD-results-b30df20b0c7f41ebbcfa3b9325caf49f

Implementation in avatar/core: https://github.com/octopize/avatar/pull/777

The randomized SVD is used only when the specified number of retained dimensions nf is smaller than the dimension of the matrix to decompose.
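As a rough sketch of that dispatch (helper names are illustrative, not Saiph's actual API; both helpers are sketched in the sections below):

```python
import numpy as np

def fit_svd(A: np.ndarray, nf: int | None):
    """Hypothetical dispatch: take the low-rank randomized path only
    when nf is strictly smaller than the matrix dimension, otherwise
    fall back to the current full SVD."""
    m, n = A.shape
    if nf is not None and nf < min(m, n):
        return randomized_svd(A, nf)  # sketched under "Randomized SVD"
    return full_svd(A)                # sketched under "Full SVD"
```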

Context

When we face high-dimensional data, computing the SVD can be very expensive. Moreover, we often face cases where a few singular values are much larger than the others, meaning that a few dimensions are enough to preserve the dataset's variance.

The benchmarks used a fake dataset from create_csv: 1M rows, 10 categorical and 20 numerical features (120 dimensions after dummification, i.e. one-hot encoding).

Full SVD

The full SVD is computed using the current Saiph implementation of the SVD (scipy).
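For reference, a minimal sketch of what the full, scipy-based decomposition looks like (the actual call site lives in the Saiph codebase):

```python
import numpy as np
from scipy.linalg import svd

def full_svd(A: np.ndarray):
    """Full economy-size SVD via SciPy/LAPACK. Time is roughly
    O(m * n * min(m, n)) and the whole decomposition is materialized,
    which drives the peak memory measured below."""
    U, s, Vt = svd(A, full_matrices=False)
    return U, s, Vt
```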

Tracemalloc memory: current = 8,605,724 B (~8.6 MB), peak = 6,419,574,291 B (~6.4 GB)
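(The figures are in bytes, as reported by the standard-library tracemalloc module; a minimal sketch of how such numbers are captured, the actual benchmark harness being outside this PR:)

```python
import tracemalloc

tracemalloc.start()
U, s, Vt = full_svd(A)  # operation under measurement
current, peak = tracemalloc.get_traced_memory()  # both in bytes
tracemalloc.stop()
print(f"current={current:,} peak={peak:,}")
```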

[figure: full SVD benchmark results, k=5]

Randomized SVD

The fixed-rank SVD approximation computes an SVD with a smaller number of dimensions l, where l < min(m, n).

We used the algorithms described in Halko, Martinsson & Tropp (http://arxiv.org/abs/0909.4061), pages 27 and 29.
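A minimal NumPy sketch of that two-stage scheme (randomized range finder plus a small dense SVD; the oversampling and power-iteration defaults here are illustrative, not necessarily what this PR ships):

```python
import numpy as np

def randomized_svd(A, nf, n_oversamples=10, n_power_iter=2, seed=None):
    """Fixed-rank SVD approximation after Halko, Martinsson & Tropp.

    Stage A: build an orthonormal basis Q whose range approximates
    the range of A by sketching A with a Gaussian test matrix.
    Stage B: project A onto Q and take a small (l x n) dense SVD
    instead of the full (m x n) one.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = min(nf + n_oversamples, min(m, n))  # sketch size, l < min(m, n)

    # Stage A: randomized range finder.
    Y = A @ rng.standard_normal((n, l))
    Q, _ = np.linalg.qr(Y)

    # Power iterations sharpen the basis when singular values decay slowly.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Stage B: small dense SVD on the projected matrix.
    B = Q.T @ A                                   # (l, n)
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_tilde)[:, :nf], s[:nf], Vt[:nf]
```

On the benchmark above this would be called as, e.g., U, s, Vt = randomized_svd(X, nf=20) for the 1M × 120 dummified matrix.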

Tracemalloc memory: current = 8,612,208 B (~8.6 MB), peak = 4,096,625,713 B (~4.1 GB)

Number of retained dimensions (fixed-rank SVD approximation): nf = 20 out of the 120 dimensions after dummification.

[figure: randomized SVD benchmark results, k=5]

Results

With a significantly lower number of retained dimensions (e.g. nf < 0.5 × the total number of dimensions), the randomized SVD gives better performance in both time and memory. However, to preserve high-quality avatar data, most of the variance must be explained by these few dimensions, so it is not applicable to every use case.
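To make that criterion concrete: since the squared Frobenius norm of A equals the sum of all squared singular values, the variance retained by the nf kept dimensions can be estimated without ever computing the full SVD. A sketch of such a check (the 90% threshold is an assumption, not a Saiph rule):

```python
import numpy as np

def retained_variance_ratio(A: np.ndarray, s_nf: np.ndarray) -> float:
    """Approximate share of total variance (||A||_F^2, i.e. the sum of
    all squared singular values) captured by the retained ones."""
    return float((s_nf ** 2).sum() / np.linalg.norm(A, "fro") ** 2)

# Hypothetical guard: trust the low-rank path only when the nf
# retained dimensions explain, say, at least 90% of the variance.
# applicable = retained_variance_ratio(X, s) >= 0.90
```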

jpetot commented 1 year ago

also, the CI is failing 😉