paulrolland1307 / SCORE

Code used in the paper "Score matching enables causal discovery of nonlinear additive noise models", Rolland et al., ICML 2022
GNU Affero General Public License v3.0

Applying SCORE to large datasets #2

Open fred887 opened 6 months ago

fred887 commented 6 months ago

Hello,

I would like to apply SCORE to some very large datasets (e.g. DAGs with d ≈ 200 nodes and n ≈ 20000 samples). I am currently blocked by an out-of-memory error inside the Stein_hess() procedure, caused by the X_diff tensor, whose memory footprint becomes huge:

memory_X_diff = n * n * d  * sizeof(float)

For a float64 tensor (the default setting), the memory needed by X_diff is ~800 GB (~400 GB for float32, ~200 GB for float16): this is simply too big to fit in memory.

I know the standard approach would be to subsample my dataset until everything fits in memory, but I need to keep all samples for my experiments.

I have seen in your article that it is possible to use kernel approximation methods (such as MEKA) to reduce the computational load. Indeed, with MEKA we can compute the kernel matrix K without needing to build the overly large X_diff tensor. However, the X_diff tensor is still necessary to compute the nablaK and nabla2K variables.

Would you know a way to completely avoid the computation of the very large X_diff tensor?

Or any other way to compute the diagonal elements of the Hessian matrix with large datasets?

Thank you very much for your help,

Best,

Frederic

paulrolland1307 commented 6 months ago

Hello Frederic,

What you actually need is the matrix K, which is n x n and should fit in memory. Computing X_diff is a handy way to compute K, but it is indeed n x n x d. What you can do is split the computation of K into parts. Suppose you decompose your dataset into 10 subparts. Then X_diff only needs to be computed for one pair of subparts at a time, which is of size n/10 x n/10 x d and should be tractable. You have to repeat this for the 100 pairs of subparts, but it shouldn't take too long. The same approach applies to computing nablaK and nabla2K, as in the sketch below.
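For concreteness, here is a minimal sketch of the blockwise idea, assuming the RBF-type kernel of bandwidth s used in Stein_hess(); the function name and block_size are illustrative, not part of the repository:

```python
import torch

def stein_kernel_blockwise(X, s, block_size=1000):
    # Blockwise computation of K, nablaK and nabla2K for an RBF kernel of
    # bandwidth s, so that only a block_size x block_size x d slice of the
    # pairwise differences is ever materialised (sketch, not the repo code).
    n, d = X.shape
    K = torch.empty(n, n, dtype=X.dtype)
    nablaK = torch.zeros(n, d, dtype=X.dtype)
    nabla2K = torch.zeros(n, d, dtype=X.dtype)
    for a in range(0, n, block_size):
        I = slice(a, min(a + block_size, n))
        for b in range(0, n, block_size):
            J = slice(b, min(b + block_size, n))
            diff = X[I].unsqueeze(1) - X[J].unsqueeze(0)   # |I| x |J| x d
            Kb = torch.exp(-(diff ** 2).sum(dim=2) / (2 * s ** 2)) / s
            K[I, J] = Kb
            # accumulate the sums over the second sample index i:
            # nablaK[k, :]  += -(1/s^2) * sum_i K[k, i] * diff[k, i, :]
            nablaK[I] += -torch.einsum('kij,ki->kj', diff, Kb) / s ** 2
            # nabla2K[k, :] += sum_i (-1/s^2 + diff[k, i, :]^2 / s^4) * K[k, i]
            nabla2K[I] += torch.einsum('kij,ki->kj', -1 / s ** 2 + diff ** 2 / s ** 4, Kb)
    return K, nablaK, nabla2K
```

With block_size = 1000 and d = 200, each diff block takes about 1.6 GB in float64, and the peak memory is then dominated by the n x n matrix K itself.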

The next challenge you will face is inverting the large dense matrix K. But you don't necessarily need to invert it: instead, solve the linear system (K + eta*I) X = nabla2K for X. This can be done using classical techniques, e.g. conjugate gradient. Also note that this is a matrix linear system where X is of size n x d, but it can be separated per column into d vector linear systems (see the sketch below).
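A possible sketch of that column-by-column solve, assuming K and nabla2K are available as dense NumPy arrays (the function name is illustrative):

```python
import numpy as np
from scipy.sparse.linalg import cg

def solve_columns_cg(K, nabla2K, eta):
    # Solve (K + eta*I) X = nabla2K with conjugate gradient, one column at a
    # time, instead of forming the explicit inverse of K + eta*I.
    n, d = nabla2K.shape
    A = K + eta * np.eye(n)          # symmetric positive definite for eta > 0
    X = np.empty((n, d))
    for j in range(d):               # d independent vector linear systems
        x_j, info = cg(A, nabla2K[:, j])
        if info != 0:
            raise RuntimeError(f"CG did not converge for column {j}")
        X[:, j] = x_j
    return X
```

For very large n, A could also be passed to cg as a scipy.sparse.linalg.LinearOperator so that the regularised matrix is never formed explicitly, but the dense version above is the simplest starting point.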

Does this make sense?

Best,

Paul


fred887 commented 6 months ago

Hello Paul,

Thank you very much for your suggestions!

Indeed, computing K via a block decomposition of X_diff will allow the full K matrix to be built block by block, replacing the single huge X_diff tensor with a series of smaller sub-blocks of X_diff.

For the K matrix, we can indeed use square blocks of X_diff with a shape such as n/10 x n/10 x d.

However, for the nablaK and nabla2K matrices, I think we will instead need to use a series of full row slices of X_diff (of shape such as n/100 x n x d), since we need to sum the components of X_diff over the full dataset of size n, as expressed in the Einstein summation for nablaK: [image: einsum expression for nablaK]
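As an illustration of that slicing, here is a minimal sketch for nablaK, assuming the precomputed n x n kernel matrix K and bandwidth s from Stein_hess() (the function name and rows_per_slice are illustrative):

```python
import torch

def nablaK_by_row_slices(X, K, s, rows_per_slice=200):
    # Compute nablaK one row slice at a time: each slice of X_diff has shape
    # (rows_per_slice, n, d), so the sum over the full dataset (index i) is
    # carried out inside the slice.  Sketch only, not the repository code.
    n, d = X.shape
    nablaK = torch.empty(n, d, dtype=X.dtype)
    for a in range(0, n, rows_per_slice):
        I = slice(a, min(a + rows_per_slice, n))
        diff = X[I].unsqueeze(1) - X.unsqueeze(0)        # rows_per_slice x n x d
        # nablaK[k, :] = -(1/s^2) * sum_i K[i, k] * X_diff[k, i, :]
        nablaK[I] = -torch.einsum('kij,ik->kj', diff, K[:, I]) / s ** 2
    return nablaK
```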

I also agree with you regarding solving the linear system (K + eta*I) X = nabla2K with conjugate gradient as a way to avoid inverting the large K matrix: thank you very much for the suggestion!

Best,

Frederic