mgbckr / corals-lib-python

Final matrix unable to fit in memory - recommendations #1

Open Rridley7 opened 1 year ago

Rridley7 commented 1 year ago

Hello, thanks for developing this tool! The reported speedups look promising. I have a sparse dataset of ~1M features and ~300 samples for which I would like to compute correlations. The final matrix (1M x 1M) obviously does not fit into memory (numpy estimates that several TB of RAM are necessary). I noted the recommendations for this case in your examples, but I was wondering whether there are any current or planned implementations of your tool that would allow the final matrix to be written directly to disk without holding it in RAM?
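For a rough sense of scale, the estimate can be reproduced with a few lines of arithmetic. This is only a sketch assuming dense storage; the float32 and upper-triangle figures are illustrative variants, not something corals is known to do.

```python
# Back-of-the-envelope memory estimate for a dense feature-by-feature matrix.
n_features = 1_000_000

# Full matrix stored as float64 (8 bytes per value).
dense_bytes = n_features ** 2 * 8
print(f"dense float64: {dense_bytes / 1e12:.1f} TB")

# Illustrative variant: strict upper triangle only, stored as float32 (4 bytes).
upper_f32_bytes = n_features * (n_features - 1) // 2 * 4
print(f"upper triangle, float32: {upper_f32_bytes / 1e12:.1f} TB")
```

Even the reduced variant is in the terabyte range, which is why thresholding or top-k selection matters more than the storage format.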

mgbckr commented 1 year ago

Hi @Rridley7 , the basic functionality for this is implemented but still needs work and especially some cleanup. I hope to get this done at some point. In the meantime, you can use the top-k approach and reduce k with an appropriate approximation factor (see the paper).

from corals.correlation.topkdiff.base import cor_topkdiff
cor_topkdiff_result = cor_topkdiff(X1, X2, k=0.001, correlation_type="spearman", n_jobs=8)

SantosRAC commented 2 months ago

Hello @Rridley7 , were you able to run the code @mgbckr suggested?

I am trying to run corals, but unfortunately I am getting an error. It looks like it is related to the matrix having NaN values, but it persists even if I use the pandas fillna() method.

[Image: screenshot of the error]

There is also a warning in the utils:

lib/python3.12/site-packages/corals/correlation/utils.py:8: RuntimeWarning: invalid value encountered in divide
  Xh /= np.std(Xh, axis=0) * np.sqrt(X.shape[0])
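One way this kind of warning can arise is a column whose standard deviation is zero, so that the division becomes 0/0. The snippet below is a minimal sketch with made-up data; the centering step is an assumption about what precedes the quoted line, not a copy of the corals source.

```python
import warnings

import numpy as np

# Toy samples-by-features matrix; the second column is constant.
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.0]])

Xh = X - X.mean(axis=0)  # assumed centering step
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same expression as in the warning: the constant column yields 0 / 0.
    Xh /= np.std(Xh, axis=0) * np.sqrt(X.shape[0])

print([str(w.message) for w in caught])
print(Xh)  # the constant column is NaN, the other column is finite
```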

@mgbckr , do you know if the package is incompatible with Python 3.11+ and more recent libraries (scikit-learn, numpy, pandas, etc.)?

SantosRAC commented 2 months ago

FYI @mgbckr : I created a conda environment following the instructions on the README.md (now with Python 3.10) and it gives me the same error.

mgbckr commented 2 months ago

Hi @SantosRAC , thanks for trying this! Can you share some minimal (synthetic) data and code snippet so I can try it for myself?

SantosRAC commented 2 months ago

Hello @mgbckr !

Sure! I've attached a subsample of my matrix (10 genes vs 10 samples).

sample.tpm.txt

With the full matrix, I am importing, setting "Name" as the index column, trying to fill NAs with zeros and then trying to replace zeros with very small numbers to avoid problems with division by zero. Finally, I transpose it to make sure it is in the format corals is expecting.

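import pandas as pd

# Load the tab-separated TPM matrix (genes as rows, samples as columns)
df = pd.read_csv('sample.tpm.txt', sep='\t')
df.set_index('Name', inplace=True)
df = df.fillna(0)            # fill missing values with zeros
df = df.replace(0, 0.0001)   # replace zeros with a small constant
df_transposed = df.T         # transpose into the layout corals expects
df_transposed.head(n=3)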

To run corals:

from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)
import numpy as np
from corals.correlation.topk.default import cor_topk
spearman_cor_topk_result = cor_topk(df_transposed, k=0.001, correlation_type="spearman", n_jobs=8)

Please, let me know if you need a larger subsample of my original matrix.

mgbckr commented 2 months ago

Thanks a lot! Can you try the following code, and report the result? For some reason, things are working fine for me and I am not getting the same exception. I tried it with Python 3.10 and 3.11.

data = """
Name    ERR4165185  ERR4165186  ERR4165187  ERR4165188  ERR4165189  ERR4165190  ERR4165191  ERR4165192  ERR4165193  ERR4165194
Scp1_US851008_k31_TRINITY_DN18756_c0_g1_i4  0   0   0   0   0   0   0   0   0.590832    0
Scp1_US851008_k25_TRINITY_DN10094_c0_g1_i2  3.92307 5.94091 5.58655 5.81978 5.39608 8.5646  6.63334 6.29947 4.76114 2.87519
Scp1_US851008_k25_TRINITY_DN7610_c1_g1_i24  0   0.584322    0.361902    0.372344    0.372583    0   0.16313 0   0.874301    0.567489
Scp1_US851008_k25_TRINITY_DN3138_c0_g1_i2   3.23304 3.66967 2.61626 4.58642 1.00118 7.74655 6.50818 3.50205 2.80706 3.18808
Scp1_US851008_k25_TRINITY_DN66949_c0_g1_i1  0.463693    0.55958 0.508452    0.188733    0   2.05569 0.622747    0.18403 1.22889 0.525269
Scp1_US851008_k25_TRINITY_DN42729_c0_g1_i3  NaN 0   0   0   0   0   0   0   4.98475 0
Scp1_US851008_k25_TRINITY_DN5537_c0_g1_i1   0   0   0   0   0.068946    0.997404    0.394103    0.13994 0.375641    3.50364
Scp1_US851008_k31_TRINITY_DN9195_c0_g2_i2   2.31904 1.38316 0.785248    2.0806  1.17822 0   0   0   1.1218  5.1715
Scp1_US851008_k31_TRINITY_DN9068_c0_g1_i22  0   0   0   0   0   0   0.164973    0.296276    0   2.88606
"""

from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)

import pandas as pd
from io import StringIO
from corals.correlation.topk.default import cor_topk

# split on any whitespace so the snippet works with tab- or space-separated columns
df = pd.read_csv(StringIO(data), sep=r'\s+')
df.set_index('Name', inplace=True)
df_transposed = df.T

cor_topk(df_transposed, k=0.001, correlation_type="spearman", n_jobs=4)

SantosRAC commented 1 month ago

Hello @mgbckr , how are you?

You are right. This sample really works; however, I still cannot make it work with a larger matrix.

Is there any preferred way that I can share that full matrix with you?

mgbckr commented 1 month ago

Hi @SantosRAC

no, you can send it however you like :) Feel free to PM me.

mgbckr commented 1 month ago

Hi @SantosRAC

thanks for your e-mail and your example data! It turns out that the issue was zero-variance features, which collided with the preprocessing step. corals now raises a more descriptive error. See #5
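Until such features are handled upstream, one possible workaround is to drop constant features before calling corals. The following is a minimal sketch on hypothetical toy data (the gene and sample names are made up); it also shows why replacing zeros with a small constant does not help, since the column stays constant.

```python
import pandas as pd

# Toy samples-by-features frame (hypothetical data); "gene_b" is all zeros.
df_transposed = pd.DataFrame(
    {
        "gene_a": [3.9, 5.9, 5.6, 5.8],
        "gene_b": [0.0, 0.0, 0.0, 0.0],
        "gene_c": [0.0, 0.58, 0.36, 0.37],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Replacing zeros with a small constant leaves the column constant.
patched = df_transposed.replace(0, 0.0001)
print(patched["gene_b"].nunique())

# Keep only features with more than one distinct value.
n_unique = df_transposed.nunique(axis=0, dropna=True)
filtered = df_transposed.loc[:, n_unique > 1]
print(list(filtered.columns))
```

Counting distinct values rather than thresholding the standard deviation avoids floating-point edge cases, and it applies equally to Spearman, because a constant column also has constant ranks.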

I will keep this issue open as it originally refers to the issue of the result not fitting into memory. If additional issues arise in the NaN context, please open another issue.
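For the original question, a generic block-wise approach can be sketched with plain numpy while the built-in functionality is pending. This is not the corals implementation or API: `pearson_to_memmap`, its parameters, and the file name are hypothetical, the data is synthetic, and the sketch assumes there are no zero-variance features.

```python
import os
import tempfile

import numpy as np


def pearson_to_memmap(X, path, block_size=256):
    """Write the feature-by-feature Pearson correlation of X
    (samples x features) to a float32 memmap, one row block at a time."""
    n_features = X.shape[1]
    # Center and scale each column to unit norm, so that the correlation
    # matrix is a plain matrix product (assumes no zero-variance features).
    Z = X - X.mean(axis=0)
    Z /= np.linalg.norm(Z, axis=0)
    out = np.memmap(path, dtype=np.float32, mode="w+",
                    shape=(n_features, n_features))
    for start in range(0, n_features, block_size):
        stop = min(start + block_size, n_features)
        # Only one (block_size x n_features) slab is held in RAM at a time.
        out[start:stop, :] = Z[:, start:stop].T @ Z
    out.flush()
    return out


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))  # synthetic data: 30 samples, 50 features
path = os.path.join(tempfile.mkdtemp(), "cor.dat")
cor = pearson_to_memmap(X, path, block_size=16)
print(cor.shape)
```

For Spearman, the columns would first be replaced by their ranks. Note that this only moves the matrix from RAM to disk: at 1M features the float32 file would still be about 4 TB, so combining it with a threshold or the top-k approach is likely still necessary.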