Open · Rridley7 opened this issue 1 year ago
Hi @Rridley7 , the basic functionality for this is implemented but still needs work and especially some cleanup. I hope to get this done at some point. In the meantime, you can use the top-k approach and reduce k
with an appropriate approximation factor (see the paper).
from corals.correlation.topkdiff.base import cor_topkdiff
cor_topkdiff_result = cor_topkdiff(X1, X2, k=0.001, correlation_type="spearman", n_jobs=8)
Hello @Rridley7 , were you able to run the code @mgbckr suggested?
I am trying to run corals, but unfortunately I am getting an error. It looks like it is related to the matrix having NaN values, but it persists even if I use the pandas fillna() method.
There is also a warning in the utils:
lib/python3.12/site-packages/corals/correlation/utils.py:8: RuntimeWarning: invalid value encountered in divide Xh /= np.std(Xh, axis=0) * np.sqrt(X.shape[0])
@mgbckr , do you know if the package is incompatible with Python 3.11+ and more recent libraries (scikit-learn, numpy, pandas, etc.)?
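For context, the RuntimeWarning above is what NumPy emits when a feature's standard deviation is zero, i.e. when the feature is constant across all samples. A minimal sketch reproducing the effect (plain NumPy, not corals itself):

```python
import numpy as np

# A constant (zero-variance) column makes np.std(..., axis=0) == 0, so the
# normalization step shown in the warning divides by zero and yields NaN.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])  # first column is constant across samples

std = np.std(X, axis=0)  # first entry is 0.0

with np.errstate(invalid="ignore", divide="ignore"):
    Xn = (X - X.mean(axis=0)) / (std * np.sqrt(X.shape[0]))

print(np.isnan(Xn[:, 0]).all())  # the constant column becomes all NaN
print(np.isfinite(Xn[:, 1]).all())  # the varying column is fine
```

So the warning points at constant features in the input rather than at a Python-version incompatibility.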
FYI @mgbckr : I created a conda environment following the instructions on the README.md (now with Python 3.10) and it gives me the same error.
Hi @SantosRAC , thanks for trying this! Can you share some minimal (synthetic) data and code snippet so I can try it for myself?
Hello @mgbckr !
Sure! I've attached a subsample of my matrix (10 genes vs 10 samples).
With the full matrix, I import it, set "Name" as the index column, fill NAs with zeros, and replace zeros with very small numbers to avoid division-by-zero problems. Finally, I transpose it to make sure it is in the format corals is expecting.
df = pd.read_csv('sample.tpm.txt', sep='\t')
df.set_index('Name', inplace=True)
df = df.fillna(0)
df = df.replace(0, 0.0001)
df_transposed = df.T
df_transposed.head(n=3)
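Note that replacing zeros with a small constant does not help when a gene has the same value in every sample: after the substitution its variance is still zero. One possible alternative (a sketch with synthetic data, not part of corals) is to drop zero-variance genes before transposing:

```python
import pandas as pd

# Synthetic genes-by-samples frame; "g1" is constant across samples,
# so its variance is zero and it would trigger the divide warning.
df = pd.DataFrame(
    {"ERR_a": [0.0, 3.92, 0.46],
     "ERR_b": [0.0, 5.94, 0.56]},
    index=["g1", "g2", "g3"],
)

# Keep only genes whose variance across samples is non-zero.
df_filtered = df[df.var(axis=1) > 0]
print(df_filtered.index.tolist())  # ['g2', 'g3']

# Transpose to the samples-by-genes layout corals expects.
df_transposed = df_filtered.T
```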
To run corals:
from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)
import numpy as np
from corals.correlation.topk.default import cor_topk
spearman_cor_topk_result = cor_topk(df_transposed, k=0.001, correlation_type="spearman", n_jobs=8)
Please, let me know if you need a larger subsample of my original matrix.
Thanks a lot! Can you try the following code and report the result? For some reason, things are working fine for me and I am not getting the same exception. I tried it with Python 3.10 and 3.11.
data = """
Name ERR4165185 ERR4165186 ERR4165187 ERR4165188 ERR4165189 ERR4165190 ERR4165191 ERR4165192 ERR4165193 ERR4165194
Scp1_US851008_k31_TRINITY_DN18756_c0_g1_i4 0 0 0 0 0 0 0 0 0.590832 0
Scp1_US851008_k25_TRINITY_DN10094_c0_g1_i2 3.92307 5.94091 5.58655 5.81978 5.39608 8.5646 6.63334 6.29947 4.76114 2.87519
Scp1_US851008_k25_TRINITY_DN7610_c1_g1_i24 0 0.584322 0.361902 0.372344 0.372583 0 0.16313 0 0.874301 0.567489
Scp1_US851008_k25_TRINITY_DN3138_c0_g1_i2 3.23304 3.66967 2.61626 4.58642 1.00118 7.74655 6.50818 3.50205 2.80706 3.18808
Scp1_US851008_k25_TRINITY_DN66949_c0_g1_i1 0.463693 0.55958 0.508452 0.188733 0 2.05569 0.622747 0.18403 1.22889 0.525269
Scp1_US851008_k25_TRINITY_DN42729_c0_g1_i3 NaN 0 0 0 0 0 0 0 4.98475 0
Scp1_US851008_k25_TRINITY_DN5537_c0_g1_i1 0 0 0 0 0.068946 0.997404 0.394103 0.13994 0.375641 3.50364
Scp1_US851008_k31_TRINITY_DN9195_c0_g2_i2 2.31904 1.38316 0.785248 2.0806 1.17822 0 0 0 1.1218 5.1715
Scp1_US851008_k31_TRINITY_DN9068_c0_g1_i22 0 0 0 0 0 0 0.164973 0.296276 0 2.88606
"""
from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)
import pandas as pd
from io import StringIO
from corals.correlation.topk.default import cor_topk
df = pd.read_csv(StringIO(data), sep='\t')
df.set_index('Name', inplace=True)
df_transposed = df.T
cor_topk(df_transposed, k=0.001, correlation_type="spearman", n_jobs=4)
Hello @mgbckr , how are you?
You are right. This sample really does work; however, I still cannot make it work with a larger matrix.
Is there any preferred way that I can share that full matrix with you?
Hi @SantosRAC
no, you can send it however you like :) Feel free to PM me.
Hi @SantosRAC
thanks for your e-mail and your example data! It turns out that the issue was caused by zero-variance features, which collided with the preprocessing step. corals now raises a more descriptive error. See #5.
I will keep this issue open, as it originally refers to the result not fitting into memory. If additional NaN-related issues arise, please open another issue.
Hello, thanks for developing this tool! The reported speedups look promising. I currently have a sparse dataset of ~1M features and ~300 samples for which I was looking to calculate correlations. In this case, the final matrix (1M x 1M) obviously cannot fit into memory (numpy estimates several TB of RAM would be necessary). I noted some of your recommendations for this case in your examples; however, I was wondering whether there are any current or planned implementations of your tool that would allow final matrices to be written directly to disk without holding them in RAM?
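Until such a feature lands in corals, one generic workaround (plain NumPy, not corals-specific, and without corals' top-k filtering) is to compute the correlation matrix block by block and stream each block into a disk-backed memmap, so only one block of rows is ever held in RAM:

```python
import numpy as np

def blockwise_corr_to_disk(X, path, block=256):
    """Write the feature-by-feature Pearson correlation matrix of X
    (samples x features) to `path` as a .npy-backed memmap, block by block."""
    n, p = X.shape
    # Standardize once: zero mean and unit norm per column, so that
    # Xs.T @ Xs is exactly the Pearson correlation matrix.
    # (Zero-variance columns would divide by zero here; drop them first.)
    Xs = X - X.mean(axis=0)
    Xs /= np.linalg.norm(Xs, axis=0)
    out = np.lib.format.open_memmap(path, mode="w+",
                                    dtype=np.float32, shape=(p, p))
    for i in range(0, p, block):
        # Each iteration materializes only a (block x p) slab in memory.
        out[i:i + block] = (Xs[:, i:i + block].T @ Xs).astype(np.float32)
    out.flush()
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
C = blockwise_corr_to_disk(X, "corr.npy", block=16)
print(np.allclose(np.asarray(C), np.corrcoef(X, rowvar=False), atol=1e-5))
```

For 1M features at float32 this file would still be ~4 TB, so in practice one would combine this with a sparsity threshold per block (keeping only |r| above a cutoff), which is closer in spirit to corals' top-k approach.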