Closed: lazappi closed this issue 4 years ago.
So either this is a "malformed matrix line 83 2661 9.753568422904793e-309" error in the multiprocessing part that is done in C++, and then a particular file can't be found by the downstream analysis in Python... or you just don't have any files in /tmp/lisi_tmp1601377709. Could you check if there are any files in there?
Otherwise, @mbuttner... do you have any ideas?
The temp directory has an input.mtx file but none of the _indices* or _distances* files that my other LISI temp directories have.
It seems like the knn-graph has not been computed by the cpp function. That's why the different 'chunk files' are not present.
Where could the line "malformed matrix line 83 2661 9.753568422904793e-309" be coming from? Not from the cpp function? And if one of the cpp function workers fails, would the others still produce output files?
What does the input.mtx file look like? In the LISI code, I used

mmwrite(mtx_file_path,
        adata.uns['neighbors']['connectivities'],
        symmetry='general')

to create the input.mtx file. If it is empty (or nonsense), that might explain why the cpp function did not produce any output.
> Where could the line "malformed matrix line 83 2661 9.753568422904793e-309" be coming from? Not from the cpp function? And if one of the cpp function workers fails, would the others still produce output files?
The cpp function runs only once to compute the knn-graph on the full data matrix and splits the output (an index list) into chunks, so that has nothing to do with the different workers. The 'malformed matrix line' error (I haven't seen it yet) might be caused by the 'file not found' error. That's why I suspect the error is caused by an erroneous input to the cpp function.
What is your scanpy version? Is the nearest neighbor graph still located at adata.uns['neighbors']['connectivities'], or do you have an .obsp object instead?
I would presume that the cpp run failed due to the 'malformed matrix line' error and therefore produced no output (if it's just a single job, one error means no files are generated), which meant the downstream error arose that no input file was available to read for further calculations. But of course... I have no idea here 😄
> What does the input.mtx file look like?
It looks ok as far as I can tell
>>> mat = mmread('input.mtx')
>>> mat
<4314x4314 sparse matrix of type '<class 'numpy.float64'>'
with 86450 stored elements in COOrdinate format>
>>> mat_dense = mat.todense()
>>> mat_dense
matrix([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
>>> scipy.stats.describe(mat_dense)
DescribeResult(nobs=4314, minmax=(array([0., 0., 0., ..., 0., 0., 0.]), array([1., 1., 1., ..., 1., 1., 1.])), mean=array([0.00149025, 0.00100387, 0.00126313, ..., 0.00111166, 0.00090563,
0.00126314]), variance=array([0.00086184, 0.0005056 , 0.00051443, ..., 0.00070479, 0.00040159,
0.00078288]), skewness=array([28.81485849, 30.74991031, 27.84944371, ..., 31.60309792,
35.09651265, 28.52167416]), kurtosis=array([ 938.79817136, 1120.46076981, 1009.07881687, ..., 1081.65372246,
1546.08669569, 900.12986143]))
> What is your scanpy version? Is the nearest neighbor graph still located at adata.uns['neighbors']['connectivities'] or do you have an .obsp object instead?
I have scanpy v1.4.6 and anndata v0.7.1. It's only a problem for scanorama on a couple of scenarios so I think it's something to do with that integration output, not something general.
OK, but it puzzles me that you don't have the _indices* and _distances* files in your tmp directory. Have they been removed automatically, or were they never created in the first place?
Coming back to the malformed matrix error: I also assume that a distance of 9e-309 is too small to be represented as a double in C++.
The error is hardcoded in the C++ part:
Any idea what the error refers to?
Okay, so weight is a C++ 64-bit double, which has a range of about ±1.7e+308, a smallest normalized magnitude of about 2.2e-308, and roughly 16 decimal digits of precision. The e-309 value falls below the normalized range, which the parser apparently rejects. It seems that scanorama uses a more precise method internally for its calculations. I would probably just look for the value, set it to 0, and rerun.
The question is... should we catch this in the code or not? If we look for this error in the LISI C++ code, we will slow down the metric further, which doesn't seem like a good idea. We could look for this in the Python code though. The real question is: how does Python store this value? Could you look for the offending value, @lazappi, and check how Python stores it internally? At least we know it should be the value in adata.X[adata.obs_names[83], adata.var_names[2661]].
I can't get the indexing you suggested to work, not sure what is going on here:
>>> adata.X[adata.obs_names[83],adata.var_names[2661]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 33, in __getitem__
row, col = self._validate_indices(key)
File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 137, in _validate_indices
row = self._asindices(row, M)
File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 161, in _asindices
raise IndexError('Index dimension must be <= 2')
IndexError: Index dimension must be <= 2
>>> adata.obs_names[83]
'human3_lib3.final_cell_0803-0'
>>> adata.var_names[2661]
'GAS8'
>>> adata.X['human3_lib3.final_cell_0803-0', 'GAS8']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 33, in __getitem__
row, col = self._validate_indices(key)
File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 137, in _validate_indices
row = self._asindices(row, M)
File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 161, in _asindices
raise IndexError('Index dimension must be <= 2')
IndexError: Index dimension must be <= 2
Using numbers for indexing doesn't show anything interesting:
>>> print(adata.X[80:85, 2658:2663])
(0, 0) -0.002799201
(0, 1) -0.007948541
(0, 2) 0.001190949
(0, 3) 0.0025276951
(0, 4) -0.00058060087
(1, 0) -0.0064782263
(1, 1) -0.00036682197
(1, 2) -0.004223289
(1, 3) 0.014151698
(1, 4) -0.0005493724
(2, 0) -0.0060044294
(2, 1) -0.0008786302
(2, 2) -0.0067326566
(2, 3) 0.024539318
(2, 4) -0.0005234006
(3, 0) -0.0020147967
(3, 1) 0.02790912
(3, 2) 0.0010080308
(3, 3) 0.0024890346
(3, 4) -0.00047112707
(4, 0) 0.014065156
(4, 1) 0.011179721
(4, 2) 0.0041127596
(4, 3) 0.0038861749
(4, 4) -0.000660761
Looking for the smallest absolute value in X:
>>> x = adata.X.toarray()
>>> x_abs = np.absolute(x)
>>> x_abs.argmin()
22979577
>>> x.flatten()[22979577]
-4.0687386e-13
I tried to run lisi_graph() manually to verify the error, but I couldn't quite get it working.
The indexing error is just numpy being numpy. I forgot you'd have to index the AnnData separately and before the .X: adata[adata.obs_names[83], :][:, adata.var_names[2661]].X. Otherwise, using the numbers would have been better, so adata.X[83, 2661] should work, I guess.
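For reference, a minimal sketch of the positional indexing on a stand-in sparse matrix (the random matrix below is illustrative, not the thread's data):

```python
import scipy.sparse as sp

# Stand-in for adata.X: a random sparse matrix, NOT the thread's data.
X = sp.random(4314, 3000, density=0.01, format="csr", random_state=0)

# Label-based indexing like X['cell', 'gene'] fails on scipy matrices;
# integer positions index directly into the sparse matrix.
value = X[83, 2661]
print(value)
```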
Can you also check for the smallest value in adata.uns['neighbors']['connectivities']?
It should then be (3, 3) 0.0024890346 in your list... Ah... just saw @mbuttner's comment... these are connectivity values, not in adata.X.
The neighbours aren't stored in the object so I had to compute them. I tried to do that the same way as the script but might be a bit different.
>>> scIB.preprocessing.reduce_data(adata_int, n_top_genes=None, neighbors=True, use_rep='X_pca', pca=True, umap=False)
Finding the smallest non-zero connectivity values:
>>> x = adata_int.uns['neighbors']['connectivities'].toarray()
>>> i, j = np.where(x==np.min(x[np.nonzero(x)]))
>>> x[i, j]
array([2.65916e-318, 2.65916e-318])
There are definitely some tiny values there. For some reason the indices are different though.
>>> i
array([2851, 3974])
>>> j
array([3974, 2851])
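The densified search works, but the same lookup can be done on the stored entries of the sparse matrix directly (a sketch, my addition; for a symmetric connectivities matrix both mirror entries (i, j) and (j, i) will be reported, which matches the paired indices above):

```python
import numpy as np
import scipy.sparse as sp

def smallest_nonzero_entries(mat):
    """Smallest non-zero magnitude in a sparse matrix plus its coordinates,
    computed on the stored entries only (no .toarray())."""
    coo = mat.tocoo()
    mags = np.abs(coo.data.astype(np.float64))
    mags[mags == 0] = np.inf               # skip explicitly stored zeros
    vmin = mags.min()
    hits = np.flatnonzero(mags == vmin)
    return vmin, list(zip(coo.row[hits].tolist(), coo.col[hits].tolist()))
```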
As far as I can tell these are stored as Python floats (but I'm not really sure how to check this).
>>> x_list = x[i, j].tolist()
>>> x_list
[2.65916e-318, 2.65916e-318]
>>> type(x_list[0])
<class 'float'>
Float info for my system (from https://stackoverflow.com/questions/1835787/what-is-the-range-of-values-a-float-can-have-in-python):
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
# Normalised min (with loss of precision)
>>> sys.float_info.min * sys.float_info.epsilon
5e-324
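As context (my addition, not from the thread): values like 2.65916e-318 are subnormal doubles. sys.float_info.min only reports the smallest normalized value; Python can still store smaller magnitudes, down to about 5e-324, at reduced precision:

```python
import sys

# sys.float_info.min is the smallest *normalized* double (~2.2e-308);
# subnormal doubles extend the representable range down to ~5e-324.
tiny = 2.65916e-318        # one of the connectivity values from the thread

assert tiny > 0.0                      # representable, not flushed to zero
assert tiny < sys.float_info.min       # but below the normalized minimum
# the smallest subnormal is min * epsilon, which the literal 5e-324 rounds to
assert sys.float_info.min * sys.float_info.epsilon == 5e-324
print("subnormal doubles confirmed")
```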
Hmm. Interesting. When you use scanorama, you obtain two different outputs: an embedding (stored in X_emb) and the corrected data matrix. I assume that when you compute the neighbors based on X_pca, the distances are different from the ones you get when you use X_emb.
Otherwise, the distances are pretty low, so we could write a function that shifts all values below 1e-308 to that minimum in the Python script before we run the C++ code. I think this shouldn't take much time/memory, but it might distort the neighborhood graph (especially if there are a couple of these small values). Another option would be to multiply all values by a sufficiently large constant so that we keep the overall relation of the neighbors.
> There are definitely some tiny values there. For some reason the indices are different though.
That's because these are even smaller values than the error value you had before... that was probably just the first value that was smaller than 1e-308. If you check the (83, 2661) value in the connectivity matrix, you'll probably find it's the value depicted above. Very curious that Python can store floats smaller than your system claims possible.
I think I would go with Maren's "set to zero" suggestion. I wonder if this error might not occur on a different system... but then this is already 64-bit floats... so no idea.
That seems like a reasonable solution to me. I assume that connectivities that small won't make any meaningful difference to the scores. Maybe print a message though so you can tell it's happening.
We would then just need to change the error message in the knn_graph.cpp file, I guess. However, the error catching there is pretty generic, so it's hard to give a specific error. Maybe we just add something like: "Warning: this means that no file is generated by knn_graph.o, which will thus lead to metric errors in the LISI metric". And maybe add something about this possibly having to do with exceedingly small or large connectivities.
I was thinking you would do the check in Python: if any values are too small, set them to zero and print something like "Warning: some connectivities are too small and have been set to zero; this may affect the LISI scores". That way you should still get a score rather than an error. Or is that what you are suggesting?
As an aside it's a real pain as a user if one of the metrics fails. You can't turn off an individual metric and you can't really skip the specific scenario that has the error. It would almost be better to return an NA for any metric errors rather than failing.
That is true... we could check in Python beforehand. I was thinking of just improving the error message where it's thrown (in the C++ code).
We would just have to generally catch errors in the metrics() function. However, in that case you would not know what the error was... or we print the error but assign NA instead of raising it. You can turn off a metric, but only by going into the code...
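The "print the error but assign NA" idea could look roughly like this (a sketch; the wrapper name is mine and not part of the scIB API):

```python
import numpy as np

def run_metric_or_na(fn, *args, **kwargs):
    """Run one metric; on failure, report the error and record NaN
    instead of aborting the whole metrics() run."""
    try:
        return fn(*args, **kwargs)
    except Exception as err:  # deliberately broad: any metric failure becomes NA
        print(f"{getattr(fn, '__name__', 'metric')} failed: {err}; recording NA")
        return np.nan
```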
Is there a solution for this already? I am currently getting the same error on almost all of my scArches/scVI runs.
ls /tmp/lisi_tmp1603868666/
input.mtx
Traceback (most recent call last):
File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 259, in <module>
compute_metrics(adata_full=adata_full,latent_adata=latent_adata,metrics=metrics,cell_type_col='cell_type')
File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 185, in compute_metrics
metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(adata=latent_adata, batch_key=batch_col,
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1511, in lisi_graph
ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key,
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1453, in lisi_graph_py
simpson_estimate_batch = compute_simpson_index_graph(input_path = dir_path,
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1271, in compute_simpson_index_graph
if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603868666/_indices_0.txt'
Yes, we just merged a PR with a fix. Please update your scIB package and get back to us if the error persists.
Yes, I still get it:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1277, in compute_simpson_index_graph
if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603881346/_indices_0.txt'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 259, in <module>
compute_metrics(adata_full=adata_full,latent_adata=latent_adata,metrics=metrics,cell_type_col='cell_type')
File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 185, in compute_metrics
metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(adata=latent_adata, batch_key=batch_col,
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1522, in lisi_graph
ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key,
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1452, in lisi_graph_py
results = pool.starmap(compute_simpson_index_graph, zip(itertools.repeat(dir_path),
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603881346/_indices_0.txt'
OK, is this the full output that you get when you run the LISI metric? I assume that the error comes from the cpp code, but I do not see any error message related to it.
Yes. My code is
metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(
adata=latent_adata, batch_key=batch_col,
label_key=cell_type_col, k0=90, type_= 'embed',
subsample=0.5*100, scale=True,
multiprocessing = True, nodes = 4, verbose=False)
Thanks. Can I ask you to re-run your code with verbose=True?
/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/neighbors/__init__.py:121: FutureWarning: This location for 'distances' is deprecated. It has been moved to .obsp[distances], and will not be accesible here in a future version of anndata.
adata.uns['neighbors']['distances'] = neighbors.distances
/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/neighbors/__init__.py:122: FutureWarning: This location for 'connectivities' is deprecated. It has been moved to .obsp[connectivities], and will not be accesible here in a future version of anndata.
adata.uns['neighbors']['connectivities'] = neighbors.connectivities
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
LISI score estimation
4 processes started.
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1277, in compute_simpson_index_graph
if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603887050/_indices_3.txt'
"""
The above exception was the direct cause of the following exception:
FileNotFoundError Traceback (most recent call last)
<ipython-input-19-c215c286d839> in <module>
----> 1 metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(adata=latent_adata, batch_key='study_sample',
2 label_key='cell_type', k0=90, type_= 'embed',
3 subsample=0.5*100, scale=True,
4 multiprocessing = True,nodes = 4, verbose=True)
~/scib/scIB/metrics.py in lisi_graph(adata, batch_key, label_key, k0, type_, subsample, scale, multiprocessing, nodes, verbose)
1520
1521 #compute LISI score
-> 1522 ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key,
1523 n_neighbors = k0, perplexity=None, subsample = subsample,
1524 multiprocessing = multiprocessing, nodes = nodes, verbose=verbose)
~/scib/scIB/metrics.py in lisi_graph_py(adata, batch_key, n_neighbors, perplexity, subsample, multiprocessing, nodes, verbose)
1450
1451 #create argument list for each worker
-> 1452 results = pool.starmap(compute_simpson_index_graph, zip(itertools.repeat(dir_path),
1453 itertools.repeat(batch),
1454 itertools.repeat(n_batches),
~/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py in starmap(self, func, iterable, chunksize)
370 `func` and (a, b) becomes func(a, b).
371 '''
--> 372 return self._map_async(func, iterable, starmapstar, chunksize).get()
373
374 def starmap_async(self, func, iterable, chunksize=None, callback=None,
~/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py in get(self, timeout)
769 return self._value
770 else:
--> 771 raise self._value
772
773 def _set(self, i, obj):
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603887050/_indices_3.txt'
It looks like you are not getting any output from the C++ file, but there is no error output from that run. I am assuming that _indices_3.txt is meant to be generated by the knn_graph.cpp file. Is that correct, @mbuttner?
Could it be that /tmp/ is full, and that's why you cannot store the file in /tmp/ anymore? We have been using /localscratch/ on the Slurm cluster instead to avoid this.
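If /tmp/ ever is the culprit, redirecting the temp directory is cheap (a sketch, my addition; TMPDIR is the conventional scheduler override, and the prefix just mirrors the thread's naming):

```python
import os
import tempfile

# Respect the scheduler's scratch dir if set (e.g. TMPDIR on Slurm clusters),
# otherwise fall back to /tmp; mkdtemp gives a unique, writable directory.
scratch = os.environ.get("TMPDIR", "/tmp")
dir_path = tempfile.mkdtemp(prefix="lisi_tmp", dir=scratch)
print(dir_path)
```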
I think there is plenty of space:
df /tmp/
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 197456336 830448 196625888 1% /tmp
Okay, thanks... and it's also not write-protected or something?
I do not think so... a couple of directories were created there by my code, but they do not contain the necessary files:
ls /tmp/lisi_tmp16038*
/tmp/lisi_tmp1603841892:
input.mtx
/tmp/lisi_tmp1603841935:
input.mtx
/tmp/lisi_tmp1603842446:
input.mtx
/tmp/lisi_tmp1603842695:
input.mtx
/tmp/lisi_tmp1603843800:
input.mtx
Okay. So I think the input.mtx is generated by the Python part and is meant to be read by the C++ code to generate the other outputs. That doesn't seem to be happening. I guess you have compiled the C++ code on your machine? It's strange that you don't get any output from that part of the code... we might need to add some print statements to see where it's failing. Or just run the C++ code on the input.mtx files without the pipeline.
Trying to run from python:
cpp_file_path = '/'.join(sm.__file__.split('/')[:-1])+ '/knn_graph/knn_graph.o' #create POSIX path to file to execute compiled cpp-code
#comment: POSIX path needs to be converted to string - done below with 'as_posix()'
#create evenly split chunks if n_obs is divisible by n_chunks (doesn't really make sense on 2nd thought)
n_chunks=1
n_splits = n_chunks -1
mtx_file_path='/tmp/lisi_tmp1603887050/input.mtx'
dir_path='/tmp/lisi_tmp1603887050/'
args_int = [cpp_file_path, mtx_file_path, dir_path, str(90), str(n_splits), str(100)]
subprocess.run(args_int)
CompletedProcess(args=['/home/icb/karin.hrovatin/scib/scIB/knn_graph/knn_graph.o', '/tmp/lisi_tmp1603887050/input.mtx', '/tmp/lisi_tmp1603887050/', '90', '0', '100'], returncode=1)
Running from bash:
/home/icb/karin.hrovatin/scib/scIB/knn_graph/knn_graph.o /tmp/lisi_tmp1603887050/input.mtx /tmp/lisi_tmp1603887050/ 90 0 100
malformed matrix format, column number decreased 3 779 1.8001004e-01
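For what it's worth (my addition): capturing the subprocess output in Python would have surfaced that message without dropping to bash. A sketch, with the argument order copied from the call above and the helper name being mine:

```python
import subprocess

def run_knn_graph(binary, mtx_path, out_dir, k="90", n_splits="0", perplexity="100"):
    """Run the compiled knn_graph binary and raise with its stderr on failure
    (argument order mirrors the thread's call; parameter names are my guesses)."""
    proc = subprocess.run(
        [binary, mtx_path, out_dir, k, n_splits, perplexity],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        msg = proc.stderr.strip() or proc.stdout.strip()
        raise RuntimeError(f"knn_graph exited {proc.returncode}: {msg}")
    return proc
```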
Okay, now we have some information. This comes from here: https://github.com/theislab/scib/blob/fc65508d70e778b06a724d7cc5359f34e16c0054/scIB/knn_graph/knn_graph.cpp#L102-L106
It seems that the index order in your connectivities matrix is not contiguous in the columns. Could you check the type of adata.uns['connectivities']? Are you using a connectivities matrix that was not generated by sc.pp.neighbors?
I am using the lisi_graph function and it does all the calculation. The only thing I changed is that the connectivities are stored in .obsp, as I am using a newer version of scanpy. It works sometimes, but not always.
It would be important to check whether the .mtx file you generated has contiguous numbering in the row and column indices, and whether the row numbers are the first index or the column numbers. You may also have a duplicate index in there for some reason.
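A rough check along these lines (a sketch, my addition; it assumes the parser wants entries sorted by (row, column) with no duplicates, which is what the "column number decreased" message suggests):

```python
def mtx_indices_sorted(path):
    """Return True if the coordinate entries of a MatrixMarket file appear
    in strictly increasing (row, column) order with no duplicates."""
    with open(path) as fh:
        entries = (line for line in fh if not line.startswith('%'))
        next(entries)                      # skip the 'rows cols nnz' size line
        prev = (-1, -1)
        for line in entries:
            i, j = (int(tok) for tok in line.split()[:2])
            if (i, j) <= prev:             # out of order or duplicated
                return False
            prev = (i, j)
    return True
```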
So I think there might be a problem with my mtx file. Do you know why this could happen or how to fix it?
Okay, so rows 2-6 have monotonically increasing column indices j; after that it goes down. Those first 5 entries are the problem. I'm not sure why they are there. If you show the first 6 connectivity values in your adata.obsp['connectivities'] via adata.obsp['connectivities'].data[:6], do you see those values? I'm not sure how the .mtx file was generated here... maybe @mbuttner can say more.
> Okay, so rows 2-6 have monotonically increasing column indices j; after that it goes down. Those first 5 entries are the problem. I'm not sure why they are there. If you show the first 6 connectivity values in your adata.obsp['connectivities'] via adata.obsp['connectivities'].data[:6], do you see those values? I'm not sure how the .mtx file was generated here... maybe @mbuttner can say more.
The input.mtx matrix is a copy of the adata.obsp['connectivities'] input.
Yes, the code is as Maren wrote. I tried to run it by hand and now it worked, but I am not sure what happens sometimes that makes the matrix malformed.
So the issue is in the write function for the .mtx file rather than anywhere else. Is this something worth following up, or shall we close for now, @mbuttner?
I suggest that we add a check function that ensures the correct format of the input.mtx file.
The question is where we need to check: in adata.uns['connectivities'] or in the mtx file. It would have been good to see the first connectivity values in the AnnData object where it fails... but now it's working again ^^
It works only when I run the code by hand, line by line. If I try to execute it with the scib package, it does not work.
I had the following error occur when calculating the LISI metric. It only happens for one method (scanorama full), so it's not a general problem, but I'm guessing it's something to do with the integration output.