theislab / scib

Benchmarking analysis of data integration tools
MIT License

Error in LISI metric #178

Closed: lazappi closed this issue 4 years ago

lazappi commented 4 years ago

I had the following error occur when calculating the LISI metric:

Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
LISI score estimation
12 processes started.
​
/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/anndata/_core/anndata.py:21: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import RangeIndex
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
malformed matrix line 83 2661 9.753568422904793e-309
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1259, in compute_simpson_index_graph
    if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1601377709/_indices_1.txt'
"""
​
The above exception was the direct cause of the following exception:
​
Traceback (most recent call last):
  File "scripts/metrics.py", line 242, in <module>
    trajectory_=trajectory_
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1876, in metrics
    multiprocessing = True, verbose=verbose)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1501, in lisi_graph
    multiprocessing = multiprocessing, nodes = nodes, verbose=verbose)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1434, in lisi_graph_py
    count))
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1601377709/_indices_1.txt'

It only happens for one method (Scanorama, full features), so it's not a general problem; I'm guessing it has something to do with that integration output.

LuckyMD commented 4 years ago

So either the malformed matrix line 83 2661 9.753568422904793e-309 error occurs in the multiprocessing part that is done in C++, and the downstream Python analysis then can't find a particular file... or you just don't have any files in /tmp/lisi_tmp1601377709. Could you check whether there are any files in there?

Otherwise, @mbuttner... do you have any ideas?

lazappi commented 4 years ago

The temp directory has an input.mtx file but none of the _indices* or _distances* files that my other LISI temp directories have.

mbuttner commented 4 years ago

It seems like the knn-graph has not been computed by the cpp function. That's why the different 'chunk files' are not present.

LuckyMD commented 4 years ago

Where could the line malformed matrix line 83 2661 9.753568422904793e-309 be coming from? Not from the cpp function? And if one of the cpp function workers fails, would the others still produce output files?

mbuttner commented 4 years ago

What does the input.mtx file look like? In the LISI code, I used

 mmwrite(mtx_file_path,
         adata.uns['neighbors']['connectivities'],
         symmetry='general')

to create the input.mtx file. If it is empty (or nonsense), that might explain why the cpp function did not produce any output.
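For reference, a self-contained round trip of that call, using a random sparse matrix as a stand-in for the real connectivities, can show whether the write/read path itself is healthy:

```python
import tempfile
import numpy as np
from scipy.io import mmwrite, mmread
from scipy.sparse import random as sparse_random

# Stand-in for adata.uns['neighbors']['connectivities'].
conn = sparse_random(50, 50, density=0.05, format='csr', random_state=0)

with tempfile.NamedTemporaryFile(suffix='.mtx') as f:
    # Same call as in the LISI code above.
    mmwrite(f.name, conn, symmetry='general')
    roundtrip = mmread(f.name).tocsr()

# The written file should reproduce the matrix exactly.
assert np.allclose(conn.toarray(), roundtrip.toarray())
```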

mbuttner commented 4 years ago

where could the line malformed matrix line 83 2661 9.753568422904793e-309 be coming from? Not from the cpp function? And if one of the cpp function workers fails, would the others still produce output files?

The cpp function runs only once to compute the knn-graph on the full data matrix and splits the output (an index list) into chunks. So that has nothing to do with the different workers.

The malformed matrix line error (I haven't seen it yet), might be caused by the 'file not found error'. That's why I suspect the error to be caused by an erroneous input to the cpp function.

mbuttner commented 4 years ago

What is your scanpy version? Is the nearest neighbor graph still located at adata.uns['neighbors']['connectivities'] or do you have an .obsp object instead?

LuckyMD commented 4 years ago

I would presume that the cpp run failed due to the malformed matrix line error and therefore produced no output (if it's a single job, one error means no files are generated), which is why the downstream error arose: no input file was available for the rest of the calculation. But of course... I have no idea here 😄

lazappi commented 4 years ago

What does the input.mtx file look like?

It looks OK as far as I can tell:

>>> mat = mmread('input.mtx')
>>> mat
<4314x4314 sparse matrix of type '<class 'numpy.float64'>'
    with 86450 stored elements in COOrdinate format>
>>> mat_dense = mat.todense()
>>> mat_dense
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
>>> scipy.stats.describe(mat_dense)
DescribeResult(nobs=4314, minmax=(array([0., 0., 0., ..., 0., 0., 0.]), array([1., 1., 1., ..., 1., 1., 1.])), mean=array([0.00149025, 0.00100387, 0.00126313, ..., 0.00111166, 0.00090563,
       0.00126314]), variance=array([0.00086184, 0.0005056 , 0.00051443, ..., 0.00070479, 0.00040159,
       0.00078288]), skewness=array([28.81485849, 30.74991031, 27.84944371, ..., 31.60309792,
       35.09651265, 28.52167416]), kurtosis=array([ 938.79817136, 1120.46076981, 1009.07881687, ..., 1081.65372246,
       1546.08669569,  900.12986143]))
lazappi commented 4 years ago

What is your scanpy version? Is the nearest neighbor graph still located at adata.uns['neighbors']['connectivities'] or do you have an .obsp object instead?

I have scanpy v1.4.6 and anndata v0.7.1. It's only a problem for scanorama in a couple of scenarios, so I think it's something to do with that integration output, not something general.

mbuttner commented 4 years ago

OK, but it puzzles me that you don't have the _indices* and _distances* files in your tmp directory. Have they been removed automatically, or were they never created in the first place?

mbuttner commented 4 years ago

Coming back to the malformed matrix error: I also assume that a distance of 9e-309 is too small to be represented as a double in C++.
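As a side note on representability: IEEE-754 doubles actually extend below ±2.2e-308 via subnormal values (down to about 5e-324, with reduced precision), so a value like 9.75e-309 is storable; whether the C++ reader accepts it is a separate question. A quick sketch in Python, whose floats are the same 64-bit doubles:

```python
import sys

tiny = 9.753568422904793e-309        # the value from the error message

# Smallest *normalized* double:
assert sys.float_info.min == 2.2250738585072014e-308

# tiny is below that but still nonzero: it is stored as a subnormal.
assert 0.0 < tiny < sys.float_info.min

# The smallest subnormal is 2**-1074; halving it underflows to zero.
assert 2.0 ** -1074 > 0.0
assert (2.0 ** -1074) / 2 == 0.0
```

So the failure is more likely in how knn_graph.cpp parses or validates the value than in the double type itself, though that depends on the reader implementation.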

LuckyMD commented 4 years ago

The error is hardcoded from the C++ part:

https://github.com/theislab/scib/blob/47da3bf86ca9f9b8d9613230bd967cccc25b61df/scIB/knn_graph/knn_graph.cpp#L90-L105

Any idea what the error refers to?

LuckyMD commented 4 years ago

Okay, so weight is a C++ 64-bit double, which has limits of -1.7E+308 to +1.7E+308 and about 16 decimal digits. The e-309 unfortunately doesn't fit into that (at least not as a normalized value). It seems that scanorama uses a more precise method internally for its calculations. I would probably just look for the value, set it to 0, and rerun.

The question is... should we catch this in the code or not? If we look for this error in the LISI C++ code, we will slow down the metric further, which doesn't seem like a good idea. We could look for this in the Python code though. The real question is: how does Python store this value? Could you look for the offending value, @lazappi, and check how Python stores it internally? At least we know it should be the value in adata.X[adata.obs_names[83], adata.var_names[2661]].

lazappi commented 4 years ago

I can't get the indexing you suggested to work, not sure what is going on here:

>>> adata.X[adata.obs_names[83],adata.var_names[2661]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 33, in __getitem__
    row, col = self._validate_indices(key)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 137, in _validate_indices
    row = self._asindices(row, M)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 161, in _asindices
    raise IndexError('Index dimension must be <= 2')
IndexError: Index dimension must be <= 2
>>> adata.obs_names[83]
'human3_lib3.final_cell_0803-0'
>>> adata.var_names[2661]
'GAS8'
>>> adata.X['human3_lib3.final_cell_0803-0', 'GAS8']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 33, in __getitem__
    row, col = self._validate_indices(key)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 137, in _validate_indices
    row = self._asindices(row, M)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scipy/sparse/_index.py", line 161, in _asindices
    raise IndexError('Index dimension must be <= 2')
IndexError: Index dimension must be <= 2

Using numbers for indexing doesn't show anything interesting:

>>> print(adata.X[80:85, 2658:2663])
  (0, 0)    -0.002799201
  (0, 1)    -0.007948541
  (0, 2)    0.001190949
  (0, 3)    0.0025276951
  (0, 4)    -0.00058060087
  (1, 0)    -0.0064782263
  (1, 1)    -0.00036682197
  (1, 2)    -0.004223289
  (1, 3)    0.014151698
  (1, 4)    -0.0005493724
  (2, 0)    -0.0060044294
  (2, 1)    -0.0008786302
  (2, 2)    -0.0067326566
  (2, 3)    0.024539318
  (2, 4)    -0.0005234006
  (3, 0)    -0.0020147967
  (3, 1)    0.02790912
  (3, 2)    0.0010080308
  (3, 3)    0.0024890346
  (3, 4)    -0.00047112707
  (4, 0)    0.014065156
  (4, 1)    0.011179721
  (4, 2)    0.0041127596
  (4, 3)    0.0038861749
  (4, 4)    -0.000660761

Looking for the smallest absolute value in X:

>>> x = adata.X.toarray()
>>> x_abs = np.absolute(x)
>>> x_abs.argmin()
22979577
>>> x.flatten()[22979577]
-4.0687386e-13

I tried to run lisi_graph() manually to verify the error but I couldn't quite get it working.

LuckyMD commented 4 years ago

The indexing error is just numpy being numpy. I forgot you'd have to index separately, and before the .X, via adata[adata.obs_names[83], :][:, adata.var_names[2661]].X. Otherwise, using the numbers would have been better, so adata.X[83, 2661] should work I guess.
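For reference, the two indexing styles behave like this on a plain scipy sparse matrix (a stand-in for adata.X; the label-based form only works on the AnnData object itself):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for adata.X.
X = csr_matrix(np.arange(12, dtype=float).reshape(3, 4))

# Integer indexing works directly on the sparse matrix:
assert X[2, 3] == 11.0

# String labels do not: scipy knows nothing about obs/var names, which is
# what produces the "Index dimension must be <= 2" error above. Labels
# have to go through the AnnData wrapper instead, e.g. (hypothetical):
#   adata[adata.obs_names[83], :][:, adata.var_names[2661]].X
```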

mbuttner commented 4 years ago

Can you also check for the smallest value in adata.uns['neighbors']['connectivities']?

LuckyMD commented 4 years ago

It should then be (3, 3) 0.0024890346 in your list... Ah, I just saw @mbuttner's comment... these are connectivity values, not values in adata.X.

lazappi commented 4 years ago

The neighbours aren't stored in the object so I had to compute them. I tried to do that the same way as the script, but it might be slightly different.

>>> scIB.preprocessing.reduce_data(adata_int, n_top_genes=None, neighbors=True, use_rep='X_pca', pca=True, umap=False)

Finding the smallest non-zero connectivity values:

>>> x = adata_int.uns['neighbors']['connectivities'].toarray()
>>> i, j = np.where(x==np.min(x[np.nonzero(x)]))
>>> x[i, j]
array([2.65916e-318, 2.65916e-318])

There are definitely some tiny values there. For some reason the indices are different though.

>>> i
array([2851, 3974])
>>> j
array([3974, 2851])

As far as I can tell these are stored as Python floats (but I'm not really sure how to check this).

>>> x_list = x[i, j].tolist()
>>> x_list
[2.65916e-318, 2.65916e-318]
>>> type(x_list[0])
<class 'float'>

Float info for my system (from https://stackoverflow.com/questions/1835787/what-is-the-range-of-values-a-float-can-have-in-python):

>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
# Denormalized min (with loss of precision)
>>> sys.float_info.min * sys.float_info.epsilon
5e-324
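To answer the "how does Python store this value internally" question directly: the bit pattern can be inspected with the standard library. A sketch, using the value found above:

```python
import struct
import sys

v = 2.65916e-318   # smallest value found in the connectivities matrix

# Pack as an IEEE-754 double and inspect the raw bits.
bits = struct.unpack('<Q', struct.pack('<d', v))[0]
exponent = (bits >> 52) & 0x7FF
mantissa = bits & ((1 << 52) - 1)

# Zero exponent field + nonzero mantissa = subnormal: stored with reduced
# precision, below sys.float_info.min but still above zero.
assert exponent == 0 and mantissa != 0
assert 0.0 < v < sys.float_info.min
```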
mbuttner commented 4 years ago

Hmm. Interesting. When you use scanorama, you obtain two different outputs (an embedding, stored in X_emb, and the corrected data matrix). I assume that when you compute the neighbors based on X_pca, the distances are different from the ones you get using X_emb.

mbuttner commented 4 years ago

Otherwise, the distances are pretty low, so we could write a function that shifts all values below 1e-308 to that minimum in the Python script before we run the C++ code. I think this shouldn't take much time/memory, but it might distort the neighborhood graph (especially if there are several of these small values). Another option would be to multiply all values by a sufficiently large constant so that we keep the overall relation of the neighbors.
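A sketch of the first option, zeroing values below the normalized double minimum before input.mtx is written; the function name and the exact threshold are illustrative, not scIB code:

```python
import sys
import numpy as np
from scipy.sparse import csr_matrix

def drop_tiny_connectivities(conn, threshold=sys.float_info.min):
    """Zero out subnormal entries in a sparse connectivities matrix
    (sketch of the proposed fix; name and threshold are illustrative)."""
    conn = conn.copy()
    tiny = np.abs(conn.data) < threshold
    if tiny.any():
        print(f"Warning: {int(tiny.sum())} connectivities below "
              f"{threshold:g} set to 0")
    conn.data[tiny] = 0.0
    conn.eliminate_zeros()   # keep the sparse structure consistent
    return conn

# Example with one subnormal entry:
m = csr_matrix(np.array([[0.0, 0.5], [2.65916e-318, 0.0]]))
cleaned = drop_tiny_connectivities(m)
assert cleaned.nnz == 1 and cleaned[0, 1] == 0.5
```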

LuckyMD commented 4 years ago

There are definitely some tiny values there. For some reason the indices are different though.

That's because these are even smaller values than the error value you had before... that was probably just the first value smaller than 1e-308. If you check the (83, 2661) value in the connectivity matrix, you'll probably find it's the value depicted above. Very curious that Python can store floats smaller than your system claims possible.

I think I would go with Maren's "set to zero" suggestion. I wonder if this error might not occur on a different system... but then this is already 64-bit floats... so no idea.

lazappi commented 4 years ago

That seems like a reasonable solution to me. I assume that connectivities that small won't make any meaningful difference to the scores. Maybe print a message though so you can tell it's happening.

LuckyMD commented 4 years ago

We would then just need to change the error message in the knn_graph.cpp file, I guess. However, the error catching there is pretty generic, so it's hard to give a specific error. Maybe we just add something like "Warning: this means that no file is generated by knn_graph.o, which will lead to errors in the LISI metric", and maybe add something about how this might be due to exceedingly small or large connectivities.

lazappi commented 4 years ago

I was thinking you would do the check in Python and, if any values are too small, set them to zero and print something like "Warning: some connectivities are too small and have been set to zero, this may affect the LISI scores". That way you still get a score rather than an error. Or is that what you are suggesting?

As an aside, it's a real pain as a user if one of the metrics fails. You can't turn off an individual metric and you can't really skip the specific scenario that has the error. It would almost be better to return an NA for any metric errors rather than failing.
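A minimal sketch of that behaviour, assuming a wrapper around each metric call (names are hypothetical, not the scIB API):

```python
import numpy as np

def safe_metric(fn, *args, **kwargs):
    """Run one metric; on failure, print the error and return NaN instead
    of aborting the whole run (sketch of the behaviour discussed above)."""
    try:
        return fn(*args, **kwargs)
    except Exception as e:          # deliberately broad here
        print(f"{fn.__name__} failed: {e}")
        return np.nan

def broken_lisi():                  # stand-in for a failing metric
    raise FileNotFoundError("/tmp/lisi_tmp/_indices_1.txt")

assert np.isnan(safe_metric(broken_lisi))
```

The printed error keeps the failure visible while the remaining metrics still run.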

LuckyMD commented 4 years ago

That is true... we could check in Python beforehand. I was thinking of just improving the error message where it's thrown (in the C++ code).

We would just have to generally catch errors in the metrics() function. However, in that case you would not know what the error was... or we print the error but assign NA instead of raising it. You can turn off a metric, but only by going into the code...

Hrovatin commented 4 years ago

Is there a solution for this already? I am currently getting the same error on almost all of my scArches/scVI runs.

ls /tmp/lisi_tmp1603868666/
input.mtx

Traceback (most recent call last):
  File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 259, in <module>
    compute_metrics(adata_full=adata_full,latent_adata=latent_adata,metrics=metrics,cell_type_col='cell_type')
  File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 185, in compute_metrics
    metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(adata=latent_adata, batch_key=batch_col, 
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1511, in lisi_graph
    ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key, 
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1453, in lisi_graph_py
    simpson_estimate_batch = compute_simpson_index_graph(input_path = dir_path, 
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1271, in compute_simpson_index_graph
    if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603868666/_indices_0.txt'
mbuttner commented 4 years ago

Yes, we just merged a PR with a fix. Please update your scIB package and get back to us if the error persists.

Hrovatin commented 4 years ago

Yes, I still get it:


multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1277, in compute_simpson_index_graph
    if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603881346/_indices_0.txt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 259, in <module>
    compute_metrics(adata_full=adata_full,latent_adata=latent_adata,metrics=metrics,cell_type_col='cell_type')
  File "/storage/groups/ml01/workspace/karin.hrovatin//code/diabetes_analysis/integration/2_evaluation_script.py", line 185, in compute_metrics
    metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(adata=latent_adata, batch_key=batch_col, 
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1522, in lisi_graph
    ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key, 
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1452, in lisi_graph_py
    results = pool.starmap(compute_simpson_index_graph, zip(itertools.repeat(dir_path),
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603881346/_indices_0.txt'
mbuttner commented 4 years ago

OK, is this the full output that you get when you run the LISI metric? I assume that the error comes from the cpp-code, but I do not see any error message related to it.

Hrovatin commented 4 years ago

Yes. My code is


metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(
                                             adata=latent_adata, batch_key=batch_col, 
                                             label_key=cell_type_col, k0=90, type_= 'embed',
                                             subsample=0.5*100, scale=True,
                                             multiprocessing = True, nodes = 4, verbose=False)
mbuttner commented 4 years ago

Thanks. Can I ask you to re-run your code with verbose=True?

Hrovatin commented 4 years ago
/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/neighbors/__init__.py:121: FutureWarning: This location for 'distances' is deprecated. It has been moved to .obsp[distances], and will not be accesible here in a future version of anndata.
  adata.uns['neighbors']['distances'] = neighbors.distances
/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/scanpy/neighbors/__init__.py:122: FutureWarning: This location for 'connectivities' is deprecated. It has been moved to .obsp[connectivities], and will not be accesible here in a future version of anndata.
  adata.uns['neighbors']['connectivities'] = neighbors.connectivities

using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
LISI score estimation
4 processes started.

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/icb/karin.hrovatin/scib/scIB/metrics.py", line 1277, in compute_simpson_index_graph
    if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603887050/_indices_3.txt'
"""

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-c215c286d839> in <module>
----> 1 metrics['graph_iLISI'], metrics['graph_cLISI'] = sm.lisi_graph(adata=latent_adata, batch_key='study_sample', 
      2                                                                    label_key='cell_type', k0=90, type_= 'embed',
      3                                       subsample=0.5*100, scale=True,
      4                                             multiprocessing = True,nodes = 4, verbose=True)

~/scib/scIB/metrics.py in lisi_graph(adata, batch_key, label_key, k0, type_, subsample, scale, multiprocessing, nodes, verbose)
   1520 
   1521     #compute LISI score
-> 1522     ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key, 
   1523                   n_neighbors = k0, perplexity=None, subsample = subsample,
   1524                   multiprocessing = multiprocessing, nodes = nodes, verbose=verbose)

~/scib/scIB/metrics.py in lisi_graph_py(adata, batch_key, n_neighbors, perplexity, subsample, multiprocessing, nodes, verbose)
   1450 
   1451         #create argument list for each worker
-> 1452         results = pool.starmap(compute_simpson_index_graph, zip(itertools.repeat(dir_path),
   1453                                                                 itertools.repeat(batch),
   1454                                                                 itertools.repeat(n_batches),

~/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py in starmap(self, func, iterable, chunksize)
    370         `func` and (a, b) becomes func(a, b).
    371         '''
--> 372         return self._map_async(func, iterable, starmapstar, chunksize).get()
    373 
    374     def starmap_async(self, func, iterable, chunksize=None, callback=None,

~/miniconda3/envs/rpy2_3/lib/python3.8/multiprocessing/pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1603887050/_indices_3.txt'
LuckyMD commented 4 years ago

It looks like you are not getting an output from the C++ file, but there is no error output of that run. I am assuming that _indices_3.txt is meant to be generated by the knn_graph.cpp file. Is that correct, @mbuttner?

Could it be that /tmp/ is full and therefore you cannot store the file in /tmp/ anymore? We have been using /localscratch/ on the slurm cluster instead to avoid this.

Hrovatin commented 4 years ago

I think there is plenty space


df /tmp/
Filesystem     1K-blocks   Used Available Use% Mounted on
tmpfs          197456336 830448 196625888   1% /tmp
LuckyMD commented 4 years ago

Okay, thanks... and it's also not write-protected or something?

Hrovatin commented 4 years ago

I do not think so... a couple of directories were created there by my code, but they do not contain the necessary files:


ls /tmp/lisi_tmp16038*
/tmp/lisi_tmp1603841892:
input.mtx

/tmp/lisi_tmp1603841935:
input.mtx

/tmp/lisi_tmp1603842446:
input.mtx

/tmp/lisi_tmp1603842695:
input.mtx

/tmp/lisi_tmp1603843800:
input.mtx
LuckyMD commented 4 years ago

Okay. So I think the input.mtx is generated by the Python part and is meant to be read by the C++ code to generate the other outputs. That doesn't seem to be happening. I guess you have compiled the C++ code on your machine? It's strange that you don't get any output from that part of the code... we might need to add some print statements to see where it's failing, or just run the C++ code on the input.mtx files outside the pipeline.

Hrovatin commented 4 years ago

Trying to run from python:

import subprocess
import scIB.metrics as sm  # sm as used earlier in this thread

# Path to the compiled cpp executable
cpp_file_path = '/'.join(sm.__file__.split('/')[:-1]) + '/knn_graph/knn_graph.o'
n_chunks = 1
n_splits = n_chunks - 1
mtx_file_path = '/tmp/lisi_tmp1603887050/input.mtx'
dir_path = '/tmp/lisi_tmp1603887050/'
args_int = [cpp_file_path, mtx_file_path, dir_path, str(90), str(n_splits), str(100)]
subprocess.run(args_int)

CompletedProcess(args=['/home/icb/karin.hrovatin/scib/scIB/knn_graph/knn_graph.o', '/tmp/lisi_tmp1603887050/input.mtx', '/tmp/lisi_tmp1603887050/', '90', '0', '100'], returncode=1)

Running from bash:


 /home/icb/karin.hrovatin/scib/scIB/knn_graph/knn_graph.o /tmp/lisi_tmp1603887050/input.mtx /tmp/lisi_tmp1603887050/ 90  0 100
malformed matrix format, column number decreased 3 779 1.8001004e-01
LuckyMD commented 4 years ago

Okay, now we have some information. This comes from here: https://github.com/theislab/scib/blob/fc65508d70e778b06a724d7cc5359f34e16c0054/scIB/knn_graph/knn_graph.cpp#L102-L106

It seems that the index order in your connectivities matrix is not contiguous in the columns. Could you check the type of adata.obsp['connectivities']? Are you using a connectivities matrix that was not generated by sc.pp.neighbors?

Hrovatin commented 4 years ago

I am using the lisi_graph function and it does all the calculation. The only thing I changed is that the connectivities are stored in .obsp, as I am using a newer version of scanpy. It works sometimes, but not always.

LuckyMD commented 4 years ago

It would be important to check whether the .mtx file you generated has contiguous numbering in the row and column indices, and whether the row numbers or the column numbers come first. You may also have a duplicate index in there for some reason.
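A sketch of such a check, based on the "column number decreased" error text (the exact ordering knn_graph.cpp expects is an assumption):

```python
import tempfile

def check_mtx_column_order(path):
    """Return None if column indices never decrease, else a message.

    Sketch of a validator suggested by the 'column number decreased'
    error in knn_graph.cpp; which index is the column is an assumption.
    """
    prev_col = -1
    with open(path) as f:
        for line in f:
            if not line.startswith('%'):
                break               # first non-comment line is the size line
        for n, line in enumerate(f, start=1):
            row, col = map(int, line.split()[:2])
            if col < prev_col:
                return f"column number decreased at entry {n}: {row} {col}"
            prev_col = col
    return None

# Well-formed: column indices 1, 1, 2 never decrease.
good = ("%%MatrixMarket matrix coordinate real general\n"
        "3 3 3\n1 1 0.5\n2 1 0.5\n1 2 0.3\n")
with tempfile.NamedTemporaryFile('w', suffix='.mtx', delete=False) as f:
    f.write(good)
assert check_mtx_column_order(f.name) is None

# Malformed: column index drops from 2 back to 1.
bad = ("%%MatrixMarket matrix coordinate real general\n"
       "3 3 2\n1 2 0.5\n2 1 0.5\n")
with tempfile.NamedTemporaryFile('w', suffix='.mtx', delete=False) as f2:
    f2.write(bad)
assert check_mtx_column_order(f2.name) is not None
```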

Hrovatin commented 4 years ago

So I think there might be a problem with my mtx file. Do you know why this could happen or how to fix it?

[screenshot of the first lines of input.mtx]
LuckyMD commented 4 years ago

Okay, so rows 2-6 have monotonically increasing column indices j, after that it goes down. Those first 5 numbers are the problem. I'm not sure why they are there. If you show the first 6 connectivities values in your adata.obsp['connectivities'] via adata.obsp['connectivities'].data[:6] do you see those values? I'm not sure how the .mtx file was generated here... maybe @mbuttner can say more.

mbuttner commented 4 years ago

Okay, so rows 2-6 have monotonically increasing column indices j, after that it goes down. Those first 5 numbers are the problem. I'm not sure why they are there. If you show the first 6 connectivities values in your adata.obsp['connectivities'] via adata.obsp['connectivities'].data[:6] do you see those values? I'm not sure how the .mtx file was generated here... maybe @mbuttner can say more.

The input.mtx matrix is a copy of the adata.obsp['connectivities'] input.

Hrovatin commented 4 years ago

Yes, the code is as Maren wrote. I tried to run it by hand and now it worked, but I am not sure what happens that the matrix sometimes gets malformed.

LuckyMD commented 4 years ago

So the issue is in the write function for the .mtx file rather than anywhere else. Is this worth following up, or shall we close for now, @mbuttner?

mbuttner commented 4 years ago

I suggest that we add a check function that ensures the correct format of the input.mtx file.

LuckyMD commented 4 years ago

The question is where we need to check: in adata.obsp['connectivities'] or in the mtx file. It would have been good to see the first connectivities values in the anndata object where it fails... but now it's working again ^^

Hrovatin commented 4 years ago

It only works when I run the code by hand, line by line. If I try to execute it with the scib package, it does not work.