theislab / scib

Benchmarking analysis of data integration tools
MIT License

Error in LISI metric #178

Closed lazappi closed 4 years ago

lazappi commented 4 years ago

I had the following error occur when calculating the LISI metric:

Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
LISI score estimation
12 processes started.
​
/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/anndata/_core/anndata.py:21: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import RangeIndex
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
malformed matrix line 83 2661 9.753568422904793e-309
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1259, in compute_simpson_index_graph
    if stat(input_path + '_indices_'+ str(chunk_no) + '.txt').st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1601377709/_indices_1.txt'
"""
​
The above exception was the direct cause of the following exception:
​
Traceback (most recent call last):
  File "scripts/metrics.py", line 242, in <module>
    trajectory_=trajectory_
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1876, in metrics
    multiprocessing = True, verbose=verbose)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1501, in lisi_graph
    multiprocessing = multiprocessing, nodes = nodes, verbose=verbose)
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/metrics.py", line 1434, in lisi_graph_py
    count))
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Users/luke.zappia/miniconda3/envs/scIB-python/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lisi_tmp1601377709/_indices_1.txt'

It only happens for one method (scanorama full), so it's not a general problem, but I'm guessing it has something to do with the integration output.
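The traceback shows `stat` being called on a chunk file that was never written. As a rough sketch (not part of scIB), a pre-check like the one below could list which chunk outputs are missing before they are read. `missing_chunk_files` is a hypothetical helper, and the file naming is assumed from the paths in the traceback and the directory listings in this thread.

```python
import os
import tempfile


def missing_chunk_files(prefix, n_chunks):
    """Hypothetical helper: list expected chunk outputs that do not exist.

    The naming pattern (prefix + '_indices_<i>.txt' / '_distances_<i>.txt')
    is an assumption based on the paths shown in this thread.
    """
    kinds = ("indices", "distances")
    expected = [f"{prefix}_{kind}_{i}.txt" for i in range(n_chunks) for kind in kinds]
    return [path for path in expected if not os.path.isfile(path)]


# Toy demonstration in a throwaway directory: only one chunk file is present.
with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "")  # directory path with trailing separator
    open(prefix + "input.mtx", "w").close()
    open(prefix + "_indices_0.txt", "w").close()
    missing = [os.path.basename(p) for p in missing_chunk_files(prefix, n_chunks=2)]

print(missing)
```

Reporting the missing files up front would point at the step that writes them, rather than at the later `stat` call.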

LuckyMD commented 4 years ago

Maybe it has to do with scipy versions then?

LuckyMD commented 4 years ago

And does your adata.obsp['connectivities'].data[:10] look like the faulty .mtx file or like a correct version of it?
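The value in the "malformed matrix line" message above (9.753568422904793e-309) is smaller than the smallest normal float64, so it may be worth scanning the whole matrix rather than only the first ten entries. The following is a minimal sketch under that assumption. `find_subnormal_entries` is a hypothetical helper, and the small matrix is a toy stand-in for `adata.obsp['connectivities']`.

```python
import numpy as np
from scipy import sparse


def find_subnormal_entries(matrix):
    """Hypothetical helper: return (row, col, value) for stored entries that are
    nonzero but smaller in magnitude than the smallest normal float64."""
    coo = sparse.coo_matrix(matrix)
    tiny = np.finfo(np.float64).tiny  # smallest positive normal float64
    mask = (coo.data != 0) & (np.abs(coo.data) < tiny)
    return list(zip(coo.row[mask].tolist(), coo.col[mask].tolist(), coo.data[mask].tolist()))


# Toy stand-in for adata.obsp['connectivities'] with one suspicious pair of entries.
conn = sparse.csr_matrix(np.array([
    [0.0, 0.27, 0.0],
    [0.27, 0.0, 9.753568422904793e-309],
    [0.0, 9.753568422904793e-309, 0.0],
]))
suspicious = find_subnormal_entries(conn)
print(suspicious)
```

If this returns any entries for the real matrix, those rows and columns can be compared against the line reported in the error message.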

Hrovatin commented 4 years ago

Not scanpy. lisi_graph_py works, but lisi_graph does not. Will post a notebook shortly.

Hrovatin commented 4 years ago

Example notebook below (just realised it does not display properly, will send you another format on chat). I do not understand why lisi_graph fails but lisi_graph_py does not. To the best of my knowledge they should be the same. My lisi_graph code (same as scIB, but with some additional debug args): https://github.com/Hrovatin/scib/blob/6bf4498b336a7547b04333d9dd6e694eef159d40/scIB/metrics.py#L1481 troubleshoot_scib_lisi.pdf

Last three generated lisi directories (from the troubleshooting script) and their input.mtx files. There seems to be an error in the file generated by lisi_graph, as it has fewer entries and a malformed index.


drwxr-xr-x. 2 karin.hrovatin  OG-ICB-User  60 Oct 30 13:59 lisi_tmp1604062755
drwxr-xr-x. 2 karin.hrovatin  OG-ICB-User 220 Oct 30 13:59 lisi_tmp1604062778
drwxr-xr-x. 2 karin.hrovatin  OG-ICB-User 220 Oct 30 14:00 lisi_tmp1604062803

(rpy2_3) [karin.hrovatin@icb-mona scib]$ ls /tmp/lisi_tmp1604062755
input.mtx
(rpy2_3) [karin.hrovatin@icb-mona scib]$ ls /tmp/lisi_tmp1604062778/
_distances_0.txt  _distances_1.txt  _distances_2.txt  _distances_3.txt  _indices_0.txt  _indices_1.txt  _indices_2.txt  _indices_3.txt  input.mtx
(rpy2_3) [karin.hrovatin@icb-mona scib]$ ls /tmp/lisi_tmp1604062803
_distances_0.txt  _distances_1.txt  _distances_2.txt  _distances_3.txt  _indices_0.txt  _indices_1.txt  _indices_2.txt  _indices_3.txt  input.mtx

(rpy2_3) [karin.hrovatin@icb-mona scib]$ head /tmp/lisi_tmp1604062755/input.mtx 
%%MatrixMarket matrix coordinate real general
%
43356 43356 835990
1 11440 2.4849173e-01
1 15390 2.1628287e-01
1 16562 2.2891411e-01
1 17211 1.0053829e-01
1 17863 4.0131930e-01
1 90 2.7065918e-01
1 217 6.6888779e-01
(rpy2_3) [karin.hrovatin@icb-mona scib]$ head /tmp/lisi_tmp1604062778/input.mtx 
%%MatrixMarket matrix coordinate real general
%
43356 43356 908334
1 90 2.7065906e-01
1 217 6.6888750e-01
1 511 3.3287230e-01
1 928 8.5889757e-02
1 1084 1.5068726e-01
1 1416 1.0888591e-01
1 1833 1.8670695e-01
(rpy2_3) [karin.hrovatin@icb-mona scib]$ head /tmp/lisi_tmp1604062803/input.mtx 
%%MatrixMarket matrix coordinate real general
%
43356 43356 908334
1 90 2.7065906e-01
1 217 6.6888750e-01
1 511 3.3287230e-01
1 928 8.5889757e-02
1 1084 1.5068726e-01
1 1416 1.0888591e-01
1 1833 1.8670695e-01
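
The difference in entry counts can also be read off programmatically. This is a small sketch, assuming coordinate-format Matrix Market files like the ones shown above. `read_mtx_size` is a hypothetical helper, and the two files are recreated from the headers in this listing so the example runs on its own.

```python
import os
import tempfile


def read_mtx_size(path):
    """Hypothetical helper: return (rows, cols, entries) from the size line
    of a coordinate-format Matrix Market file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("%"):
                continue  # skip the banner and comment lines
            rows, cols, entries = (int(field) for field in line.split())
            return rows, cols, entries
    raise ValueError(f"no size line found in {path}")


# Recreate only the header lines shown in the listing above.
headers = {
    "faulty.mtx": "43356 43356 835990",
    "working.mtx": "43356 43356 908334",
}
sizes = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, size_line in headers.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "w") as handle:
            handle.write("%%MatrixMarket matrix coordinate real general\n%\n" + size_line + "\n")
        sizes[name] = read_mtx_size(path)

entry_gap = sizes["working.mtx"][2] - sizes["faulty.mtx"][2]
print(sizes, entry_gap)
```

On the headers above, the file from the first directory has 72344 fewer entries than the other two.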

Hrovatin commented 4 years ago

Found it, I think! (See adata and adata_tmp in the code below.) https://github.com/theislab/scib/blob/20b18f1f6627f16a72d25e1f08c092d901af1ccf/scIB/metrics.py#L1510


    if (type_ == 'embed'):
        adata_tmp = sc.pp.neighbors(adata,n_neighbors=15, use_rep = 'X_emb', copy=True)
    if (type_ == 'full'):
        if 'X_pca' not in adata.obsm.keys():
            sc.pp.pca(adata, svd_solver = 'arpack')
        adata_tmp = sc.pp.neighbors(adata, n_neighbors=15, copy=True)
    else:
        adata_tmp = adata.copy()
    #if knn - do not compute a new neighbourhood graph (it exists already)

    #compute LISI score
    ilisi_score = lisi_graph_py(adata = adata, batch_key = batch_key, 
                  n_neighbors = k0, perplexity=None, subsample = subsample, 
                  multiprocessing = multiprocessing, nodes = nodes, verbose=verbose)

    clisi_score = lisi_graph_py(adata = adata, batch_key = label_key, 
                  n_neighbors = k0, perplexity=None, subsample = subsample, 
                  multiprocessing = multiprocessing, nodes = nodes, verbose=verbose)

mbuttner commented 4 years ago

Thanks! Good you found it. I'll provide a fix.

LuckyMD commented 4 years ago

lisi_graph_py is the python version of lisi, right @mbuttner? We're not running that in our pipeline atm.

What is the issue here exactly? All I see is a comma missing in the ilisi_score = lisi_graph_py(adata = adata line.

mbuttner commented 4 years ago

We create an adata_tmp object, where we recompute the neighbors, but then we pass adata to the lisi_graph_py call instead.

Hrovatin commented 4 years ago

If you use type_ == 'embed', it should generate a new adata_tmp with the neighbours recomputed. But lisi_graph_py is then computed on the input adata.

(The missing comma is because I tried to put adata in bold in markdown, which does not work inside code. Will correct it.)
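
To make the problem concrete, here is a rough sketch of the intended control flow, not the actual fix. `lisi_scores_sketch`, `fake_neighbors` and `fake_lisi` are hypothetical names, and plain dicts stand in for AnnData objects so the example runs without scanpy or scIB. It also uses `elif`, because in the quoted snippet the `else` branch appears to run for 'embed' as well. That reading is an assumption and has not been confirmed in this thread.

```python
def lisi_scores_sketch(adata, type_, batch_key, label_key, neighbors_fn, lisi_fn):
    """Hypothetical helper, not the scIB API: choose the graph input once,
    then compute both scores on that same object."""
    if type_ == 'embed':
        adata_tmp = neighbors_fn(adata, use_rep='X_emb')
    elif type_ == 'full':
        adata_tmp = neighbors_fn(adata, use_rep='X_pca')
    else:
        # knn output: the neighbourhood graph already exists, so just copy
        adata_tmp = adata.copy()
    # Both calls receive adata_tmp, not the input adata
    ilisi = lisi_fn(adata_tmp, batch_key)
    clisi = lisi_fn(adata_tmp, label_key)
    return ilisi, clisi


# Stand-ins (assumptions for illustration only): dicts instead of AnnData.
def fake_neighbors(adata, use_rep):
    out = dict(adata)
    out['graph'] = 'recomputed on ' + use_rep
    return out


def fake_lisi(adata, key):
    return (key, adata['graph'])


toy = {'graph': 'from input'}
embed_scores = lisi_scores_sketch(toy, 'embed', 'batch', 'cell_type', fake_neighbors, fake_lisi)
knn_scores = lisi_scores_sketch(toy, 'knn', 'batch', 'cell_type', fake_neighbors, fake_lisi)
print(embed_scores)
print(knn_scores)
```

With the stand-ins, the 'embed' case reports the recomputed graph for both scores, while the 'knn' case keeps the graph from the input.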

LuckyMD commented 4 years ago

This is now fixed in #201. The fix is only important if you add subsetting in the metrics script, as otherwise the neighborhood graph is already computed in that script and so technically doesn't require recomputation. Thus, we don't have to rerun anything.