theislab / scib

Benchmarking analysis of data integration tools
MIT License
294 stars 63 forks source link

Issue with metrics function #253

Closed axelalmet closed 3 years ago

axelalmet commented 3 years ago

Hello,

Firstly, thank you for creating this package and the accompanying benchmarking paper. The paper has been an invaluable resource for myself and colleagues over the past year!

I have been trying to make use of scIB.me.metrics for my own work, which involves multiple RNA-seq datasets of murine tissue. The command I have been running is:

results = scIB.me.metrics(wound_merged, wound_merged_int_hvg, verbose=True,
                              hvg_score_=False, cluster_nmi='integrated_skin_cluster_nmi.txt',
                              batch_key='sample', label_key='leiden',
                              silhouette_=True, embed='X_emb',
                              type_ = 'embed', 
                              nmi_=True, nmi_method='arithmetic', nmi_dir=None,
                              ari_=True,
                              pcr_=True,
                              cell_cycle_=True, organism='mouse',
                              isolated_labels_=True, n_isolated=None,
                              graph_conn_=True,
                              kBET_=True,
                              #lisi_=lisi_,
                              lisi_graph_= True,
                              trajectory_=False
                             )

Everything seems to work fine until the final metric, which invokes lisi_graph, at which point I get the following error:

Traceback (most recent call last):
  File "runintegrationmetrics.py", line 60, in <module>
    results = scIB.me.metrics(wound_merged_full_uncorrected, wound_merged_full_hvg, verbose=True,
  File "/usr/local/lib/python3.8/site-packages/scIB/metrics.py", line 1956, in metrics
    ilisi_raw, clisi_raw = lisi_graph(adata_int, batch_key=batch_key, label_key=label_key,
  File "/usr/local/lib/python3.8/site-packages/scIB/metrics.py", line 1530, in lisi_graph
    ilisi_score = lisi_graph_py(adata = adata_tmp, batch_key = batch_key, 
  File "/usr/local/lib/python3.8/site-packages/scIB/metrics.py", line 1445, in lisi_graph_py
    subprocess.run(args_int)
  File "/usr/local/Cellar/python@3.8/3.8.11/Frameworks/Python.framework/Versions/3.8/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/Cellar/python@3.8/3.8.11/Frameworks/Python.framework/Versions/3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/Cellar/python@3.8/3.8.11/Frameworks/Python.framework/Versions/3.8/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: '/usr/local/lib/python3.8/site-packages/scIB/knn_graph/knn_graph.o'

If you have any advice on how to resolve this error, I would greatly appreciate it!

In addition, I have some additional questions about other functions called during scIB.me.metrics:

  1. I understand NMI compares clustering via the Louvain algorithm. Does it make a difference if I originally clustered my data using the Leiden algorithm?
  2. In the function lisi_graph, for the following command:
        adata_tmp = sc.pp.neighbors(adata,n_neighbors=15, use_rep = 'X_emb', copy=True)

    why is n_neighbors=15? Does it affect results if I calculated a kNN graph for my own data using n_neighbors=30'? Also, shoulduse_repnot accept the argumentembed, as accepted in themetrics` function?

Thank you very much for your time and help.

Best wishes, Axel.

LuckyMD commented 3 years ago

Hi @axelalmet,

I'm glad that the package and paper have been useful to you so far.

If you have any advice on how to resolve this error, I would greatly appreciate it!

The lisi error looks like it has to do with the C++ file we have added to speed up this metric. To make this work, you will need to compile the knn_graph.cpp file manually by going to the relevant directory (in your case: /usr/local/lib/python3.8/site-packages/scIB/knn_graph/) and compiling it with the Makefilie that is in the same directory, or just typing in:

g++ -std=c++11 -O3 knn_graph.cpp -o knn_graph.o

This should really be in the installation instructions on the main github README. Is this something you could briefly add @mumichae ?

  1. I understand NMI compares clustering via the Louvain algorithm. Does it make a difference if I originally clustered my data using the Leiden algorithm?

This shouldn't matter. The idea here is that your annotation is regarded as ground truth. Whatever method you used to cluster is only seen as a proxy for how to get to this ground-truth annotation.

why is n_neighbors=15? Does it affect results if I calculated a kNN graph for my own data using n_neighbors=30'? Also, should use_rep not accept the argument embed, as accepted in the metrics` function?

It shouldn't matter much if n_neighbors is 15 or 30. This function preps the knn_graph.o call from above which always gets 45 (i think) neighbours out via a graph-based shortest paths algorithm. If you seed with 30 that might be better, but 15 will give you similar results. In the end the difference will probably be marginal.

And you're right, the function would be more flexibly integrated if use_rep take the embed argument from metrics. That would probably require it to be passed to lisi_graph and the lisi functions as well... We will take a look at this and address it when we have scope. If you're interested in contributing a PR, please don't hesitate though :).

axelalmet commented 3 years ago

Hi @LuckyMD,

Thank you so much for your advice, this really made things a lot clearer!

Your suggestion to compile knn_graph.cpp worked and now the code seems to run. I now run into a different issue, which is that lisi_graph causes my laptop to crash. I also tried running the code on my work desktop, which is a much more powerful machine, but it also crashed on that. Mind you, the data consist of just over 100K cells, but is there a way to allow lisi_graph to be run on a local machine?

Thanks again, Axel.

LuckyMD commented 3 years ago

lisi_graph and kBet are by far the two most expensive metrics to run. We ran this on 100k cells on a cluster with 384GB memory. I would maybe just avoid these 3 metrics (cLISI and iLISI). We decided against subsampling in the metric implementation as we didn't know how it would behave...

axelalmet commented 3 years ago

Given that, I'm pretty glad I managed to get kBET to work. I have actually tried running the code on a lab server, but it remains to be seen how that goes.

Thank you very much for all your help!

LuckyMD commented 3 years ago

Good luck! I will close this for now. If anything comes up, let's discuss on a new thread or just re-open it if it's regarding the lisi_graph installation.