swolock / scrublet

Detect doublets in single-cell RNA-seq data
MIT License

scrub_doublets still completes without annoy, nearest neighbor search #34

Open RoganGrant opened 2 years ago

RoganGrant commented 2 years ago

First of all, thanks for scrublet! I have been using it for a while now, and much prefer it over the alternatives.

As for the issue, it's a bit niche but can potentially cause serious silent issues on an HPC. Even if annoy is installed, loading can fail if a semi-recent version of gcc is not currently in the user's path. For HPC users, this would generally require loading a GCC module. In my case, module load gcc/11.2.0 restores the missing library and solves the issue.

Minimal code to reproduce the issue:

import scrublet as scr
import scipy.io
import os
import gzip
import pandas as pd

counts_matrix = scipy.io.mmread(gzip.open("path/matrix.mtx.gz")).T.tocsc()
scrub = scr.Scrublet(counts_matrix, expected_doublet_rate = 0.1)
doublet_scores, doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85, n_prin_comps=30)

And the behavior:

Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Could not find library "annoy" for approx. nearest neighbor search
Automatically set threshold at doublet score = 0.62
Detected doublet rate = 0.4%
Estimated detectable doublet fraction = 4.5%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 8.1%
Elapsed time: 10.9 seconds

In this case, the doublet rate is still estimated, but apparently without finding nearest neighbors for the simulated doublets. Or perhaps another method is used? Either way, it would be worth raising a stronger warning or even failing in this case. If the analysis is automated, messages like this may be missed entirely.
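
Until something like that exists upstream, an automated pipeline can fail early on its own. The following is a minimal sketch, not part of scrublet: `require_module` is a hypothetical helper name, and it only checks that the module can be imported in the current environment before the analysis starts.

```python
import importlib


def require_module(name):
    """Import `name` and return it, or raise RuntimeError naming the cause.

    Hypothetical helper (not part of scrublet). Catching ImportError also
    covers the case where the package is installed but a shared library it
    links against cannot be loaded.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"Required module {name!r} could not be loaded: {exc}"
        ) from exc
```

Calling `require_module("annoy")` before `scrub.scrub_doublets(...)` would turn the easily missed log line into a hard failure that a batch job cannot overlook.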

RoganGrant commented 2 years ago

And for clarity, this is what happens if I import annoy without a gcc module loaded, even though annoy is installed in my virtual environment:

import annoy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "path/venv/lib/python3.9/site-packages/annoy/__init__.py", line 16, in <module>
    from .annoylib import Annoy as AnnoyIndex
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by path/venv/lib/python3.9/site-packages/annoy/annoylib.cpython-39-x86_64-linux-gnu.so)
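
To check which `libstdc++` copies on a system provide the version tag named in that error, a rough heuristic is to search the library file for the tag, much like running `strings` on it. The sketch below assumes the tag is stored as a null-terminated string inside the file. `library_defines_version` is a hypothetical helper name, and the path in the example is the one from the traceback above.

```python
from pathlib import Path


def library_defines_version(lib_path, version_tag):
    """Return True if the file at `lib_path` contains `version_tag`.

    Heuristic only: searches the raw bytes for the tag followed by a null
    byte, so that e.g. "CXXABI_1.3.1" does not match "CXXABI_1.3.10".
    """
    data = Path(lib_path).read_bytes()
    return version_tag.encode("ascii") + b"\0" in data


if __name__ == "__main__":
    lib = "/lib64/libstdc++.so.6"  # path taken from the ImportError above
    if Path(lib).exists():
        found = library_defines_version(lib, "CXXABI_1.3.9")
        print(f"{lib} provides CXXABI_1.3.9: {found}")
```

Running the same check against the `libstdc++` shipped with a loaded GCC module or a conda environment shows whether that copy would satisfy the import.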

Lindsey8383 commented 1 year ago

You can run scrublet in your own conda environment and reference that environment's lib path rather than the HPC's by setting the following (replace with the appropriate directory):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/miniconda3/lib