Closed - smoe closed this issue 4 months ago
I can reproduce the error now locally with conda:
R[write to console]: --------------------------------------------------
R[write to console]: Mon Jul 1 22:56:54 2024 .. Analyzing all cells
R[write to console]: Mon Jul 1 22:56:54 2024 .... Generating UMAP and estimating cell types
R[write to console]: Error in .local(x, ...) : size factors should be positive
---------------------------------------------------------------------------
RRuntimeError Traceback (most recent call last)
File /data/sm718/IBIMA/Antifibrotix_SingleCell_Culture/nextflow_scdownstream/work/cc/57b6f68e5ee01374bbf3366910a36b/.command.sh:37
34 if len(adata.obs['batch'].unique()) > 1:
35 kwargs['batch'] = adata.obs['batch'].tolist()
---> 37 corrected = celda.decontX(sce, **kwargs)
38 counts = celda.decontXcounts(corrected)
40 adata.layers['ambient'] = anndata2ri.rpy2py(counts).T
File ~/miniconda3/lib/python3.12/site-packages/rpy2/robjects/functions.py:208, in SignatureTranslatedFunction.__call__(self, *args, **kwargs)
206 v = kwargs.pop(k)
207 kwargs[r_k] = v
--> 208 return (super(SignatureTranslatedFunction, self)
209 .__call__(*args, **kwargs))
File ~/miniconda3/lib/python3.12/site-packages/rpy2/robjects/functions.py:131, in Function.__call__(self, *args, **kwargs)
129 else:
130 new_kwargs[k] = cv.py2rpy(v)
--> 131 res = super(Function, self).__call__(*new_args, **new_kwargs)
132 res = cv.rpy2py(res)
133 return res
File ~/miniconda3/lib/python3.12/site-packages/rpy2/rinterface_lib/conversion.py:45, in _cdata_res_to_rinterface.<locals>._(*args, **kwargs)
44 def _(*args, **kwargs):
---> 45 cdata = function(*args, **kwargs)
46 # TODO: test cdata is of the expected CType
47 return _cdata_to_rinterface(cdata)
File ~/miniconda3/lib/python3.12/site-packages/rpy2/rinterface.py:817, in SexpClosure.__call__(self, *args, **kwargs)
810 res = rmemory.protect(
811 openrlib.rlib.R_tryEval(
812 call_r,
813 call_context.__sexp__._cdata,
814 error_occured)
815 )
816 if error_occured[0]:
--> 817 raise embedded.RRuntimeError(_rinterface._geterrmessage())
818 return res
RRuntimeError: Error in .local(x, ...) : size factors should be positive
I toyed around with ipython but did not really know what to look for. Can you direct me?
In [15]: print(adata)
AnnData object with n_obs × n_vars = 10806 × 89916
obs: 'sample', 'batch', 'label', 'sample_original'
layers: 'counts'
In [18]: print(sce)
class: SingleCellExperiment
dim: 89916 10806
metadata(0):
assays(2): X counts
rownames(89916): ACADM AGL ... TTTY17C-A LOC102724004-A
rowData names(0):
colnames(10806): ATACCGAAGACGTCCC TCATGCCTCTAATTCC ... CATCGGGAGAAGCTCG
GTTCGCTAGAGATGCC
colData names(4): sample batch label sample_original
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
In [24]: kwargs = {}
...:
...: if len(adata.obs['batch'].unique()) > 1:
...: kwargs['batch'] = adata.obs['batch'].tolist()
...:
In [25]: adata.obs['batch'].unique()
Out[25]:
['TGFb_48h_JM_1h']
Categories (1, object): ['TGFb_48h_JM_1h']
This looks like a problem in the ambient RNA step: the size factors computed beforehand are somehow coming out non-positive. I haven't seen this often, as it is highly case specific, but the filtering step is usually the thing to check. If low-quality cells with very few expressed genes are not filtered out properly, the size factor estimates can come out as zero or negative. I don't have a great solution right now, but you probably either need to filter the h5 input yourself, or start from the filtered h5 matrices from scrnaseq (or the emptyDrops ones).
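One quick, hedged thing to look for in the ipython session (assuming the `adata` object printed above, with raw counts in `adata.layers['counts']`): check whether any cells or genes have zero total counts, since those are the usual source of non-positive size factor estimates.

```python
import numpy as np

# Per-cell and per-gene total counts from the raw counts layer
counts = adata.layers["counts"]
cell_totals = np.asarray(counts.sum(axis=1)).ravel()
gene_totals = np.asarray(counts.sum(axis=0)).ravel()

print("cells with zero counts:", int((cell_totals == 0).sum()))
print("genes with zero counts:", int((gene_totals == 0).sum()))
print("lowest per-cell totals:", np.sort(cell_totals)[:10])
```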
Just found a couple of threads if that helps.
1. https://support.bioconductor.org/p/106422/
Vivek Das, PhD
I've seen this several times; so far it was always due to very small or low-quality datasets. I don't think this is really a problem with the pipeline, as it would also occur when using the tool outside of it. But maybe we can come up with a better error message for future users.
I had just used default values. I had another look at https://nf-co.re/scdownstream/dev/docs/usage and noticed samplesheet attributes I had not taken into account before. I found
| Parameter | Description |
| --- | --- |
| min_genes | Minimum number of genes required for a cell to be considered. Defaults to 1. |
| min_cells | Minimum number of cells required for a gene to be considered. Defaults to 1. |
| min_counts_cell | Minimum number of counts required for a cell to be considered. Defaults to 1. |
| min_counts_gene | Minimum number of counts required for a gene to be considered. Defaults to 1. |
which I guess should be addressed. Are there accepted heuristics to determine reasonable settings from some basic statistics of the data? I looked at https://www.nature.com/articles/s41576-023-00586-w and, following a link to a book, found https://www.sc-best-practices.org/preprocessing_visualization/feature_selection.html, which aims for the most "deviant" genes, however exactly those are defined.
min_genes at 1 feels a bit low, since a single gene gives no context for interpreting a cell. Gut feeling tells me I would want at least some housekeeping genes to be present for a cell to be accepted, not just some arbitrary gene, so I would be inclined to raise that value to 100.
min_cells - half the cells per experiment? 3000?
min_counts_cell - half of sequencing depth / cell count?
min_counts_gene - 5?
I'll start another run with these parameters set and report if the error disappears. Maybe this thread develops towards something that could be added as a rough guide to the documentation.
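As a rough sanity check before that run - a hedged sketch, assuming one of the unified h5ad intermediates from the work directory (file name is a placeholder) and illustrative thresholds - scanpy can report how many cells and genes would survive without modifying anything:

```python
import scanpy as sc

adata = sc.read_h5ad("TGFb_48h_unified.h5ad")  # placeholder file name

# inplace=False only returns boolean masks plus the per-cell/per-gene numbers,
# so candidate cutoffs can be previewed without changing the object.
cell_mask, genes_per_cell = sc.pp.filter_cells(adata, min_genes=100, inplace=False)
gene_mask, cells_per_gene = sc.pp.filter_genes(adata, min_cells=5, inplace=False)

print(f"cells kept: {cell_mask.sum()} of {adata.n_obs}")
print(f"genes kept: {gene_mask.sum()} of {adata.n_vars}")
```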
Edit: After yet another reread I found the column "raw" and the comment that the h5ad data linked to in the file column should be filtered - which mine was not. The raw column needs to be added in addition to the file column, which I suggest mentioning in the usage description. The error does not go away, though. Is that because the filtering is meant to happen within scanpy_filter, which is executed only after the ambient_rna_removal?
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_READH5 -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READRDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READCSV -
[6b/6e0e12] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_UNIFY (Con) [100%] 6 of 6 ✔
[27/a7bad9] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_RAW (T0) [100%] 6 of 6 ✔
[d5/5f55d7] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:CELDA_DECONTX (TGFb_48h) [100%] 6 of 6, failed: 6 ✘
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_FILTER -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:SCANPY_SCRUBLET -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:DOUBLET_REMOVAL -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_FILTERED
I added documentation for the manual filtering and raw column only yesterday, sorry for the inconvenience.
The default cutoffs are 1 so that all-zero genes and cells cannot occur (which would break some things). The single-cell best practices tutorial is a good starting point for determining sensible cutoffs with their MADs-based approach.
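For reference, a minimal hedged sketch of that MADs-based approach (the 5-MADs threshold, the use of `adata.X` as counts and the file name are assumptions taken from the tutorial, not pipeline defaults):

```python
import numpy as np
import scanpy as sc
from scipy.stats import median_abs_deviation

adata = sc.read_h5ad("sample_unified.h5ad")  # placeholder input
sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=True, inplace=True)

def is_outlier(values: np.ndarray, nmads: int = 5) -> np.ndarray:
    # Flag values further than nmads median absolute deviations from the median
    return np.abs(values - np.median(values)) > nmads * median_abs_deviation(values)

outlier = (
    is_outlier(adata.obs["log1p_total_counts"].to_numpy())
    | is_outlier(adata.obs["log1p_n_genes_by_counts"].to_numpy())
)
print(f"flagged {outlier.sum()} of {adata.n_obs} cells as QC outliers")
adata = adata[~outlier].copy()
```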
The `raw` column is only necessary when used with one of the following methods for ambient RNA removal: `cellbender`, `scAR` or `soupX`. When one of these methods is specified via the according parameter, you will need to input the *filtered* output from cellranger as the `file` column and the *unfiltered* cellranger output as the `raw` column. For other methods of ambient RNA removal (such as `decontX`), it will probably work best if you provide an input as raw and unfiltered as possible via the `file` column and provide sensible manual cutoffs via the corresponding cutoff columns. That's why the column is called `file` and not `filtered`.
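If in doubt whether a given pair of matrices actually has that filtered/unfiltered relationship, a quick hedged check (file names here are placeholders) is to confirm that the filtered barcodes are a subset of the raw droplet barcodes and that the raw matrix is much larger:

```python
import scanpy as sc

filtered = sc.read_h5ad("sample_filtered.h5ad")  # candidate for the `file` column
raw = sc.read_h5ad("sample_raw.h5ad")            # candidate for the `raw` column

print("filtered cells:", filtered.n_obs, "| raw droplets:", raw.n_obs)
print("filtered barcodes are a subset of raw:",
      set(filtered.obs_names) <= set(raw.obs_names))
```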
> I added documentation for the manual filtering and raw column only yesterday, sorry for the inconvenience.

;) Thank you tons for saying this, I was already doubting myself.
> The default cutoffs are 1 so that all-zero genes and cells cannot occur (which would break some things). The single-cell best practices tutorial is a good starting point for determining sensible cutoffs with their MADs-based approach.
>
> The `raw` column is only necessary when used with one of the following methods for ambient RNA removal: `cellbender`, `scAR` or `soupX`. When one of these methods is specified via the according parameter, you will need to input the *filtered* output from cellranger as the `file` column and the *unfiltered* cellranger output as the `raw` column.
Now, I do not have any cellranger output generated from the scrnaseq pipeline.
> For other methods of ambient RNA removal (such as `decontX`), it will probably work best if you provide an input as raw and unfiltered as possible via the `file` column and provide sensible manual cutoffs via the corresponding cutoff columns. That's why the column is called `file` and not `filtered`.
I have set file to the same as raw for now, just to see what I get. If that is not a good idea then please allow me to prepare a PR that prevents this from happening - it is just too tempting not to try.
From what I read now, I should leave everything as it is and only change the method for ambient RNA removal. But which one? My nf-core/scrnaseq run was initiated with (mostly) default settings only:
nextflow run nf-core/scrnaseq --input ./samplesheet.csv --aligner star --outdir ./results --genome GRCh38 --protocol 10XV3 -profile singularity -resume
which I will happily restart in some other way (please direct me). So far I only have the folders
$ ls results/
alevin alevinqc fastqc multiqc pipeline_info
I somehow recall having read about cellbender in the scrnaseq documentation but could not find it again on rereading. The parameters page of scdownstream seems to be gone; in the code I found
params.ambient_removal = 'soupx'
which I added to my nextflow.config, which then generated the following error - you already indicated that I would need a different kind of input for the 'file' column, so I did not expect this to work, but the error message may nonetheless be of interest:
executor > slurm (18)
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_READH5 -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READRDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READCSV -
[5a/967a36] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_UNIFY (Con) [100%] 6 of 6 ✔
[f4/e83c78] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_RAW (JM_24h) [100%] 6 of 6 ✔
[2e/6a7098] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX (TGFb_48h) [100%] 6 of 6, failed: 6 ✘
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_FILTER -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:SCANPY_SCRUBLET -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:DOUBLET_REMOVAL -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_FILTERED -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_MERGE -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_UPSETGENES -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:SCVITOOLS_SCVI -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_READRDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_NEIGHBORS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_UMAP -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_LEIDEN -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_EXTEND -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:MULTIQC -
-[nf-core/scdownstream] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX (JM_24h)'
Caused by:
Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX (JM_24h)` terminated with an error exit status (1)
Command executed [/home/sm718/.nextflow/assets/nf-core/scdownstream/./workflows/../subworkflows/local/./../../modules/local/soupx/templates/soupx.py]:
#!/usr/bin/env python3
import scanpy
import anndata2ri
import rpy2
import rpy2.robjects as ro
import platform
soupx = ro.packages.importr('SoupX')
def format_yaml_like(data: dict, indent: int = 0) -> str:
    """Formats a dictionary to a YAML-like string.

    Args:
        data (dict): The dictionary to format.
        indent (int): The current indentation level.

    Returns:
        str: A string formatted as YAML.
    """
    yaml_str = ""
    for key, value in data.items():
        spaces = "  " * indent
        if isinstance(value, dict):
            yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
        else:
            yaml_str += f"{spaces}{key}: {value}\n"
    return yaml_str
adata = scanpy.read_h5ad("JM_24h_unified.h5ad")
adata_pp = adata.copy()
scanpy.pp.normalize_per_cell(adata_pp)
scanpy.pp.log1p(adata_pp)
scanpy.pp.pca(adata_pp)
scanpy.pp.neighbors(adata_pp)
scanpy.tl.leiden(adata_pp, key_added="soupx_groups")
soupx_groups = anndata2ri.py2rpy(adata_pp.obs["soupx_groups"])
del adata_pp
sce = anndata2ri.py2rpy(adata)
adata_raw = scanpy.read_h5ad("JM_24h_matrix.h5ad")
sce_raw = anndata2ri.py2rpy(adata_raw)
get_counts = ro.r("function(sce) { assay(sce, 'X') }")
data = get_counts(sce)
data_raw = get_counts(sce_raw)
sc = soupx.SoupChannel(data_raw, data, calcSoupProfile = False)
soupProf = ro.r("function(data) { data.frame(row.names = rownames(data), est = rowSums(data)/sum(data), counts = rowSums(data)) }")(data)
sc = soupx.setSoupProfile(sc, soupProf)
sc = soupx.setClusters(sc, soupx_groups)
sc = soupx.autoEstCont(sc, doPlot = False)
out = soupx.adjustCounts(sc, roundToInt = False)
adata.layers["ambient"] = anndata2ri.rpy2py(out).T
adata.write_h5ad("JM_24h_soupX.h5ad")
# Versions
versions = {
    "NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX": {
        "python": platform.python_version(),
        "scanpy": scanpy.__version__,
        "anndata2ri": anndata2ri.__version__,
        "rpy2": rpy2.__version__,
        "SoupX": soupx.__version__,
    }
}

with open("versions.yml", "w") as f:
    f.write(format_yaml_like(versions))
Command exit status:
1
Command output:
WARNING: The R package "reticulate" only fixed recently
an issue that caused a segfault when used with rpy2:
https://github.com/rstudio/reticulate/pull/1188
Make sure that you use a version of that package that includes
the fix.
Command error:
.command.sh:37: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.
To achieve the future defaults please pass: flavor="igraph" and n_iterations=2. directed must also be False to work with igraph's implementation.
scanpy.tl.leiden(adata_pp, key_added="soupx_groups")
R[write to console]: Error in (function (sc, clusters) :
Invalid cluster specification. See help.
Traceback (most recent call last):
File ".command.sh", line 56, in <module>
sc = soupx.setClusters(sc, soupx_groups)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/rpy2/robjects/functions.py", line 208, in __call__
return (super(SignatureTranslatedFunction, self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/rpy2/robjects/functions.py", line 131, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
cdata = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/rpy2/rinterface.py", line 817, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in (function (sc, clusters) :
Invalid cluster specification. See help.
Work dir:
/data/....../nextflow_scdownstream/work/e9/d90c51ace0000e8ca17014b16f7a04
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
-- Check '.nextflow.log' file for details
I then also tried with the scAR method, but it asks me to lower the prob parameter - truly sounds like my data is unexpectedly sparse:
executor > slurm (7)
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_READH5 -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READRDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READCSV -
[8a/071b3a] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_UNIFY (T0) [100%] 6 of 6, cached: 6 ✔
[17/61a455] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_RAW (TGFb_48h_JM_1h) [100%] 6 of 6, cached: 6 ✔
[a4/94a4ce] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (T0) [ 55%] 5 of 9, failed: 5, retries: 3
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_FILTER -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:SCANPY_SCRUBLET -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:DOUBLET_REMOVAL -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_FILTERED -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_MERGE -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_UPSETGENES -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:SCVITOOLS_SCVI -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_READRDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_NEIGHBORS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_UMAP -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_LEIDEN -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_EXTEND -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_TORDS -
[- ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:MULTIQC -
Execution cancelled -- Finishing pending tasks before exit
[b6/2a178e] NOTE: Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (T0)` terminated with an error exit status (134) -- Execution is retried (1)
[7a/c1029d] NOTE: Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (TGFb_48h)` terminated with an error exit status (137) -- Execution is retried (1)
[f6/846c54] NOTE: Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (JM_1h)` terminated with an error exit status (137) -- Execution is retried (1)
ERROR ~ Error executing process > 'NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (JM_24h)'
Caused by:
Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (JM_24h)` terminated with an error exit status (1)
Command executed [/home/sm718/.nextflow/assets/nf-core/scdownstream/./workflows/../subworkflows/local/./../../modules/local/scvitools/scar/templates/scar.py]:
#!/usr/bin/env python3
import scvi
import anndata as ad
from scvi.external import SCAR
import platform
from threadpoolctl import threadpool_limits
threadpool_limits(int("6"))
def format_yaml_like(data: dict, indent: int = 0) -> str:
    """Formats a dictionary to a YAML-like string.

    Args:
        data (dict): The dictionary to format.
        indent (int): The current indentation level.

    Returns:
        str: A string formatted as YAML.
    """
    yaml_str = ""
    for key, value in data.items():
        spaces = "  " * indent
        if isinstance(value, dict):
            yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
        else:
            yaml_str += f"{spaces}{key}: {value}\n"
    return yaml_str
adata = ad.read_h5ad("JM_24h_unified.h5ad")
adata_raw = ad.read_h5ad("JM_24h_matrix.h5ad")
# TODO: Find out why the batch_key='batch' argument causes an error.
SCAR.setup_anndata(adata)
SCAR.get_ambient_profile(adata, adata_raw)
vae = SCAR(adata)
# The training fails if the number of entries in a minibatch is 1.
# This happens if the total number of entries modulo the batch size is 1.
# We therefore decrease the batch size until this is not the case.
batch_size = 128
worked = False
while not worked:
    try:
        vae.train(
            batch_size=batch_size,
            early_stopping=True
        )
        worked = True
    except ValueError as e:
        batch_size -= 1
        if batch_size < 125:
            raise e
adata.layers["ambient"] = vae.get_denoised_counts()
adata.write_h5ad("JM_24h.h5ad")
# Versions
versions = {
    "NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR": {
        "python": platform.python_version(),
        "anndata": ad.__version__,
        "scvi": scvi.__version__
    }
}

with open("versions.yml", "w") as f:
    f.write(format_yaml_like(versions))
Command exit status:
1
Command output:
Randomly sampling 50000 droplets to calculate the ambient profile.
Working...: 0%| | 0/3 [00:00<?, ?it/s]
Working...: 0%| | 0/3 [00:03<?, ?it/s]
Command error:
/opt/conda/lib/python3.12/site-packages/scvi/external/scar/_model.py:242: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
raw_adata.obs["log_prob"] = log_prob
Traceback (most recent call last):
File ".command.sh", line 37, in <module>
SCAR.get_ambient_profile(adata, adata_raw)
File "/opt/conda/lib/python3.12/site-packages/scvi/external/scar/_model.py", line 253, in get_ambient_profile
raise Exception("Too few emptydroplets. Lower the prob parameter")
Exception: Too few emptydroplets. Lower the prob parameter
Work dir:
/data/....../nextflow_scdownstream/work/c8/093049cab61692cdfd8925c011c648
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
-- Check '.nextflow.log' file for details
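For reference, a hedged sketch of probing this outside the pipeline (reusing the two h5ad files the module reads; the lowered `prob` value is only an illustration, see the scvi-tools scAR documentation for its exact meaning):

```python
import anndata as ad
from scvi.external import SCAR

adata = ad.read_h5ad("JM_24h_unified.h5ad")
adata_raw = ad.read_h5ad("JM_24h_matrix.h5ad")

SCAR.setup_anndata(adata)
# The exception above is raised during ambient-profile estimation; per the
# error message, lowering `prob` relaxes which droplets are treated as empty.
SCAR.get_ambient_profile(adata, adata_raw, prob=0.9)
```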
I am only starting to grasp how this pipeline works - would you say it is already time to ask for more sequencing depth (the same cells resequenced)?
Okay I will try to explain this again:
Cellranger provides you with two matrices, a filtered and an unfiltered one. Details about their properties can be found here. Using the same matrix for both fields violates this definition. If you come from a different quantification method, you might be able to replicate this structure in some way, but that is not in the scope of this pipeline. There are three methods for ambient RNA removal which are only meant to be used if the pair of matrices follows the expected structure: `cellbender`, `soupX` and `scAR`.

If you do not have the two matrices, you should still be able to use `decontX` for ambient RNA removal - which brings us back to the original purpose of this issue. In this case, you do not have to specify the `raw` field. The `file` matrix should be as raw and unfiltered as possible, with thresholds defined in the other fields of the samplesheet. For the problem with the negative size factors, you might want to play around a bit with filtering, maybe excluding genes with very low counts. If you find a solution, I might add it to the pipeline.

It might be interesting to add an option `none` for ambient RNA removal, to disable it entirely.
Can we have such a paragraph in the documentation (would you mind an attempt at a PR for it)? My confusion was that cellranger and cellbender somehow occupied the same place in my brain. I would also like to point out that cellranger is run within the scrnaseq pipeline.
Meta: Should we find default settings for scrnaseq and scdownstream that are as robust as possible? That would bring some early success, and one could then still see what to improve or what extra analyses to add.
Feel free to open a corresponding documentation PR - you probably have a better understanding of what could potentially confuse people than I have.
> Meta: Should we find default settings for scrnaseq and scdownstream that are as robust as possible? That would bring some early success, and one could then still see what to improve or what extra analyses to add.
So the problem we have is that `alevin` is the default aligner in `scrnaseq`. As `alevin` does not produce the filtered/unfiltered combination that `cellranger` does, the only applicable method for ambient RNA removal is `decontX`, so I think it makes sense to keep `decontX` as the default. Also, I just realised that `decontX` can make use of filtered/unfiltered combinations as well, which is not implemented yet - so it is the only option that can be used with all kinds of data.

Now we have the problem that the `alevin` output from `scrnaseq` apparently can lead to errors when used with `decontX` without previous filtering. I think it would be reasonable to perform minimal filtering of the input whenever `decontX` is used without a `raw` file. We just need to find out what this minimal filtering could look like.
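As a first, hedged guess at what that minimal filtering could look like (the thresholds and the file name are placeholders, not tested defaults):

```python
import scanpy as sc

adata = sc.read_h5ad("sample_unified.h5ad")  # placeholder for the pipeline intermediate

# Drop barcodes and genes with (near-)zero totals so that every cell has a
# strictly positive library size - the likely cause of the
# "size factors should be positive" error inside decontX.
sc.pp.filter_cells(adata, min_counts=10)
sc.pp.filter_genes(adata, min_counts=1)

adata.write_h5ad("sample_prefiltered.h5ad")
```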
Would it be ok for you to provide me with the input data that leads to your `decontX` problems? Then I could try to find out how we can prevent these problems.
Could you try whether the `filtering` branch (#62) solves this problem?
I did not observe the original issue any more.
Description of the bug
I wish I could state anything that would sound helpful here. The script triggering the issue is ./modules/local/celda/decontx/templates/decontx.py, and I agree that size factors typically should not be negative.
Would you by chance have some debug statements I should add?
I have started out with the h5ad files that were generated by nf-core/scrnaseq. Should I go for another matrix format? But this error is already too far downstream for input formats to be of relevance, right?
Command used and terminal output
Relevant files
.command.out was empty. .command.sh is the Python script also presented to stdout above. The only other file is the h5ad input file that was generated by nf-core/scrnaseq.
System information
$ nextflow -version
Hardware: HPC
Executor: slurm
Container engine: Singularity
OS: CentOS Linux
Version of nf-core/scdownstream: -r dev