nf-core / scdownstream

A single cell transcriptomics pipeline for QC, integration and making the data presentable
https://nf-co.re/scdownstream
MIT License
5 stars 3 forks source link

CELDA_DECONTX - bails out over negative size factors #51

Open smoe opened 4 days ago

smoe commented 4 days ago

Description of the bug

I wish I could state anything that would sound helpful here. The script triggering the issue is ./modules/local/celda/decontx/templates/decontx.py and I agree that sizes typically should not be negative.

Would you by chance have some debug statements I should add?

I have started out with the h5ad file that were generated by nf-core/scrnaseq. Should I go for another matrix format? But this error is already too far downstream for input formats to be of relevance, right?

Command used and terminal output

$ nextflow run nf-core/scdownstream -profile singularity --input samplesheet.csv --outdir results_20240701 -r dev

...

ERROR ~ Error executing process > 'NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:CELDA_DECONTX (TGFb_48h_JM_1h)'

Caused by:
  Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:CELDA_DECONTX (TGFb_48h_JM_1h)` terminated with an error exit status (1)

Command executed [/home/sm718/.nextflow/assets/nf-core/scdownstream/./workflows/../subworkflows/local/./../../modules/local/celda/decontx/templates/decontx.py]:

  #!/usr/bin/env python3

  import anndata as ad
  import anndata2ri
  import rpy2
  import rpy2.robjects as ro
  import platform
  celda = ro.packages.importr('celda')

  def format_yaml_like(data: dict, indent: int = 0) -> str:
      """Formats a dictionary to a YAML-like string.

      Args:
          data (dict): The dictionary to format.
          indent (int): The current indentation level.

      Returns:
          str: A string formatted as YAML.
      """
      yaml_str = ""
      for key, value in data.items():
          spaces = "  " * indent
          if isinstance(value, dict):
              yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
          else:
              yaml_str += f"{spaces}{key}: {value}\n"
      return yaml_str

  adata = ad.read_h5ad("TGFb_48h_JM_1h_unified.h5ad")
  sce = anndata2ri.py2rpy(adata)

  kwargs = {}

  if len(adata.obs['batch'].unique()) > 1:
      kwargs['batch'] = adata.obs['batch'].tolist()

  corrected = celda.decontX(sce, **kwargs)
  counts = celda.decontXcounts(corrected)

  adata.layers['ambient'] = anndata2ri.rpy2py(counts).T
  adata.write_h5ad("TGFb_48h_JM_1h_decontx.h5ad")

  # Versions

  versions = {
      "NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:CELDA_DECONTX": {
          "python": platform.python_version(),
          "anndata": ad.__version__,
          "anndata2ri": anndata2ri.__version__,
          "rpy2": rpy2.__version__,
          "celda": celda.__version__,
      }
  }

  with open("versions.yml", "w") as f:
      f.write(format_yaml_like(versions))

Command exit status:
  1

Command output:
  (empty)

Command error:
  R[write to console]: --------------------------------------------------

  R[write to console]: Starting DecontX

  R[write to console]: --------------------------------------------------

  R[write to console]: Mon Jul  1 21:29:28 2024 .. Analyzing all cells

  R[write to console]: Mon Jul  1 21:29:28 2024 .... Generating UMAP and estimating cell types

  R[write to console]: Error in .local(x, ...) : size factors should be positive

  Traceback (most recent call last):
    File ".command.sh", line 37, in <module>
      corrected = celda.decontX(sce, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/robjects/functions.py", line 208, in __call__
      return (super(SignatureTranslatedFunction, self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/robjects/functions.py", line 131, in __call__
      res = super(Function, self).__call__(*new_args, **new_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
      cdata = function(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/rinterface.py", line 817, in __call__
      raise embedded.RRuntimeError(_rinterface._geterrmessage())
  rpy2.rinterface_lib.embedded.RRuntimeError: Error in .local(x, ...) : size factors should be positive

Work dir:
  /data/sm718/IBIMA/Antifibrotix_SingleCell_Culture/nextflow_scdownstream/work/cc/57b6f68e5ee01374bbf3366910a36b

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

Relevant files

.command.out was empty .command.sh is that Python script also presented to stdout The only other file is the h5ad input file that was nf-core/scrnaseq-generated.

System information

$ nextflow -version

  N E X T F L O W
  version 24.04.2 build 5914
  created 29-05-2024 06:19 UTC (08:19 CEST)
  cite doi:10.1038/nbt.3820
  http://nextflow.io

Hardware: HPC Executor: slurm Container engine: Singularity OS: CentOS Linux Version of nf-core/scdownstream: -r dev

smoe commented 3 days ago

I can reproduce the error now locally with conda:

R[write to console]: --------------------------------------------------                                                                                                                              

R[write to console]: Mon Jul  1 22:56:54 2024 .. Analyzing all cells                                                                                                                                 

R[write to console]: Mon Jul  1 22:56:54 2024 .... Generating UMAP and estimating cell types                                                                                                         

R[write to console]: Error in .local(x, ...) : size factors should be positive                                                                                                                       

---------------------------------------------------------------------------                                                                                                                          
RRuntimeError                             Traceback (most recent call last)                                                                                                                          
File /data/sm718/IBIMA/Antifibrotix_SingleCell_Culture/nextflow_scdownstream/work/cc/57b6f68e5ee01374bbf3366910a36b/.command.sh:37                                                                   
     34 if len(adata.obs['batch'].unique()) > 1:                                                                                                                                                     
     35     kwargs['batch'] = adata.obs['batch'].tolist()                                                                                                                                            
---> 37 corrected = celda.decontX(sce, **kwargs)                                                                                                                                                     
     38 counts = celda.decontXcounts(corrected)                                                                                                                                                      
     40 adata.layers['ambient'] = anndata2ri.rpy2py(counts).T                                                                                                                                        

File ~/miniconda3/lib/python3.12/site-packages/rpy2/robjects/functions.py:208, in SignatureTranslatedFunction.__call__(self, *args, **kwargs)                                                        
    206         v = kwargs.pop(k)                                                                                                                                                                    
    207         kwargs[r_k] = v                                                                                                                                                                      
--> 208 return (super(SignatureTranslatedFunction, self)                                                                                                                                             
    209         .__call__(*args, **kwargs))                                                                                                                                                          

File ~/miniconda3/lib/python3.12/site-packages/rpy2/robjects/functions.py:131, in Function.__call__(self, *args, **kwargs)                                                                           
    129     else:                                                                                                                                                                                    
    130         new_kwargs[k] = cv.py2rpy(v)                                                                                                                                                         
--> 131 res = super(Function, self).__call__(*new_args, **new_kwargs)                                                                                                                                
    132 res = cv.rpy2py(res)                                                                                                                                                                         
    133 return res                                                                                                                                                                                   

File ~/miniconda3/lib/python3.12/site-packages/rpy2/rinterface_lib/conversion.py:45, in _cdata_res_to_rinterface.<locals>._(*args, **kwargs)                                                         
     44 def _(*args, **kwargs):                                                                                                                                                                      
---> 45     cdata = function(*args, **kwargs)                                                                                                                                                        
     46     # TODO: test cdata is of the expected CType                                                                                                                                              
     47     return _cdata_to_rinterface(cdata)                                                                                                                                                       

File ~/miniconda3/lib/python3.12/site-packages/rpy2/rinterface.py:817, in SexpClosure.__call__(self, *args, **kwargs)                                                                                
    810     res = rmemory.protect(                                                                                                                                                                   
    811         openrlib.rlib.R_tryEval(                                                                                                                                                             
    812             call_r,                                                                                                                                                                          
    813             call_context.__sexp__._cdata,
    814             error_occured)
    815     )
    816     if error_occured[0]:
--> 817         raise embedded.RRuntimeError(_rinterface._geterrmessage())
    818 return res

RRuntimeError: Error in .local(x, ...) : size factors should be positive
smoe commented 3 days ago

I toyed around with ipython but did not really know what to look for. Can you direct me?

In [15]: print(adata)
AnnData object with n_obs × n_vars = 10806 × 89916
    obs: 'sample', 'batch', 'label', 'sample_original'
    layers: 'counts'

In [18]: print(sce)
class: SingleCellExperiment 
dim: 89916 10806 
metadata(0):
assays(2): X counts
rownames(89916): ACADM AGL ... TTTY17C-A LOC102724004-A
rowData names(0):
colnames(10806): ATACCGAAGACGTCCC TCATGCCTCTAATTCC ... CATCGGGAGAAGCTCG
  GTTCGCTAGAGATGCC
colData names(4): sample batch label sample_original
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

In [24]:   kwargs = {}
    ...: 
    ...:   if len(adata.obs['batch'].unique()) > 1:
    ...:       kwargs['batch'] = adata.obs['batch'].tolist()
    ...: 

In [25]: adata.obs['batch'].unique()
Out[25]: 
['TGFb_48h_JM_1h']
Categories (1, object): ['TGFb_48h_JM_1h']
vd4mmind commented 3 days ago

This seems to be a problem with the ambient RNA step since the sizefactors being calculated earlier are somehow coming out as negative. Haven’t seen it much as this is highly case specific but filtering process often needs to be checked. Usually if low quality cells with few expressed genes aren’t properly filter then it can give negative size factor estimates. Don’t have a great solution as of now but probably the h5 filtering is needed or if you start with filtered h5 matrices from scnraseq or the empty drops one.

Just found a couple of threads if that helps.

1. https://support.bioconductor.org/p/106422/


Vivek Das, PhD

On Mon, Jul 1, 2024 at 10:58 PM Steffen Möller @.***> wrote:

I can reproduce the error now locally with conda:

R[write to console]: --------------------------------------------------

R[write to console]: Mon Jul 1 22:56:54 2024 .. Analyzing all cells

R[write to console]: Mon Jul 1 22:56:54 2024 .... Generating UMAP and estimating cell types

R[write to console]: Error in .local(x, ...) : size factors should be positive


RRuntimeError Traceback (most recent call last) File /data/sm718/IBIMA/Antifibrotix_SingleCell_Culture/nextflow_scdownstream/work/cc/57b6f68e5ee01374bbf3366910a36b/.command.sh:37 34 if len(adata.obs['batch'].unique()) > 1: 35 kwargs['batch'] = adata.obs['batch'].tolist() ---> 37 corrected = celda.decontX(sce, **kwargs) 38 counts = celda.decontXcounts(corrected) 40 adata.layers['ambient'] = anndata2ri.rpy2py(counts).T

File ~/miniconda3/lib/python3.12/site-packages/rpy2/robjects/functions.py:208, in SignatureTranslatedFunction.call(self, *args, *kwargs) 206 v = kwargs.pop(k) 207 kwargs[r_k] = v --> 208 return (super(SignatureTranslatedFunction, self) 209 .call(args, **kwargs))

File ~/miniconda3/lib/python3.12/site-packages/rpy2/robjects/functions.py:131, in Function.call(self, *args, *kwargs) 129 else: 130 new_kwargs[k] = cv.py2rpy(v) --> 131 res = super(Function, self).call(new_args, **new_kwargs) 132 res = cv.rpy2py(res) 133 return res

File ~/miniconda3/lib/python3.12/site-packages/rpy2/rinterface_lib/conversion.py:45, in _cdata_res_torinterface..(*args, kwargs) 44 def _(*args, *kwargs): ---> 45 cdata = function(args, kwargs) 46 # TODO: test cdata is of the expected CType 47 return _cdata_to_rinterface(cdata)

File ~/miniconda3/lib/python3.12/site-packages/rpy2/rinterface.py:817, in SexpClosure.call(self, *args, **kwargs) 810 res = rmemory.protect( 811 openrlib.rlib.R_tryEval( 812 call_r, 813 call_context.sexp._cdata, 814 error_occured) 815 ) 816 if error_occured[0]: --> 817 raise embedded.RRuntimeError(_rinterface._geterrmessage()) 818 return res

RRuntimeError: Error in .local(x, ...) : size factors should be positive

— Reply to this email directly, view it on GitHub https://github.com/nf-core/scdownstream/issues/51#issuecomment-2201040368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB43EC2BVG5EKONTZ6QH2ETZKG7HTAVCNFSM6AAAAABKGF75TGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBRGA2DAMZWHA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

nictru commented 3 days ago

I've seen this several times, so far it was always due to very small or low quality datasets. I don't think this is really a problem with the pipeline, as it would also occur when using the tool outside of it. But maybe we can come up with a better error message for future users.

smoe commented 3 days ago

I had just used default values. I had another look at https://nf-co.re/scdownstream/dev/docs/usage and it seems like previously unseen sample attributes have surfaced. I found

| min_genes | Minimum number of genes required for a cell to be considered. Defaults to 1. | | min_cells | Minimum number of cells required for a gene to be considered. Defaults to 1. | | min_counts_cell | Minimum number of counts required for a cell to be considered. Defaults to 1. | | min_counts_gene | Minimum number of counts required for a gene to be considered. Defaults to 1 |

which I guess should be addressed. Are there accepted heuristics to determine reasonable settings from some basic statistics of the data? I looked at https://www.nature.com/articles/s41576-023-00586-w and following a link to a book I found https://www.sc-best-practices.org/preprocessing_visualization/feature_selection.html that aims for the most deviant genes, however those are exactly defined.

min_genes at 1 feels a bit low since we would have no information on how to interpret that value, no context. Some gut feeling tells me I would want some house keeping genes to be present for a cell to be accepted, not just some arbitrary gene. I would feel like increasing that value to 100.

min_cells - half the cells per experiment? 3000?

min_counts_cell - half of sequencing depth / cell count?

min_counts_gene - 5?

I'll start another run with these parameters set and report if the error disappears. Maybe this thread develops towards somethign that could be added as a rough guide to the documentation.

Edit: After yet another reread I found the column "raw" and the comment that the h5ad data linked to in the file column should be filtered - which mine was not. The raw column needs to be added in addition to the file column, which I suggest to mention in the usage description. The error does not go away, though. Is that because the filtering is meant to happen within scanpy_filter which is executed only after the ambient_rna_removal?

[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_READH5                                -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READRDS                                -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READCSV                                -
[6b/6e0e12] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_UNIFY (Con)                            [100%] 6 of 6 ✔
[27/a7bad9] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_RAW (T0)                                  [100%] 6 of 6 ✔
[d5/5f55d7] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:CELDA_DECONTX (TGFb_48h) [100%] 6 of 6, failed: 6 ✘
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_FILTER                                -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:ADATA_TORDS                -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:SCANPY_SCRUBLET            -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:DOUBLET_REMOVAL            -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_FILTERED
nictru commented 3 days ago

I added documentation for the manual filtering and raw column only yesterday, sorry for the inconvenience.

The default cutoffs are 1 so that all-zero genes and cells cannot occur (which would break some things). The single-cell best practices tutorial is a good starting point for determining sensible cutoffs with their MADs-based approach.

The raw column is only necessary when used with one of the following methods for ambient RNA removal: cellbender, scAR or soupX. When one of these methods is specified via the according parameter, you will need to input the filtered output from cellranger as the file column and the unfiltered cellranger output as the raw column. For other methods of ambient RNA removal (such as decontX), it will probably work best if you provide an input as raw and unfiltered as possible via the file column and provide sensible manual cutoffs via the corresponding cutoff columns. That's why the column is called file and not filtered.

smoe commented 3 days ago

I added documentation for the manual filtering and raw column only yesterday, sorry for the inconvenience.

;) Thank you tons for saying this, I was already doubting me.

The default cutoffs are 1 so that all-zero genes and cells cannot occur (which would break some things). The single-cell best practices tutorial is a good starting point for determining sensible cutoffs with their MADs-based approach.

The raw column is only necessary when used with one of the following methods for ambient RNA removal: cellbender, scAR or soupX. When one of these methods is specified via the according parameter, you will need to input the filtered output from cellranger as the file column and the unfiltered cellranger output as the raw column.

Now, I do not have any cellranger output generated from the scrnaseq pipeline.

For other methods of ambient RNA removal (such as decontX), it will probably work best if you provide an input as raw and unfiltered as possible via the file column and provide sensible manual cutoffs via the corresponding cutoff columns. That's why the column is called file and not filtered.

I have set file to the same as raw for now, just to see what I get. If that is not a good idea then please allow me to prepare a PR that prevents this from happening - it is just too tempting not to try.

From what I read now, I should leave everything as it is and only change the method for ambient RNA removal. But which one. My nf-core/scrnaseq run was initiated with (mostly) defaults only

 nextflow run nf-core/scrnaseq --input ./samplesheet.csv —aligner star --outdir ./results --genome GRCh38 --protocol 10XV3 -profile singularity -resume

which I happily restart in some other way (please direct me) and I only have the folders

$ ls results/
alevin  alevinqc  fastqc  multiqc  pipeline_info

I somehow recall to have read about cellbender already in the scrnaseq documentation but could not find it again while rereading. The parameters page of scdownstream seems gone, in the code I found

params.ambient_removal = 'soupx'

which I added to my nextflow.config, which then generated the following error - you already indicated that I would need a different kind of input for the 'file' column, so I did not expect this to work, but the error message may nonetheless be of interest:

executor >  slurm (18)
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_READH5                        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READRDS                        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READCSV                        -
[5a/967a36] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_UNIFY (Con)                    [100%] 6 of 6 ✔
[f4/e83c78] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_RAW (JM_24h)                      [100%] 6 of 6 ✔
[2e/6a7098] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX (TGFb_48h) [100%] 6 of 6, failed: 6 ✘
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_FILTER                        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:ADATA_TORDS        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:SCANPY_SCRUBLET    -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:DOUBLET_REMOVAL    -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_FILTERED                          -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_MERGE                             -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_UPSETGENES                        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_TORDS                   -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:SCVITOOLS_SCVI                -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_READRDS                 -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_NEIGHBORS                        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_UMAP                             -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_LEIDEN                           -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_EXTEND                           -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_TORDS                            -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:MULTIQC                                         -
-[nf-core/scdownstream] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX (JM_24h)'

Caused by:
  Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX (JM_24h)` terminated with an error exit status (1)

Command executed [/home/sm718/.nextflow/assets/nf-core/scdownstream/./workflows/../subworkflows/local/./../../modules/local/soupx/templates/soupx.py]:

  #!/usr/bin/env python3

  import scanpy
  import anndata2ri
  import rpy2
  import rpy2.robjects as ro
  import platform
  soupx = ro.packages.importr('SoupX')

  def format_yaml_like(data: dict, indent: int = 0) -> str:
      """Formats a dictionary to a YAML-like string.

      Args:
          data (dict): The dictionary to format.
          indent (int): The current indentation level.

      Returns:
          str: A string formatted as YAML.
      """
      yaml_str = ""
      for key, value in data.items():
          spaces = "  " * indent
          if isinstance(value, dict):
              yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
          else:
              yaml_str += f"{spaces}{key}: {value}\n"
      return yaml_str

  adata = scanpy.read_h5ad("JM_24h_unified.h5ad")

  adata_pp = adata.copy()
  scanpy.pp.normalize_per_cell(adata_pp)
  scanpy.pp.log1p(adata_pp)

  scanpy.pp.pca(adata_pp)
  scanpy.pp.neighbors(adata_pp)
  scanpy.tl.leiden(adata_pp, key_added="soupx_groups")

  soupx_groups = anndata2ri.py2rpy(adata_pp.obs["soupx_groups"])
  del adata_pp

  sce = anndata2ri.py2rpy(adata)

  adata_raw = scanpy.read_h5ad("JM_24h_matrix.h5ad")
  sce_raw = anndata2ri.py2rpy(adata_raw)

  get_counts = ro.r("function(sce) { assay(sce, 'X') }")

  data = get_counts(sce)
  data_raw = get_counts(sce_raw)

  sc = soupx.SoupChannel(data_raw, data, calcSoupProfile = False)

  soupProf = ro.r("function(data) { data.frame(row.names = rownames(data), est = rowSums(data)/sum(data), counts = rowSums(data)) }")(data)
  sc = soupx.setSoupProfile(sc, soupProf)
  sc = soupx.setClusters(sc, soupx_groups)
  sc = soupx.autoEstCont(sc, doPlot = False)
  out = soupx.adjustCounts(sc, roundToInt = False)

  adata.layers["ambient"] = anndata2ri.rpy2py(out).T

  adata.write_h5ad("JM_24h_soupX.h5ad")

  # Versions

  versions = {
      "NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SOUPX": {
          "python": platform.python_version(),
          "scanpy": scanpy.__version__,
          "anndata2ri": anndata2ri.__version__,
          "rpy2": rpy2.__version__,
          "SoupX": soupx.__version__,
      }
  }

  with open("versions.yml", "w") as f:
      f.write(format_yaml_like(versions))

Command exit status:
  1

Command output:

      WARNING: The R package "reticulate" only fixed recently
      an issue that caused a segfault when used with rpy2:
      https://github.com/rstudio/reticulate/pull/1188
      Make sure that you use a version of that package that includes
      the fix.

Command error:
  .command.sh:37: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

   To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.
    scanpy.tl.leiden(adata_pp, key_added="soupx_groups")
  R[write to console]: Error in (function (sc, clusters)  :
    Invalid cluster specification.  See help.

  Traceback (most recent call last):
    File ".command.sh", line 56, in <module>
      sc = soupx.setClusters(sc, soupx_groups)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/robjects/functions.py", line 208, in __call__
      return (super(SignatureTranslatedFunction, self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/robjects/functions.py", line 131, in __call__
      res = super(Function, self).__call__(*new_args, **new_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
      cdata = function(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/rpy2/rinterface.py", line 817, in __call__
      raise embedded.RRuntimeError(_rinterface._geterrmessage())
  rpy2.rinterface_lib.embedded.RRuntimeError: Error in (function (sc, clusters)  :
    Invalid cluster specification.  See help.

Work dir:
  /data/....../nextflow_scdownstream/work/e9/d90c51ace0000e8ca17014b16f7a04

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

I then also tried with the scAR method, but it asks me to lower the prob parameter - truly sounds like my data is unexpectedly sparse:

executor >  slurm (7)
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_READH5                               -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READRDS                               -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_READCSV                               -
[8a/071b3a] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:ADATA_UNIFY (T0)                            [100%] 6 of 6, cached: 6 ✔
[17/61a455] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_RAW (TGFb_48h_JM_1h)                     [100%] 6 of 6, cached: 6 ✔
[a4/94a4ce] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (T0)     [ 55%] 5 of 9, failed: 5, retries: 3
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:SCANPY_FILTER                               -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:ADATA_TORDS               -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:SCANPY_SCRUBLET           -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:DOUBLET_DETECTION:DOUBLET_REMOVAL           -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:QC_FILTERED                                 -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_MERGE                                    -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:ADATA_UPSETGENES                               -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_TORDS                          -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:SCVITOOLS_SCVI                       -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:COMBINE:INTEGRATE:ADATA_READRDS                        -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_NEIGHBORS                               -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_UMAP                                    -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:CLUSTER:SCANPY_LEIDEN                                  -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_EXTEND                                  -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:FINALIZE:ADATA_TORDS                                   -
[-        ] process > NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:MULTIQC                                                -
Execution cancelled -- Finishing pending tasks before exit
[b6/2a178e] NOTE: Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (T0)` terminated with an error exit status (134) -- Execution is retried (1)
[7a/c1029d] NOTE: Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (TGFb_48h)` terminated with an error exit status (137) -- Execution is retried (1)
[f6/846c54] NOTE: Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (JM_1h)` terminated with an error exit status (137) -- Execution is retried (1)
ERROR ~ Error executing process > 'NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (JM_24h)'

Caused by:
  Process `NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR (JM_24h)` terminated with an error exit status (1)

Command executed [/home/sm718/.nextflow/assets/nf-core/scdownstream/./workflows/../subworkflows/local/./../../modules/local/scvitools/scar/templates/scar.py]:

  #!/usr/bin/env python3

  import scvi
  import anndata as ad
  from scvi.external import SCAR
  import platform

  from threadpoolctl import threadpool_limits
  threadpool_limits(int("6"))

  def format_yaml_like(data: dict, indent: int = 0) -> str:
      """Formats a dictionary to a YAML-like string.

      Args:
          data (dict): The dictionary to format.
          indent (int): The current indentation level.

      Returns:
          str: A string formatted as YAML.
      """
      yaml_str = ""
      for key, value in data.items():
          spaces = "  " * indent
          if isinstance(value, dict):
              yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
          else:
              yaml_str += f"{spaces}{key}: {value}\n"
      return yaml_str

  adata = ad.read_h5ad("JM_24h_unified.h5ad")
  adata_raw = ad.read_h5ad("JM_24h_matrix.h5ad")

  # TODO: Find out why the batch_key='batch' argument causes an error.
  SCAR.setup_anndata(adata)
  SCAR.get_ambient_profile(adata, adata_raw)

  vae = SCAR(adata)

  # The training fails if the number of entries in a minibatch is 1.
  # This happens if the total number of entries modulo the batch size is 1.
  # We therefore decrease the batch size until this is not the case.
  batch_size = 128
  worked = False
  while not worked:
      try:
          vae.train(
              batch_size=batch_size,
              early_stopping=True
          )
          worked = True
      except ValueError as e:
          batch_size -= 1

          if batch_size < 125:
              raise e

  adata.layers["ambient"] = vae.get_denoised_counts()

  adata.write_h5ad("JM_24h.h5ad")

  # Versions

  versions = {
      "NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:PREPROCESS:AMBIENT_RNA_REMOVAL:SCVITOOLS_SCAR": {
          "python": platform.python_version(),
          "anndata": ad.__version__,
          "scvi": scvi.__version__
      }
  }

  with open("versions.yml", "w") as f:
      f.write(format_yaml_like(versions))

Command exit status:
  1

Command output:
  Randomly sampling 50000 droplets to calculate the ambient profile.

  Working...:   0%|          | 0/3 [00:00<?, ?it/s]
  Working...:   0%|          | 0/3 [00:03<?, ?it/s]

Command error:
  /opt/conda/lib/python3.12/site-packages/scvi/external/scar/_model.py:242: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
    raw_adata.obs["log_prob"] = log_prob
  Traceback (most recent call last):
    File ".command.sh", line 37, in <module>
      SCAR.get_ambient_profile(adata, adata_raw)
    File "/opt/conda/lib/python3.12/site-packages/scvi/external/scar/_model.py", line 253, in get_ambient_profile
      raise Exception("Too few emptydroplets. Lower the prob parameter")
  Exception: Too few emptydroplets. Lower the prob parameter

Work dir:
  /data/....../nextflow_scdownstream/work/c8/093049cab61692cdfd8925c011c648

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

Me only starting to come to grasp how this pipeline works, would you tend to think it is already time to ask to increase the depth (same cells resequenced)?

nictru commented 2 days ago

Okay I will try to explain this again:

Cellranger provides you with two matrices, a filtered and an unfiltered one. Details about their properties can be found here. Using the same matrix for both fields violates this definition. If you come from a different quantification method, you might be able to replicate this structure in some way, but this is not in the scope of this pipeline. There are three methods for ambient RNA removal which are only meant to be used if the pair of matrices follows the expected structure: cellbender, soupX and scAR.

If you do not have the two matrices, you should still be able to use decontX for ambient RNA removal - which brings us back to the original purpose of this issue. In this case, you do not have to specify the raw field. The file matrix should be as raw and unfiltered as possible, with thresholds defined in the other fields of the samplesheet. For the problem with the negative size factors, you might want to play around a bit with filtering, maybe exclude genes with very low counts. If you found a solution, I might add it to the pipeline.

It might be interesting to add an option none for ambient RNA removal, to disable it entirely.

smoe commented 1 day ago

Can we have (would you not mind an attempt of a PR for) such a paragraph in the documentation? My confusion was cellranger and cellbender somehow occupied the same place in my brain. I would also like to point out that cellranger is ran within the scrnaseq pipeline.

Meta: Should we find default settings for scrnaseq and scdownstream that are the most possible robust ones? This would bring some success early and then one can still see what to improve / what extra analyses to add.

nictru commented 1 day ago

Feel free to open an according documentation PR - you probably have a better understanding of what could potentially confuse people than I have.

Meta: Should we find default settings for scrnaseq and scdownstream that are the most possible robust ones? This would bring some success early and then one can still see what to improve / what extra analyses to add.

So the problem we have is that alevin is the default aligner in scrnaseq. As alevin does not produce the filtered-unfiltered-combination as cellranger does, the only applicable method for ambient RNA removal is decontX, so I think it makes sense to keep decontX as the default. Also I just realised decontX can also make use of filtered-unfiltered-combinations which is not implemented yet - so it is the only option that can be used with all kinds of data.

Now we have the problem that the alevin output from scrnaseq apparently can lead to errors when used with decontX without previous filtering. I think it would be reasonable to perform minimal filtering of the input whenever decontX is used without a raw file. We just need to find out what this minimal filtering could look like

nictru commented 1 day ago

Would it be ok for you to provide me with the input data that leads to your decontX problems? Could try to find out how we can prevent these problems then.