vib-singlecell-nf / scanpy

A repository for the scanpy module of the vib-singlecell-nf pipeline
3 stars 0 forks source link

Parameter exploration for clustering fails for some resolutions #41

Closed cflerin closed 4 years ago

cflerin commented 4 years ago

In the pipeline v0.16.0, using the following config:

         clustering {
            report_ipynb = '/src/scanpy/bin/reports/sc_clustering_report.ipynb'
            methods = ['louvain']
            resolutions = [0.4,0.8,1.0,1.2,1.6,2.0,4.0,8.0]
            off = 'h5ad'
         }

and the bbknn entry point, the error is:


------------------------------------------------------------------
 Aggregating multiple .h5ad files to Reyfman_healthy-donor.SC__SCANPY__DIM_REDUCTION_UMAP.louvain_0.4.loom
(w/ additional compression)...
------------------------------------------------------------------

WARN: Killing pending tasks (1)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Error executing process > 'bbknn_scenic:BBKNN:BEC_BBKNN:CLUSTER_IDENTIFICATION:SC__SCANPY__PARAM_EXPLORE_MARKER_GENES (8)'

Caused by:
  Process `bbknn_scenic:BBKNN:BEC_BBKNN:CLUSTER_IDENTIFICATION:SC__SCANPY__PARAM_EXPLORE_MARKER_GENES (8)` terminated with an error exit status (1)

Command executed:

  /user/leuven/325/vsc32528/.nextflow/assets/vib-singlecell-nf/vsn-pipelines/src/scanpy/bin/cluster/sc_marker_genes.py                  --method wilcoxon                       --groupby louvain           --ngenes 0                       Reyfman_healthy-donor.SC__SCANPY__DATA_TRANSFORMATION.h5ad                      Reyfman_healthy-donor.SC__SCANPY__PARAM_EXPLORE_CLUSTERING.h5ad                         "Reyfman_healthy-donor.SC__SCANPY__PARAM_EXPLORE_MARKER_GENES.louvain_8.0.h5ad"

Command exit status:
  1

Command output:
  The sc.tl.rank_genes_groups function expects log normalized data. Updating .raw slot by the .X slot from the AnnData generated by the normalize_transform step...
  Ranking genes for characterizing groups (aka sc.tl.rank_genes_groups) using wilcoxon method...

Command error:
  /opt/venv/lib/python3.7/site-packages/louvain/Optimiser.py:349: SyntaxWarning: assertion is always true, perhaps remove parentheses?
    assert(issubclass(partition_type, LinearResolutionParameterVertexPartition),
  Traceback (most recent call last):
    File "/user/leuven/325/vsc32528/.nextflow/assets/vib-singlecell-nf/vsn-pipelines/src/scanpy/bin/cluster/sc_marker_genes.py", line 94, in <module>
      n_genes=ngenes
    File "/opt/venv/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py", line 366, in rank_genes_groups
      means[imask], vars[imask] = _get_mean_var(X[mask]) #for fold-change
    File "/opt/venv/lib/python3.7/site-packages/scanpy/preprocessing/_utils.py", line 18, in _get_mean_var
      var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
  ZeroDivisionError: division by zero

Work dir:
  /ddn1/vol1/staging/leuven/stg_00002/lcb/cflerin/analysis/GSE122960_Reyfman/donor/work/4d/0276b058064fe89ece7e61a3937e3b

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

This seems to be a problem internal to the Scanpy function, but it must be related to the input data somehow. The input expression matrix is 41994 cells and 1612 genes.

When I remove the 8.0 resolution from the parameters:

resolutions = [0.4,0.8,1.0,1.2,1.6,2.0,4.0]

the pipeline completes without error.

dweemx commented 4 years ago

Another solution would be to do the DE with rank_gene_groups for each cluster oneself where you'd then have control and properly handle the case for singlet clusters. For now 415e70f is reasonable.