theislab / scib-pipeline

Snakemake pipeline that works with the scIB package to benchmark data integration methods.
MIT License

error running pipeline #44

Closed littleju714 closed 2 years ago

littleju714 commented 2 years ago

Hi! Thanks again for your excellent work!

I am running the pipeline on my data, which consists of 4 studies of myeloid cells from different labs. Their cell types were labeled using different methods. For example, one study labels all cells as "myeloid cells", one uses "TYPE1, TYPE2, TYPE3", and one uses "TAM1(PD-L1), TAM2".

I have removed the "scanvi" and "scgen" methods from my config since they use celltype. But I kept the original celltype in the obs, because otherwise it breaks in the embedding step. So can I still run the pipeline with my data?

I get errors like these:

1.

Traceback (most recent call last):
  File "scripts/integration/runIntegration.py", line 81, in <module>
    runIntegration(file, out, run, hvg, batch, celltype)
  File "scripts/integration/runIntegration.py", line 36, in runIntegration
    integrated = method(adata, batch)
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/scib/integration.py", line 317, in mnn
    **kwargs,
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/mnnpy/mnn.py", line 126, in mnn_correct
    svd_mode=svd_mode, do_concatenate=do_concatenate, **kwargs)
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/mnnpy/mnn.py", line 182, in mnn_correct
    new_batch_in, sigma)
IndexError: arrays used as indices must be of integer (or boolean) type

2.

Traceback (most recent call last):
  File "scripts/metrics/metrics.py", line 263, in <module>
    trajectory_=trajectory_
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/scib/metrics/metrics.py", line 340, in metrics
    verbose=False,
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/scib/metrics/silhouette.py", line 115, in silhouette_batch
    sil_means = sil_all.groupby("group").mean()
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1499, in mean
    numeric_only=numeric_only,
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1016, in _cython_agg_general
    how, alt=alt, numeric_only=numeric_only, min_count=min_count
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1121, in _cython_agg_blocks
    raise DataError("No numeric types to aggregate")
pandas.core.base.DataError: No numeric types to aggregate

And here is my config:

ROOT: /data/msun/01_integration
r_env : scib-R4.0
py_env : scib-pipeline-R4.0
timing: false

unintegrated_metrics: false

FEATURE_SELECTION:
  hvg: 2000
  full_feature: 0

SCALING:
  - unscaled
  - scaled

METHODS:
# python methods : bbknn, combat, desc, mnn, saucie, scanorama, scanvi, scgen, scvi, trvae, trvaep
  bbknn:
    output_type: knn
  combat:
    output_type: full
  desc:
    output_type: embed
  mnn:
    output_type: full
  saucie:
    output_type:
      - full
      - embed
  scanorama:
    output_type:
      - embed
      - full
  #scanvi:
  #  output_type: embed
  #  no_scale: true
  #  use_celltype: true
  #scgen:
  #  output_type: full
  #  use_celltype: true
  scvi:
    no_scale: true
    output_type: embed
  #trvae:
  #  no_scale: true
  #  output_type:
  #    - embed
  #    - full
  #trvaep:
  #  no_scale: true
  #  output_type:
  #    - embed
  #    - full
# R methods : conos, fastmnn, harmony, liger, seurat, seuratpca
  conos: 
    R: true
    output_type: knn
  fastmnn:
    R: true
    output_type:
      - embed
      - full
  harmony:
    R: true
    output_type: embed
  liger:
    no_scale: true
    R: true
    output_type: embed
  seurat:
    R: true
    output_type: full
  seuratrpca:
    R: true
    output_type: full

DATA_SCENARIOS:
  integrate_output:
    batch_key: batch # name of key on anndata.obs that annotates the batches
    label_key: celltype  # name of key on anndata.obs that annotates the cell identity labels
    organism: mouse
    assay: expression
    file: /data/msun/01_integration/ori_data/with_layers/pure_adatas.h5ad

Could you help me with it? Do these errors happen because of the celltype issue or something else? Is it necessary to relabel the cell types?

Thank you for your time!!!!

littleju714 commented 2 years ago

I know how to fix error 1 (see https://github.com/chriscainx/mnnpy/issues/30): I would need to pin numba=0.45.0 and llvmlite=0.30.0, but those versions may be incompatible with other packages. So I gave up on mnn.
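Before dropping mnn, it can help to see which versions are actually installed in the pipeline environment. The following is a minimal sketch, not part of scib-pipeline: the helper name `installed_versions` and the `PINNED` dictionary are hypothetical, and the pinned versions are simply the ones quoted above. On Python 3.7 it assumes the `importlib_metadata` backport is available.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport, assumed installed on Python 3.7

# Versions quoted in the comment above (hypothetical name for this sketch)
PINNED = {"numba": "0.45.0", "llvmlite": "0.30.0"}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, found in installed_versions(PINNED).items():
    status = "matches pin" if found == PINNED[name] else "differs from pin"
    print(f"{name}: installed={found}, pinned={PINNED[name]} ({status})")
```

Run it inside the activated Python environment (here `scib-pipeline-R4.0`) so that it reports the packages the pipeline really uses.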

littleju714 commented 2 years ago

I have updated metrics.py from scib on GitHub. Error 2 has now become:

Traceback (most recent call last):
  File "scripts/metrics/metrics.py", line 263, in <module>
    trajectory_=trajectory_
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/scib/metrics/metrics.py", line 340, in metrics
    verbose=False,
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/scib/metrics/silhouette.py", line 113, in silhouette_batch
    sil_df = pd.concat(sil_dfs).reset_index(drop=True)
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 295, in concat
    sort=sort,
  File "/data/msun/miniconda3/envs/scib-pipeline-R4.0/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

But I don't know how to fix it.

mumichae commented 2 years ago

Hi, it seems like you might not be getting any values for the batch silhouette (batch ASW) score. Could you check what the result of the metric is on the integrated output that is causing the error?

import scib

asw_batch = scib.me.silhouette_batch(
    adata_int,
    batch_key=batch_key,
    group_key=label_key,
    embed='X_emb',
    return_all=True,
    verbose=True,
)

If return_all is True, you will get a DataFrame instead of an overall metric. I'm guessing it is empty in your case.
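One possible reason for an empty result, stated here as an assumption rather than something confirmed in this thread, is that the batch silhouette is computed within each label group and a label that occurs in only one batch contributes no scores. With study-specific labels like the ones described above, every label would fall into that case. A minimal sketch to check this on the `obs` table, using toy data in place of the real `adata_int.obs` (the helper name `batches_per_label` is hypothetical):

```python
import pandas as pd

# Toy stand-in for adata_int.obs: every label is specific to one study (batch)
obs = pd.DataFrame(
    {
        "batch": ["study1", "study1", "study2", "study2", "study3", "study3"],
        "celltype": [
            "myeloid cells", "myeloid cells",
            "TYPE1", "TYPE2",
            "TAM1(PD-L1)", "TAM2",
        ],
    }
)


def batches_per_label(obs, batch_key, label_key):
    """Count how many distinct batches each label occurs in."""
    return obs.groupby(label_key)[batch_key].nunique()


counts = batches_per_label(obs, "batch", "celltype")
shared = counts[counts > 1]  # labels that appear in more than one batch

print(counts)
print("labels shared across batches:", list(shared.index))
```

On real data the call would be `batches_per_label(adata_int.obs, batch_key, label_key)`. If no label is shared across batches, harmonizing the labels across studies may be worth considering before relying on label-aware metrics.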

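If 'X_emb' is not available, try computing the PCA and using it instead:

import scanpy as sc

# compute the PCA; the result is stored in adata_int.obsm['X_pca']
sc.pp.pca(adata_int)

asw_batch = scib.me.silhouette_batch(
    adata_int,
    batch_key=batch_key,
    group_key=label_key,
    embed='X_pca',
    return_all=True,
    verbose=True,
)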

mumichae commented 2 years ago

I changed the code so that you get NaN if the dataframe is empty. Feel free to update scib and rerun the pipeline.
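For readers who cannot update right away, the idea of such a guard can be sketched as follows. This is an illustration only and not the actual scib change: the function name `mean_score_or_nan` and the column name `silhouette_score` are hypothetical, while the `group` column is taken from the traceback above.

```python
import numpy as np
import pandas as pd


def mean_score_or_nan(score_frames):
    """Average per-group scores; return NaN when no group produced any scores."""
    if len(score_frames) == 0:
        # pd.concat([]) would raise ValueError("No objects to concatenate")
        return np.nan
    scores = pd.concat(score_frames).reset_index(drop=True)
    # average within each group first, then across groups
    return scores.groupby("group")["silhouette_score"].mean().mean()


# No group produced scores: the guard returns NaN instead of raising
print(mean_score_or_nan([]))

# Two groups with scores
frames = [
    pd.DataFrame({"group": ["a", "a"], "silhouette_score": [0.2, 0.4]}),
    pd.DataFrame({"group": ["b"], "silhouette_score": [0.8]}),
]
print(mean_score_or_nan(frames))
```

Returning NaN keeps the rest of the metrics table intact, at the cost of leaving the batch ASW entry undefined for that run.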