nf-core / scdownstream

A single cell transcriptomics pipeline for QC, integration and making the data presentable
https://nf-co.re/scdownstream
MIT License

celltypist fails with value error #72

Closed fbnrst closed 3 months ago

fbnrst commented 3 months ago

Description of the bug

I ran scdownstream on the h5ad output of nf-core/scrnaseq using cellranger. scdownstream fails on celltypist giving this error:

ValueError: 🛑 No features overlap with the model. Please provide gene symbols

To track down what is going on, I loaded the input h5ad manually using scanpy. I could see that the AnnData object has gene IDs in var_names, with the gene symbols stored in a separate var column. But celltypist needs gene symbols.
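For anyone hitting the same error, a quick check of what var_names actually contain can be sketched as below. This is only a minimal sketch: the helper name `fraction_ensembl_ids` and the regular expression are my own assumptions about what Ensembl gene IDs look like, not something taken from the pipeline or from celltypist.

```python
import re

# Assumed pattern for Ensembl gene IDs, e.g. ENSG00000141510 (human) or
# ENSMUSG00000059552 (mouse), optionally followed by a version suffix like ".15".
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def fraction_ensembl_ids(names):
    """Return the fraction of names that look like Ensembl gene IDs."""
    names = list(names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_GENE_ID.match(str(name)))
    return hits / len(names)


# Hard-coded example; with a loaded object one would pass adata.var_names instead.
example_names = ["ENSG00000141510", "ENSG00000111640.15", "TP53"]
print(f"{fraction_ensembl_ids(example_names):.2f} of the names look like Ensembl IDs")
```

A value close to 1 for `fraction_ensembl_ids(adata.var_names)` would suggest that var_names hold IDs rather than symbols.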

I believe we need to make celltypist aware of where to look for gene symbols. How? We could specify a var column as a parameter in the samplesheet to select the column for gene symbols. Or we could implement an option to automatically convert gene IDs. Or the celltypist process could get an additional parameter to select the column in which to look for gene symbols. Not sure. I can try to provide an MWE; let me know if this is needed.

Command used and terminal output

No response

Relevant files

No response

System information

No response

nictru commented 3 months ago

I think the smoothest way would be an additional column in the samplesheet called symbol_column (defaulting to the var.index). Then we can add a section to ADATA_UNIFY where we can create a new column in var called gene_symbol where we put the content of symbol_column. Then we tell celltypist (and potentially other tools) to use this column.

It is important that the column has the same name in all datasets. Since it would be inconvenient for users to rename the column themselves, the pipeline can take the original column name from the samplesheet and create the unified column itself.
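Roughly, the unification step could look like the sketch below. It works on a plain pandas DataFrame standing in for `adata.var`; the function name `unify_symbol_column` is hypothetical, and the actual ADATA_UNIFY code may end up looking different.

```python
from typing import Optional

import pandas as pd

UNIFIED_COLUMN = "gene_symbol"  # unified column name proposed above


def unify_symbol_column(var: pd.DataFrame, symbol_column: Optional[str] = None) -> pd.DataFrame:
    """Return a copy of `var` with a `gene_symbol` column.

    If `symbol_column` is None, the index is used (the proposed default).
    """
    out = var.copy()
    if symbol_column is None:
        # Default: the index already holds whatever identifiers the dataset uses.
        out[UNIFIED_COLUMN] = [str(name) for name in out.index]
    elif symbol_column in out.columns:
        out[UNIFIED_COLUMN] = out[symbol_column].astype(str).to_numpy()
    else:
        raise KeyError(f"Column '{symbol_column}' not found in var")
    return out


# Toy stand-in for adata.var with Ensembl IDs as index and symbols in a column.
var = pd.DataFrame(
    {"gene_symbols": ["TP53", "GAPDH"]},
    index=["ENSG00000141510", "ENSG00000111640"],
)
print(unify_symbol_column(var, "gene_symbols"))
```

As far as I can tell, celltypist matches features against var_names, so the celltypist process would probably need to set var_names from this column on a copy of the object before annotating. That part is an assumption and not shown here.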

If a sample does not contain any symbols, we can convert the gene IDs to symbols using mygene or a similar tool. I have a code snippet ready for that.
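To illustrate the idea (this is not the snippet mentioned above, just a sketch), the mapping step could look like this. The helper `symbols_from_hits` is hypothetical, and the hit format is my assumption of what mygene's `querymany` returns; the hits are hard-coded here so the example runs without network access.

```python
from typing import Dict, Iterable, List


def symbols_from_hits(gene_ids: Iterable[str], hits: List[dict]) -> Dict[str, str]:
    """Map each gene ID to a symbol, keeping the ID itself if no symbol was found."""
    found: Dict[str, str] = {}
    for hit in hits:
        if "symbol" in hit:
            # Keep the first symbol if a query returns more than one hit.
            found.setdefault(hit["query"], hit["symbol"])
    return {gene_id: found.get(gene_id, gene_id) for gene_id in gene_ids}


gene_ids = ["ENSG00000141510", "ENSG00000000000"]

# With mygene installed, the hits would come from something like (assumed usage):
#   import mygene
#   hits = mygene.MyGeneInfo().querymany(
#       gene_ids, scopes="ensembl.gene", fields="symbol", species="human"
#   )
hits = [
    {"query": "ENSG00000141510", "symbol": "TP53"},
    {"query": "ENSG00000000000", "notfound": True},
]

print(symbols_from_hits(gene_ids, hits))
```

Falling back to the original ID for unmatched genes keeps the column complete, at the cost of mixing IDs and symbols in the same column.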