Closed fbnrst closed 3 months ago
I think the smoothest way would be an additional column in the samplesheet called symbol_column (defaulting to the var.index). Then we can add a section to ADATA_UNIFY where we create a new column in var called gene_symbol, into which we put the content of symbol_column. Then we tell celltypist (and potentially other tools) to use this column.
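A minimal sketch of what that unification step could look like, operating on the var DataFrame of an AnnData object. The function name add_gene_symbol_column and the example column name gene_symbols are made up for illustration, and this is not the actual ADATA_UNIFY code.

```python
from typing import Optional

import pandas as pd


def add_gene_symbol_column(
    var: pd.DataFrame, symbol_column: Optional[str] = None
) -> pd.DataFrame:
    """Return a copy of `var` with a unified `gene_symbol` column.

    `symbol_column` mirrors the proposed samplesheet field; None means
    "use the index" (hypothetical default, as suggested in the comment).
    """
    var = var.copy()
    if symbol_column is None:
        # Default: assume the index (var_names) already holds gene symbols.
        var["gene_symbol"] = var.index.astype(str)
    elif symbol_column in var.columns:
        # Copy the user-specified column under the unified name.
        var["gene_symbol"] = var[symbol_column].astype(str)
    else:
        raise KeyError(f"Column '{symbol_column}' not found in var")
    return var


# Example: Ensembl IDs in the index, symbols in a separate column.
var = pd.DataFrame(
    {"gene_symbols": ["TP53", "GAPDH"]},
    index=["ENSG00000141510", "ENSG00000111640"],
)
unified = add_gene_symbol_column(var, "gene_symbols")
```

With an AnnData object this would be applied as adata.var = add_gene_symbol_column(adata.var, symbol_column), so every dataset ends up with the same gene_symbol column regardless of its original naming.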
It is important that the column has the same name in all datasets. Since it will be less comfortable for users to unify the column name themselves, we can just take the original name and create the unified column name in the pipeline.
If a sample does not contain any symbols, we can convert using mygene or something. I have a code snippet ready for that.
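A rough sketch of such a conversion, which is not the snippet mentioned above. It assumes that var.index holds Ensembl gene IDs and that the species is human. The helper names mygene_lookup and symbols_from_ids are hypothetical. The lookup is passed in as a function so the mapping step can be tried without network access.

```python
from typing import Callable, Dict, Iterable

import pandas as pd


def mygene_lookup(gene_ids: Iterable[str], species: str = "human") -> Dict[str, str]:
    """Query MyGene.info for symbols (needs the mygene package and network)."""
    import mygene  # imported lazily so the rest works without it

    mg = mygene.MyGeneInfo()
    hits = mg.querymany(
        list(gene_ids), scopes="ensembl.gene", fields="symbol", species=species
    )
    # Hits without a match carry no "symbol" key and are skipped here.
    return {hit["query"]: hit["symbol"] for hit in hits if "symbol" in hit}


def symbols_from_ids(
    var: pd.DataFrame,
    lookup: Callable[[Iterable[str]], Dict[str, str]] = mygene_lookup,
) -> pd.DataFrame:
    """Return a copy of `var` with `gene_symbol` filled from an ID lookup."""
    var = var.copy()
    mapping = lookup(var.index)
    # Assumption: keep the original ID wherever no symbol was found.
    var["gene_symbol"] = [mapping.get(gene_id, gene_id) for gene_id in var.index]
    return var
```

Calling symbols_from_ids(adata.var) would use MyGene.info. Passing lookup=lambda ids: {...} with a fixed dictionary is enough for a quick offline check.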
Description of the bug
I ran scdownstream on the h5ad output of nf-core/scrnaseq using cellranger. scdownstream fails on celltypist, giving this error:
ValueError: 🛑 No features overlap with the model. Please provide gene symbols
To track down what is going on, I loaded the input h5ad manually using scanpy. I could see that the AnnData object has gene IDs in var_names. But celltypist needs gene symbols.
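For anyone who wants to run the same check on their own file, here is a small sketch that estimates how many var_names look like Ensembl gene IDs rather than symbols. The regular expression and the helper name fraction_ensembl_ids are illustrative assumptions, not part of scdownstream or celltypist.

```python
import re
from typing import Iterable

# Matches e.g. ENSG00000141510 or ENSMUSG00000059552, with an optional version suffix.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def fraction_ensembl_ids(names: Iterable[str]) -> float:
    """Return the fraction of names that look like Ensembl gene IDs."""
    names = list(names)
    if not names:
        return 0.0
    matches = sum(bool(ENSEMBL_GENE_ID.match(str(name))) for name in names)
    return matches / len(names)
```

With scanpy, the names would come from sc.read_h5ad(path).var_names. A value close to 1.0 suggests the index holds IDs, and that the symbols (if present) live in a var column instead.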
I believe we need to make celltypist aware of where to look for gene symbols. How? We could specify a var column as a parameter in the samplesheet to select the column for gene symbols. Or we could implement an option to automatically convert gene IDs. Or the celltypist process could get an additional parameter to select the column in which to look for gene symbols. Not sure. I can try to provide an MWE; let me know if this is needed.
Command used and terminal output
No response
Relevant files
No response
System information
No response