openproblems-bio / openproblems-v2

Formalizing and benchmarking open problems in single-cell genomics
MIT License

[bat_int] cell cycle genes not in adata for cellxgene_census dataset #331

Open KaiWaldrant opened 8 months ago

KaiWaldrant commented 8 months ago

Describe the bug

The cell_cycle_conservation metric fails with datasets from the CELLxGENE Census:

ValueError: cell cycle genes not in adata
 organism: human
 varnames: ['ENSG00000105792', 'ENSG00000128253', 'ENSG00000015413', 'ENSG00000164402', 'ENSG00000246375', 'ENSG00000176402', 'ENSG00000022976', 'ENSG00000123191', 'ENSG00000198283', 'ENSG00000092020']

To Reproduce

https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/ma2LsRoQarR8Z

mumichae commented 8 months ago

Hi, this issue arises because the cell cycle genes used by the metric are provided as gene names, not Ensembl IDs, whereas CxG uses Ensembl IDs as the var names. I would suggest overwriting var_names with adata.var["feature_name"] during processing, if that column exists. Does that sound reasonable?

rcannood commented 8 months ago

I tend to prefer setting the var names to Ensembl IDs rather than gene names, because otherwise there are duplicate var names. WDYT?

mumichae commented 8 months ago

In general that makes sense, but for the cell cycle metric we would still need gene symbols. Would you prefer to rename the var_names only for the metric instead?