Closed nictru closed 3 months ago
@fbnrst, is this approximately what you imagined? I didn't test it yet, but if you want you can give it a go and let me know if something needs to be changed
nf-core lint
overall result: Passed :white_check_mark: :warning:Posted for pipeline commit dcedb58
+| ✅ 207 tests passed |+
!| ❗ 13 tests had warnings |!
This is how I imagined it, and I am running a test right now. But your points are very valid and it is not clear to me how duplicate var_names
should be handled. I checked celltypist a little, and their approach seems to be that you can convert the models using a mapping file that maps from IDs to symbols, see e.g. https://github.com/Teichlab/celltypist/issues/87
They also provide some mappings from Ensembl to gene symbols: https://github.com/Teichlab/celltypist/tree/main/celltypist/data/samples
There, they also provide files for other species and of course this is another relevant point, celltypist was developed having human in mind.
So: What about using the mapping file instead of providing gene symbols column? Also, after realising the species issue, it might be good to be able to skip cell typing as well, sometimes, without a good reference, this might not make sense.
Celltypist is already optional, it will only be executed if a celltypist model is provided as a parameter
My test run failed because the gene_symbol
column was not available. I checked the combine/merge/merged_outer.h5ad
in the outputs, which I believe is read by celltypist. I think the concatenation needs to be changed like this by adding a merge argument:
adata_outer = ad.concat(adatas, join="outer", index_unique="-", merge="unique")
I am currently running a test on this again.
But I realised another thing: When doing an outer merge 0s are filled in the count matrix (see the warning under notes here: https://anndata.readthedocs.io/en/latest/generated/anndata.concat.html). I think this is not generally correct, because missing data for that gene does not mean it was not expressed. I wonder whether it would be better to run celltypist on the inner join, or alternativly, to run celltypist on the individual samples before merging.
I created #74 to fix some little things, with those, I managed to run the pipeline using gene symbols for cell typist
When doing an outer merge 0s are filled in the count matrix (see the warning under notes here: https://anndata.readthedocs.io/en/latest/generated/anndata.concat.html). I think this is not generally correct, because missing data for that gene does not mean it was not expressed.
True. The outer join is generally done so that people don't wonder why their favorite genes are not present in the output object. Filling with NaN
would hugely increase dataset size, as it would need to be explicitly stored in sparse matrices.
Running celltypist per dataset before merging would a good solution I think
This PR adds documentation and a basic implementation of this feature. Some questions need to be cleared before merging:
gene_symbol
column survive AnnData concatenation?gene_symbol
column but not in thevar_names
?