Open Hrovatin opened 2 years ago
Thanks for the issue! The example is essentially replicated by this unit test https://github.com/theislab/sfaira/blob/803aa981a8b1d89e543816ac4016d614f57ff05a/sfaira/unit_tests/tests_by_submodule/data/dataset/test_dataset.py#L128.
I am sure some of the symbols are wrong - not gene names.
These "havana" labels appear in our internal processed GTF files for release "104"; I'll look into where they are coming from. Until then, "102" seems safe to use, as it does not suffer from this issue (so switching to "102" is the quick fix for now, if that works for you). I could imagine that this affects a particular biotype of non-protein-coding genes, but I will update here.
Also, repeated symbols may be problematic for some downstream applications. Maybe it is worth adding var make unique within streamline, or rather using EID matching in general?
If possible, downstream applications should operate on ENSG IDs if they require unique gene names; I think that would be a bioinformatics best practice. If symbols are required, I would recommend collapsing the matrix by symbols. This makes more sense to me than making IDs unique, as it removes the redundancy in a sensible manner, but happy to hear opinions:
from sfaira.data.utils import collapse_matrix
# [...]
data.streamline_features(match_to_release='104')
adata = collapse_matrix(adata=data.adata, var_column="gene_symbol")
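To illustrate what collapsing by symbol means, here is a minimal pure-Python sketch. It assumes that columns sharing a symbol are summed; the function name `collapse_by_symbol` and the toy counts are hypothetical and are not taken from sfaira, whose `collapse_matrix` may differ in detail.

```python
from collections import defaultdict


def collapse_by_symbol(counts, symbols):
    """Sum the columns of a cells-by-genes count matrix that share a symbol.

    counts: list of rows (one per cell), each a list of per-gene counts.
    symbols: gene symbol of each column, possibly with repeats.
    Returns (collapsed_counts, unique_symbols), symbols in first-seen order.
    """
    # Map each symbol to all column indices that carry it.
    columns = defaultdict(list)
    for idx, symbol in enumerate(symbols):
        columns[symbol].append(idx)

    unique_symbols = list(columns)  # dicts preserve insertion order
    collapsed = [
        [sum(row[i] for i in columns[symbol]) for symbol in unique_symbols]
        for row in counts
    ]
    return collapsed, unique_symbols


# Toy example (made-up counts): the symbol "Gcg" appears in two columns.
counts = [[1, 2, 3, 4], [0, 5, 0, 1]]
symbols = ["Gcg", "Ins1", "Gcg", "Ppy"]
collapsed, unique_symbols = collapse_by_symbol(counts, symbols)
```

Unlike appending suffixes to make names unique, this keeps one column per symbol and retains all counts, at the cost of no longer distinguishing the underlying ENSG IDs.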
The weird gene symbols above, esp. "havana", were genes without a symbol; I fixed these to receive the ID as a symbol in our interface (https://github.com/theislab/sfaira/issues/482). The code is pushed to dev; you will need to delete your cache ~/.cache/sfaira for this change to take effect.
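The fallback described above can be sketched roughly as follows. This is only an illustration of the idea, not the code that was pushed to dev: the function name `fill_missing_symbols`, the set of placeholder values, and the example IDs are all hypothetical.

```python
def fill_missing_symbols(gene_ids, symbols, placeholders=(None, "", "havana")):
    """Use the gene ID as the symbol wherever no real symbol is available.

    gene_ids: one ID per gene (e.g. Ensembl gene IDs).
    symbols: one symbol per gene; entries found in `placeholders` are
        treated as missing (which values count as missing is an assumption).
    """
    return [
        gene_id if symbol in placeholders else symbol
        for gene_id, symbol in zip(gene_ids, symbols)
    ]


# Made-up IDs and symbols, for illustration only.
gene_ids = ["ENSG_EXAMPLE_1", "ENSG_EXAMPLE_2", "ENSG_EXAMPLE_3"]
symbols = ["GeneA", "havana", None]
filled = fill_missing_symbols(gene_ids, symbols)
```

Because IDs are unique, genes that previously shared a placeholder label no longer collide after this step.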
Could you also add to the documentation of DatasetInteractive that gene_symbol_col takes precedence over gene_ens_col? I was able to keep Ensembl IDs instead of symbols only after setting gene_symbol_col to None.
Describe the bug
The streamline_features method of DatasetInteractive produces incorrect and non-unique symbols.
To Reproduce
My adata:
Make interactive dataset and streamline features:
Expected behavior
I am sure some of the symbols are wrong - not gene names.
Also, repeated symbols may be problematic for some downstream applications. Maybe it is worth adding var make unique within streamline, or rather using EID matching in general?
System [please complete the following information]: sfaira recent dev (from 27. 1. 2022)