Streamline features produces wrong gene symbols

Hrovatin commented 2 years ago

Describe the bug The streamline features of DatasetInteractive produces incorrect and non-unique symbols. To Reproduce My adata:

a
View of AnnData object with n_obs × n_vars = 100 × 2
    obs: 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'bio_sample', 'cell_line', 'cell_type', 'development_stage', 'disease', 'ethnicity', 'id', 'individual', 'organ', 'organism', 'sex', 'state_exact', 'sample_source', 'tech_sample', 'assay_sc_ontology_term_id', 'cell_line_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'ethnicity_ontology_term_id', 'organ_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id'
    var: 'ensembl', 'gene_symbol'
    uns: 'annotated', 'author', 'default_embedding', 'doi_journal', 'doi_preprint', 'download_url_data', 'download_url_meta', 'normalization', 'primary_data', 'title', 'year', 'load_raw', 'mapped_features', 'remove_gene_version', 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'cell_line', 'cell_type', 'development_stage', 'disease', 'ethnicity', 'id', 'organ', 'organism', 'sex', 'state_exact', 'sample_source'

Var:
a.var
                         ensembl gene_symbol
ENSG00000000003  ENSG00000000003      TSPAN6
ENSG00000000005  ENSG00000000005        TNMD

Make interactive dataset and streamline features:

data = DatasetInteractive(a, gene_ens_col='index')
data.organism = 'Homo sapiens'
data.streamline_features(match_to_release='104')

/Users/karin.hrovatin/opt/miniconda3/envs/flux/lib/python3.8/site-packages/scipy/sparse/_index.py:116: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray_sparse(i, j, x)
Variable names are not unique. To make them unique, call `.var_names_make_unique`.

data.adata.var
                    index
TSPAN6    ENSG00000000003
TNMD      ENSG00000000005
DPM1      ENSG00000000419
SCYL3     ENSG00000000457
C1orf112  ENSG00000000460
                   ...
havana    ENSG00000288721
F8A1      ENSG00000288722
havana    ENSG00000288723
havana    ENSG00000288724
havana    ENSG00000288725

data.adata.var.index.value_counts().head(50)

havana            17260
havana_tagene      1926
Y_RNA               758
ensembl             675
Metazoa_SRP         169
ensembl_havana      168
U3                   50
U6                   37
SNORA70              27
mirbase              25
U8                   22
U2                   21
5S_rRNA              10
5_8S_rRNA             9
SNORA72               8
SNORA62               7
7SK                   7
SNORA75               7
SNORA63               7
U7                    7
U4                    7
U1                    6
SNORA73               4
Vault                 4
SNORD39               4
SNORA74               4
SNORD81               3
SNORA71               3
SNORD63               3
SNORD116              3
SNORD27               2
LINC01605             2
SNORD115              2
ELFN2                 2
SNORD38B              2
DGCR5                 2
LINC00486             2
GOLGA8M               2
SNORD3D               2
SCARNA4               2
SNORD22               2
ANAPC1P2              2
SNORA50A              2
GABARAPL3             2
SNORA16A              2
PDE11A                2
SFTA3                 2
LINC00484             2
TMSB15B               2
LINC01115             2
dtype: int64

Expected behavior

I am sure some of the symbols are wrong - not gene names.

Also, may be problematic that symbols are repeated for some downstream applications? - Maybe worth adding var make unique within streamline or rather using EID matching in general?

System [please complete the following information]: sfaira recent dev (from 27. 1. 2022)

davidsebfischer commented 2 years ago

Thanks for the issue! The example is essentially replicated by this unit test https://github.com/theislab/sfaira/blob/803aa981a8b1d89e543816ac4016d614f57ff05a/sfaira/unit_tests/tests_by_submodule/data/dataset/test_dataset.py#L128.

I am sure some of the symbols are wrong - not gene names.

This "havanna" labels appear in our internal processed gtf files for release "104", I'll look into where this is coming from. Until then, "102" seems save to use as it does not suffer from this issue (so switch to "102" is the quick fix for now if that works for you). I could imagine that this affects a particular biotype of non-protein-coding genes but will update here.

Also, may be problematic that symbols are repeated for some downstream applications? - Maybe worth adding var make unique within streamline or rather using EID matching in general?

If possible, downstream applications should operate on ENSG IDs if they require unique gene names, I think that would be a bioinformatics best practice. If symbols are required, I would recommend collapsing the matrix by symbols, this makes more sense to me then making IDs unique as this removes the redundancy in a sensible matter, but happy to hear opinions:

from sfaira.data.utils import collapse_matrix

# [...]
data.streamline_features(match_to_release='104')
adata = collapse_matrix(adata=data.adata, var_column="gene_symbol")

davidsebfischer commented 2 years ago

The weird gene symbols above, esp. "havana" were genes without symbol, I fixed these to receive the ID as a symbol in our interface (https://github.com/theislab/sfaira/issues/482). The code is pushed to dev, you will need to delete your cache ~./cache/sfaira for this change to take effect.

Hrovatin commented 2 years ago

Could you also add to documentation of DatasetInteractive that gene_symbol_col takes precedence over gene_ens_col? i was able to keep ens ids instead of symbols only after having set gene_esymbol_col to None.

theislab / sfaira

Streamline features produces wrong gene symbols #482