scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.87k stars 594 forks source link

Why making feature names unique instead of aggregation? #3241

Open VladimirShitov opened 1 week ago

VladimirShitov commented 1 week ago

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

Hi scanpy team! I have a rather conceptual question. Since the beginning of the single-cell analysis era, one of the standard steps in preprocessing is making the feature names unique (e.g. with adata.var_names_make_unique()) by adding suffixes to their names. It is recommended in the scanpy tutorial and in the best practices book. It is clear how identical feature names make the following data processing challenging, but why are we handling it this way? Wouldn't it make more sense to aggregate features with identical names, summing the counts? From the biological point of view, the same gene name means the same feature, so why split it into several features and corrupt their names?

flying-sheep commented 6 days ago

Good point! I think the idea might be that they’re actually different versions of a gene that just have been mapped to the same gene name (and have e.g. different ENSEMBL IDs)

But of course that depends on how the data was generated, and your method might be more appropriate for some datasets.

Example from the tutorial:

import pooch
import scanpy as sc

EXAMPLE_DATA = pooch.create(
    path=pooch.os_cache("scverse_tutorials"),
    base_url="doi:10.6084/m9.figshare.22716739.v1/",
)
EXAMPLE_DATA.load_registry_from_doi()

samples = {
    "s1d1": "s1d1_filtered_feature_bc_matrix.h5",
    "s1d3": "s1d3_filtered_feature_bc_matrix.h5",
}
adatas = {}

for sample_id, filename in samples.items():
    path = EXAMPLE_DATA.fetch(filename)
    sample_adata = sc.read_10x_h5(path)
    #sample_adata.var_names_make_unique()
    adatas[sample_id] = sample_adata

adatas["s1d1"].var[adatas["s1d1"].var.index.duplicated(keep=False)]
gene_ids feature_types genome pattern read sequence
TBCE ENSG00000285053 Gene Expression GRCh38
TBCE ENSG00000284770 Gene Expression GRCh38
LINC01238 ENSG00000237940 Gene Expression GRCh38
LINC01238 ENSG00000261186 Gene Expression GRCh38
CYB561D2 ENSG00000114395 Gene Expression GRCh38
CYB561D2 ENSG00000271858 Gene Expression GRCh38
MATR3 ENSG00000280987 Gene Expression GRCh38
MATR3 ENSG00000015479 Gene Expression GRCh38
LINC01505 ENSG00000234323 Gene Expression GRCh38
LINC01505 ENSG00000234229 Gene Expression GRCh38
HSPA14 ENSG00000284024 Gene Expression GRCh38
HSPA14 ENSG00000187522 Gene Expression GRCh38
GOLGA8M ENSG00000188626 Gene Expression GRCh38
GOLGA8M ENSG00000261480 Gene Expression GRCh38
GGT1 ENSG00000286070 Gene Expression GRCh38
GGT1 ENSG00000100031 Gene Expression GRCh38
ARMCX5-GPRASP2 ENSG00000271147 Gene Expression GRCh38
ARMCX5-GPRASP2 ENSG00000286237 Gene Expression GRCh38
TMSB15B ENSG00000158427 Gene Expression GRCh38
TMSB15B ENSG00000269226 Gene Expression GRCh38
VladimirShitov commented 5 days ago

Thanks for the example! Could we maybe expand the documentation, asking the users to think about this step and select an appropriate strategy? Also, what is the most efficient way to aggregate the features with the same names? Would be great to have a function for that in scanpy

ShreyParikh07 commented 5 days ago

Yes I would also be keen to know the most efficient way to aggregate the features

flying-sheep commented 5 days ago

Sure. I think the fastest way to get this would be if you do a PR!

dekan-aleksandr commented 6 hours ago

I checked the genes you showed. These are genes that have alternative transcription start sites and/or alternative transcriptions termination sites, though they significantly intersect. They are transcribed from the same region in the genome +/- N nucleotides. Also, these genes are named the same as gene, however they have different names. For example, one of the variants of MATR3 should is called ENSG00000280987, while other is called MATR3.

https://www.genecards.org/cgi-bin/carddisp.pl?gene=ENSG00000280987 https://www.genecards.org/cgi-bin/carddisp.pl?gene=MATR3&keywords=ENSG00000015479

So this is a problem of the genome annotation, and these are separate genes that should not be aggregated. Moreover, the more unique features you have, even if you don't understand them at the moment, the better.

If you check other genes, you will see that they are different. For example, one variant of GOLGA8M is protein coding, while another one is a lncRNA.

It will be good to use their IDs instead of gene symbols. Otherwise it is impossible to separate one from another.

We can also discuss how good transcript mapping is.

VladimirShitov commented 5 hours ago

Human Lung Cell Atlas authors also recommended using Ensembl IDs. Probably, for the related reasons. I agree that aggregation might not be the best default approach, but maybe we should explain the problem in the documentation. The genes you mentioned can be a beautiful example.

@ShreyParikh07 , related to your recent problem. Maybe you don't need aggregation as well?