Open VladimirShitov opened 1 week ago
Good point! I think the idea might be that they’re actually different versions of a gene that just have been mapped to the same gene name (and have e.g. different ENSEMBL IDs)
But of course that depends on how the data was generated, and your method might be more appropriate for some datasets.
Example from the tutorial:
import pooch
import scanpy as sc
EXAMPLE_DATA = pooch.create(
path=pooch.os_cache("scverse_tutorials"),
base_url="doi:10.6084/m9.figshare.22716739.v1/",
)
EXAMPLE_DATA.load_registry_from_doi()
samples = {
"s1d1": "s1d1_filtered_feature_bc_matrix.h5",
"s1d3": "s1d3_filtered_feature_bc_matrix.h5",
}
adatas = {}
for sample_id, filename in samples.items():
path = EXAMPLE_DATA.fetch(filename)
sample_adata = sc.read_10x_h5(path)
#sample_adata.var_names_make_unique()
adatas[sample_id] = sample_adata
adatas["s1d1"].var[adatas["s1d1"].var.index.duplicated(keep=False)]
gene_ids feature_types genome pattern read sequence TBCE ENSG00000285053 Gene Expression GRCh38 TBCE ENSG00000284770 Gene Expression GRCh38 LINC01238 ENSG00000237940 Gene Expression GRCh38 LINC01238 ENSG00000261186 Gene Expression GRCh38 CYB561D2 ENSG00000114395 Gene Expression GRCh38 CYB561D2 ENSG00000271858 Gene Expression GRCh38 MATR3 ENSG00000280987 Gene Expression GRCh38 MATR3 ENSG00000015479 Gene Expression GRCh38 LINC01505 ENSG00000234323 Gene Expression GRCh38 LINC01505 ENSG00000234229 Gene Expression GRCh38 HSPA14 ENSG00000284024 Gene Expression GRCh38 HSPA14 ENSG00000187522 Gene Expression GRCh38 GOLGA8M ENSG00000188626 Gene Expression GRCh38 GOLGA8M ENSG00000261480 Gene Expression GRCh38 GGT1 ENSG00000286070 Gene Expression GRCh38 GGT1 ENSG00000100031 Gene Expression GRCh38 ARMCX5-GPRASP2 ENSG00000271147 Gene Expression GRCh38 ARMCX5-GPRASP2 ENSG00000286237 Gene Expression GRCh38 TMSB15B ENSG00000158427 Gene Expression GRCh38 TMSB15B ENSG00000269226 Gene Expression GRCh38
Thanks for the example! Could we maybe expand the documentation, asking the users to think about this step and select an appropriate strategy? Also, what is the most efficient way to aggregate the features with the same names? Would be great to have a function for that in scanpy
Yes I would also be keen to know the most efficient way to aggregate the features
Sure. I think the fastest way to get this would be if you do a PR!
I checked the genes you showed. These are genes that have alternative transcription start sites and/or alternative transcriptions termination sites, though they significantly intersect. They are transcribed from the same region in the genome +/- N nucleotides. Also, these genes are named the same as gene, however they have different names. For example, one of the variants of MATR3 should is called ENSG00000280987, while other is called MATR3.
https://www.genecards.org/cgi-bin/carddisp.pl?gene=ENSG00000280987 https://www.genecards.org/cgi-bin/carddisp.pl?gene=MATR3&keywords=ENSG00000015479
So this is a problem of the genome annotation, and these are separate genes that should not be aggregated. Moreover, the more unique features you have, even if you don't understand them at the moment, the better.
If you check other genes, you will see that they are different. For example, one variant of GOLGA8M is protein coding, while another one is a lncRNA.
It will be good to use their IDs instead of gene symbols. Otherwise it is impossible to separate one from another.
We can also discuss how good transcript mapping is.
Human Lung Cell Atlas authors also recommended using Ensembl IDs. Probably, for the related reasons. I agree that aggregation might not be the best default approach, but maybe we should explain the problem in the documentation. The genes you mentioned can be a beautiful example.
@ShreyParikh07 , related to your recent problem. Maybe you don't need aggregation as well?
What kind of feature would you like to request?
Additional function parameters / changed functionality / changed defaults?
Please describe your wishes
Hi scanpy team! I have a rather conceptual question. Since the beginning of the single-cell analysis era, one of the standard steps in preprocessing is making the feature names unique (e.g. with
adata.var_names_make_unique()
) by adding suffixes to their names. It is recommended in the scanpy tutorial and in the best practices book. It is clear how identical feature names make the following data processing challenging, but why are we handling it this way? Wouldn't it make more sense to aggregate features with identical names, summing the counts? From the biological point of view, the same gene name means the same feature, so why split it into several features and corrupt their names?