satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Pull transcript id instead of gene id #4193

Closed aelias1 closed 3 years ago

aelias1 commented 3 years ago

I want to look at alternative splicing. My GTF has both gene names and transcript names, which makes this possible. I was wondering if there's a way to pull this information per cluster using Seurat and conduct DE analysis using transcript IDs. Thank you!

samuel-marsh commented 3 years ago

Hi,

Not a member of the dev team, but hopefully I can be helpful. This is mainly not a Seurat question unless you already have isoform-level transcript counts from your alignment pipeline. If you do, then you may want to consider re-running your analysis using transcript counts instead of gene counts, because the results of your clustering may change compared to summing all isoforms together.

If you don't have the isoform count info, then it's going to depend on how the sequencing reads were aligned and how the counts for each gene were generated (pre-Seurat). Different pipelines give different outputs, and those will affect your ability to look at different transcript isoforms. For instance, 10X Genomics provides documentation on how counting is performed in Cell Ranger (select the appropriate version in the bottom right of that page).
As another note, the ability to detect different transcript isoforms also depends on the type of single-cell sequencing performed (full-length vs. 3' capture vs. 5' capture, and traditional short-read Illumina vs. long-read PacBio/ONT). This will further affect the ability of downstream aligners to detect different isoform transcripts accurately.

Best, Sam

gadepallivs commented 1 year ago

@samuel-marsh I have used kb-python to generate the TCC counts. The TCC matrix has transcript IDs as row names, and I'm having trouble annotating the clusters because of that. Mapping the transcript IDs to gene names results in duplicate row names. Is there an approach to annotate clusters when you have transcript IDs? I'm trying to look at the average expression of transcripts across different clusters. P.S.: This is a purely computational approach on 10x 3' data, with the caveat that it does not capture accurate transcript expression.
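
One possible way around the duplicate row names described above is to keep the transcript ID in every label and prepend the gene name, so each row stays unique while the gene is still visible for annotation. The snippet below is only a minimal sketch of that idea: the function `make_unique_labels`, the `tx2gene` dictionary, and the IDs `tx1` to `tx4` are hypothetical placeholders, not kb-python or Seurat output. A dash is used as the separator because Seurat rewrites underscores in feature names.

```python
from collections import Counter


def make_unique_labels(transcript_ids, tx2gene, sep="-"):
    """Build one row label per transcript as '<gene><sep><transcript>'.

    transcript_ids: row names of the matrix, in order (hypothetical input).
    tx2gene: dict mapping transcript ID -> gene name (hypothetical input).
    Transcripts without a known gene keep their bare transcript ID.
    """
    labels = []
    for tx in transcript_ids:
        gene = tx2gene.get(tx)
        labels.append(f"{gene}{sep}{tx}" if gene else tx)

    # Fail loudly if the input itself contained repeated transcript IDs.
    duplicates = [label for label, n in Counter(labels).items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate row labels: {duplicates}")
    return labels


# Hypothetical example: two isoforms of GENEA, one of GENEB, one unmapped.
transcript_ids = ["tx1", "tx2", "tx3", "tx4"]
tx2gene = {"tx1": "GENEA", "tx2": "GENEA", "tx3": "GENEB"}

labels = make_unique_labels(transcript_ids, tx2gene)
print(labels)
```

The resulting labels could then be assigned as the matrix row names before the object is created, so per-cluster averages are still computed per transcript while each row shows which gene it belongs to.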