gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
358 stars 75 forks source link

transcripts with identical coordinates but different abundances #426

Open zjuezhen opened 2 months ago

zjuezhen commented 2 months ago

Hello,

Thanks so much for developing this tool, I find it quite easy to use.

I'm wondering what's causing duplicated transcripts (identical coordinates, different transcript id) to have different quantifications within the same sample? I see that the duplicated transcript is coming from using a reference gtf (human, GENCODE v37) that contains two transcripts with different transcript_id but the same genomic coordinate (chr1:154582076-154585043:154582076-154608204:154585217-154585865:154586181-154586363:154588125-154588258:154588551-154588673:154589369-154589462:154589757-154590409:154596805-154596995:154597123-154597267:154597828-154597976:154598402-154598585:154601041-154602626:154607992-154608204:-). Does this mean the reference gtf should be pre-filtered to contain only transcripts with unique genomic coordinates?

image image

Looking at the coverage for the region chr1:154582076-154608204 in IGV seemed to suggest that the higher quantification (1st screenshot) is correct: igv_snapshot_ACTTGA_chr1_154582076-154608204_neg

The command line call was:

stringtie \
-G ref_annot.gtf \
--ref ref_genome.fa \
-o STRG_transcripts.${sampleID}.gtf \
-A STRG_gene_abundances.${sampleID}.tab 
-B --rf -t -c 1 -f 0.01 -M 0.95 -p 50 -v ${bamdir}/${sampleID}.sortedByCoord.out.bam

Any advice on this is greatly appreciated!

Best regards, Jenny

gpertea commented 2 months ago

I find it surprising that those two transcripts have exactly the same exon coordinates and they are both in the same reference GTF, with different IDs like this. I guess we did not take that into account.

Not quite sure, but if structurally they are fully identical, StringTie may still be forced to split the coverage between the two, likely based on subtle differences at the terminal exons (first and last bases of the transcript as reflected in the read alignments, usually they do not fully match the annotation coordinates).

zjuezhen commented 2 months ago

Thank you for the quick response and explanation.

I checked in Ensembl and they do indeed have identical exons (http://dec2021.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000160710;r=1:154581695-154628013: transcripts ADAR-222 & ADAR-232). In the reference gtf I used, there are ~180 duplicated transcripts (~0.08% of all transcripts) corresponding to 90 transcripts with distinct coordinates; perhaps this is due to the gtf being the GENCODE comprehensive gene annotations, and this wouldn't be an issue if the GENCODE basic gene annotations were used. Regardless, I think I will pre-filter the reference GTF to remove these duplicated cases to make the transcript quantification results less ambiguous.

Another followup question: if the first and last bases of the transcript according to the read alignments do not fully match the annotation coordinates, I'm guessing this would be handled by the coverage trimming method; if after the coverage trimming, the first and last bases of the transcript still do not match the annotated coordinates, would a new transcript be reported instead of the annotated transcript?

gpertea commented 2 months ago

Coverage trimming can be imprecise sometimes, as there might not be an obvious "drop" in coverage, the coverage might just taper off instead. Or there could be multiple such "drops" - and in that case it is possible that we have a case of alternate TSS, so multiple transcripts with the same intron strucure may be reported in the same location - but in that case we should see a different start (TSS).

This ties into your last question - the way we assign the reference (annotated) transcript there is by the intron chain matching, not by full/exact exon match, because we never expect the transcription start and end coordinates (i.e. the start of the first exon and the end of the last exon) to match precisely the annotated transcript, even after some clear-cut "coverage trimming". Transcription start/end can be quite"fuzzy" that way, as there is also sequencing bias involved..

gpertea commented 2 months ago

Indeed, I am still puzzled by ADAR-222 & ADAR-232 being shown as 2 distinct transcripts in Ensembl, if they have the exact same exon coordinates, I cannot see anything that makes them worth two records in Ensembl or any transcript catalog, instead of a single one. They even claim:

Annotation Method Manual annotation (determined on a case-by-case basis) from the Havana project.

And you said that you found more like this.. I understand if they had different TSS/TES (with the same intron structure), but everything identical.. Cannot fathom the logic.