jonassibbesen / rpvg

Method for inferring path posterior probabilities and abundances from pangenome graph read alignments
MIT License

Haplotype transcript length inconsistent with GTF #56

Closed: dongdongdong0203 closed this issue 1 year ago

dongdongdong0203 commented 1 year ago

Hi, developer

I recently ran into a problem. I want to compare the expression of the same transcript between a pantranscriptomic analysis (vg mpmap + rpvg) and a conventional transcriptomic analysis (STAR + StringTie2), but I found a large difference in expression.

While looking for the cause, I found that the transcript length reported for the haplotype-specific transcripts differs from the length in the reference GTF file. What is the reason for this?

Here is an example for one transcript.

Transcript "ENSSSCT00000055142" in the txorigin.tsv file generated by vg autoindex:

Name    Length  Transcripts     Haplotypes
ENSSSCT00000055142_R1   4165    ENSSSCT00000055142      7,20D006155_20D006155#1#7,20D006157_20D006157#0#7
ENSSSCT00000055142_H1   4165    ENSSSCT00000055142      20D006132_20D006132#0#7,20D006132_20D006132#1#7,20D006133_
ENSSSCT00000055142_H2   4165    ENSSSCT00000055142      20D006145_20D006145#1#7,20D006187_20D006187#0#7,20D006719_
ENSSSCT00000055142_H3   4165    ENSSSCT00000055142      20D006184_20D006184#0#7,20D006256_20D006256#0#7,20D006256_
ENSSSCT00000055142_H4   4165    ENSSSCT00000055142      20D006233_20D006233#1#7,20D006251_20D006251#0#7,20D006253_
ENSSSCT00000055142_H5   4165    ENSSSCT00000055142      20D006301_20D006301#0#7,20D006659_20D006659#1#7,20D006662_
ENSSSCT00000055142_H6   4165    ENSSSCT00000055142      20D006357_20D006357#1#7,20D006758_20D006758#0#7,20D007732_
ENSSSCT00000055142_H7   4165    ENSSSCT00000055142      20D006505_20D006505#1#7,20D006919_20D006919#1#7,20D007213_
ENSSSCT00000055142_H8   4165    ENSSSCT00000055142      20D006575_20D006575#0#7,20D007488_20D007488#1#7,
ENSSSCT00000055142_H9   4165    ENSSSCT00000055142      20D006641_20D006641#0#7
ENSSSCT00000055142_H10  4165    ENSSSCT00000055142      20D006684_20D006684#0#7,20D007281_20D007281#1#7,20D007438_
ENSSSCT00000055142_H11  4165    ENSSSCT00000055142      20D006958_20D006958#0#7
ENSSSCT00000055142_H12  4165    ENSSSCT00000055142      20D007579_20D007579#1#7

rpvg output for transcript ENSSSCT00000055142 in one individual:

Name    ClusterID       Length  EffectiveLength ReadCount       TPM
ENSSSCT00000055142      72      4165    3913.4432       561.9941        780.99766

However, the expression of this transcript estimated by StringTie2 for the same individual was far from the rpvg estimate, and the transcript length was inconsistent.

7   StringTie   transcript  9971209 9983829 1000    +   .   gene_id "ENSSSCG00000040816"; transcript_id "ENSSSCT00000055142"; ref_gene_name "NOL7"; cov "109.753166"; FPKM "18.930826"; TPM "18.137791";

Looking forward to your answers and replies.

Thanks

jeizenga commented 1 year ago

Can you share the corresponding portion of the reference GTF? My suspicion is that the discrepancy in length is actually because the StringTie GTF record you are showing includes the introns and the rpvg output does not.

dongdongdong0203 commented 1 year ago

Thank you for your prompt reply. You are right: we compared several transcripts and found that the StringTie2 transcript records include the introns, while rpvg counts only the exons.
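For anyone checking this on their own data, here is a minimal sketch of the comparison. The `transcript` record above spans 9971209 to 9983829, which is 12,621 bp of genomic sequence including introns, while rpvg reports 4165. The helper name `transcript_lengths` and the toy transcript `TX1` are made up for illustration and are not part of rpvg, vg, or StringTie. The sketch assumes a tab-separated GTF whose `exon` records carry a `transcript_id` attribute and whose exons do not overlap within a transcript.

```python
import re
from collections import defaultdict


def transcript_lengths(gtf_lines):
    """Return {transcript_id: (genomic_span, exonic_length)} from GTF lines.

    genomic_span  : first exon start to last exon end, introns included
    exonic_length : sum of exon lengths only (the spliced transcript length)
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are needed for both lengths
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive on both ends.
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for tid, intervals in exons.items():
        span = max(e for _, e in intervals) - min(s for s, _ in intervals) + 1
        exonic = sum(e - s + 1 for s, e in intervals)
        lengths[tid] = (span, exonic)
    return lengths


# Toy input (hypothetical): two exons of 100 bp and 50 bp with a 100 bp intron.
toy_gtf = [
    '7\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";',
    '7\ttoy\texon\t300\t349\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";',
]

for tid, (span, exonic) in transcript_lengths(toy_gtf).items():
    print(f"{tid}\tspan={span}\texonic={exonic}")
```

Run on the reference GTF, the exonic length is the number to compare against the `Length` column in the rpvg output; the genomic span is what the start and end coordinates of a `transcript` record imply.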

jeizenga commented 1 year ago

Great, glad we were able to clear that up for you.