gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

duplicate transcript output with -L -e #357

Open AmrSaadeldin opened 2 years ago

AmrSaadeldin commented 2 years ago

Hi,

I am using the latest version (2.2) of StringTie, and in expression estimation mode (-e) I lost ~1k transcripts compared to the de novo assembled GTF file. Also, the number of transcripts differs between the -e output files, even though I used the same -G GTF file!

I used the same files with an older version of StringTie and didn't face this problem!

AmrSaadeldin commented 2 years ago

I updated StringTie to v2.2.1, but now I get new isoforms in -e mode.

gpertea commented 2 years ago

Indeed, v2.2.0 had a bug where not all the input transcripts were shown, as you observed. v2.2.1 should have fixed that, but you now report that new isoforms are produced when the -e option is used, which should not happen.

Can you please provide more information about your findings? Are you getting this with long-read data (-L option) or hybrid data (--mix)? An example dataset to reproduce the issue would be greatly appreciated.

gpertea commented 2 years ago

Thank you for sharing the example data -- apparently with -L -e sometimes StringTie writes out multiple abundance estimates for the same transcript ID.
[EDIT: the outputs are not duplicated, but rather independent "predictions"]

gpertea commented 2 years ago

Debugging note: in one such case, with a duplicate single-exon transcript estimate, print_predcluster() receives 3 instances of the same transcript ID in the pred list. Two of them have 0 coverage and different exon coordinates (a shorter exon, contained within the real input exon); the third is the real one, with non-zero coverage and the correct exon coordinates.
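To check whether a given -L -e output file is affected, one option is to group the transcript lines of the output GTF by transcript ID and list every ID that occurs more than once, together with its coordinates and coverage. The following is only a minimal sketch, not part of StringTie: the function names are made up for illustration, and it assumes the usual 9-column GTF layout with `transcript_id "..."` and `cov "..."` attributes on the transcript lines.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_transcript_records(gtf_lines):
    """Yield (transcript_id, start, end, cov) for each 'transcript' line.

    Assumes tab-separated GTF with attributes in column 9.
    cov is None when the attribute is missing.
    """
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        cov = float(attrs["cov"]) if "cov" in attrs else None
        yield tid, int(fields[3]), int(fields[4]), cov


def find_duplicate_transcripts(gtf_lines):
    """Return {transcript_id: [(start, end, cov), ...]} for IDs seen more than once."""
    groups = defaultdict(list)
    for tid, start, end, cov in parse_transcript_records(gtf_lines):
        groups[tid].append((start, end, cov))
    return {tid: recs for tid, recs in groups.items() if len(recs) > 1}
```

Passing an open file handle for the output GTF to `find_duplicate_transcripts()` returns the repeated IDs. In a case like the one described above, the affected ID would be expected to show three records, two with a coverage of 0 and coordinates nested inside the third.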

niradsp commented 2 years ago

Hello, is the duplicate transcript ID bug the reason that some of the samples show an abundance value of zero? Please see below. This is not a single-exon transcript.

 [1] 2500.7836 1425.0903    0.0000 2431.1512  480.4571    0.0000  933.6469    0.0000
 [9]    0.0000    0.0000  813.9988 3317.5881  762.0982  778.2828  682.3964    0.0000
[17] 1742.9656  306.2472  654.1184    0.0000  434.4180    0.0000    0.0000  612.1603
[25]    0.0000  350.2606  325.1392    0.0000 1381.3899 1218.0039    0.0000 1082.9444
[33]    0.0000 1149.3413    0.0000    0.0000 1750.8320    0.0000 1201.1724 2905.0471
[41]    0.0000  833.6149

This is from tximport, but prepDE.py3 shows the same pattern: some samples have a value of zero, even when the treatment is the same. This was run with the -L and -e options.
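One way to test this hypothesis on the per-sample GTF files is to look at a single transcript ID across samples and see whether the samples that report zero are the ones where that ID occurs more than once with both zero and non-zero coverage. The sketch below makes several assumptions: `coverage_records` and `classify_samples` are hypothetical helper names, the input is the per-sample StringTie GTF with `cov "..."` on transcript lines, and it says nothing about which record tximport or prepDE.py3 actually picks. It only shows whether the precondition for the duplicate problem is present.

```python
import re

COV_RE = re.compile(r'cov "([^"]*)"')


def coverage_records(gtf_lines, transcript_id):
    """Return the cov value of every 'transcript' line carrying transcript_id."""
    needle = f'transcript_id "{transcript_id}"'
    covs = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        if needle not in fields[8]:
            continue
        match = COV_RE.search(fields[8])
        if match:
            covs.append(float(match.group(1)))
    return covs


def classify_samples(gtf_by_sample, transcript_id):
    """Map each sample name to (status, covs) for one transcript ID.

    status is one of:
      'absent'     - the ID has no transcript line in that sample
      'single'     - exactly one record, as expected with -e
      'mixed'      - several records, some with zero and some with non-zero cov
      'duplicated' - several records, all zero or all non-zero
    """
    report = {}
    for sample, lines in gtf_by_sample.items():
        covs = coverage_records(lines, transcript_id)
        if not covs:
            status = "absent"
        elif len(covs) == 1:
            status = "single"
        elif any(c == 0 for c in covs) and any(c > 0 for c in covs):
            status = "mixed"
        else:
            status = "duplicated"
        report[sample] = (status, covs)
    return report
```

If the zero-valued samples consistently come out as "mixed" while the others are "single", that would point to the duplicate records. Zeros in "single" samples would need a different explanation.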