gpertea / stringtie

Transcript assembly and quantification for RNA-Seq

Inconsistent number of records in the -A output #250

Open dbrg77 opened 4 years ago

dbrg77 commented 4 years ago

Hello,

I'm using stringtie (v2.0) to estimate gene abundance, and we are only looking at the reference genes. We downloaded the gencode vM22 gtf from the UCSC table browser and ran stringtie like this:

stringtie -p 12 -l sample_name \
-G ~/reference/mus_musculus/ucsc/mm10/gtf/mm10_gencode_vM22_basic_formatted.gtf \
-eB -o output_gtf.gtf -A output_expr.tsv input.bam

If I understand correctly, the -e option only estimates abundance for genes in the reference gtf provided by -G. Therefore, I expected the same number of gene records in every sample.

However, these are the numbers of records that I get from different samples with the same parameters on the same reference gtf:

$ wc -l */*.tsv
   41955 Mg1/Mg1_expr_table.tsv
   41955 Mg2/Mg2_expr_table.tsv
   41950 Mg3/Mg3_expr_table.tsv
   41949 WT1/WT1_expr_table.tsv
   41949 WT2/WT2_expr_table.tsv
   41946 WT3/WT3_expr_table.tsv

An example is the gene Snhg14. It has four records in the Mg1 sample:

Snhg14  -   chr7    -   59619158    59904686    0.000000    0.000000    0.000000
Snhg14  -   chr7    -   59307924    59324149    0.384054    0.079999    0.064409
Snhg14  -   chr7    -   59937371    59975759    56.918434   31.103039   25.041645
Snhg14  .   chr7    -   59445703    59453239    0.0 0.0 0.0

But it only has 3 records in the WT3 sample:

Snhg14  -   chr7    -   59307924    59324149    0.316693    0.062106    0.074469
Snhg14  -   chr7    -   59937371    59975759    29.174150   13.995778   16.781746
Snhg14  .   chr7    -   59445703    59904686    0.0 0.0 0.0
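One way to list exactly which records differ between two samples is to compare the -A tables by gene and coordinates. The following is a minimal sketch, not part of stringtie. It assumes the tables are tab-delimited with the column order seen in the records above (gene ID, gene name, reference, strand, start, end, coverage, FPKM, TPM); the function names are hypothetical.

```python
def record_keys(lines):
    """Collect (gene_id, reference, strand, start, end) for each data row.

    Assumption: tab-delimited rows with gene ID in column 1, reference in
    column 3, strand in column 4, and start/end in columns 5 and 6.
    """
    keys = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or short lines and any header row (non-numeric start).
        if len(fields) < 6 or not fields[4].isdigit():
            continue
        keys.add((fields[0], fields[2], fields[3], int(fields[4]), int(fields[5])))
    return keys


def diff_records(lines_a, lines_b):
    """Return (records only in A, records only in B), each sorted."""
    keys_a, keys_b = record_keys(lines_a), record_keys(lines_b)
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)
```

For example, `diff_records(open("Mg1/Mg1_expr_table.tsv"), open("WT3/WT3_expr_table.tsv"))` would return the records present in only one of the two tables.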

In addition, since the -A option is supposed to provide gene abundance estimates, shouldn't the Snhg14 records be merged into a single record?
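Until that is clarified, the records can be collapsed to one row per gene as a post-processing step. The sketch below sums FPKM and TPM per gene ID. It is a workaround under stated assumptions, not stringtie's own behavior: it assumes tab-delimited rows with FPKM and TPM in columns 8 and 9, and it leaves coverage out because per-base coverage is not additive across loci. The function name is hypothetical.

```python
from collections import defaultdict


def collapse_by_gene(lines):
    """Sum FPKM and TPM over all records sharing a gene ID.

    Assumption: tab-delimited rows with gene ID in column 1, start in
    column 5, FPKM in column 8, and TPM in column 9.
    Returns {gene_id: (total_fpkm, total_tpm)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or short lines and any header row (non-numeric start).
        if len(fields) < 9 or not fields[4].isdigit():
            continue
        totals[fields[0]][0] += float(fields[7])
        totals[fields[0]][1] += float(fields[8])
    return {gene: (fpkm, tpm) for gene, (fpkm, tpm) in totals.items()}
```

Applied to each sample's table, this gives one entry per gene ID, so tables from different samples can be joined on gene ID even when the number of records per gene differs.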

I look forward to your reply.

Thank you.

Regards, Xi

mpertea commented 4 years ago

Dear Xi,

Could you please try stringtie 2.0.4? There were some bugs in 2.0 when using the -e option but we fixed them.

dbrg77 commented 4 years ago

Hi Mihaela,

Thanks for the reply. I have just tried v2.0.4, and the problem still exists.

If it helps, these are the lines in the gtf file that are related to Snhg14:

chr7    mm10_wgEncodeGencodeBasicVM22   exon    59619158    59619280    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185693.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59667850    59667991    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185693.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59669588    59669742    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185693.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59673810    59673956    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185693.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59676338    59676484    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185693.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59874730    59874749    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185693.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59711359    59711505    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188162.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59713880    59714026    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188162.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59716409    59716555    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188162.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59718938    59719084    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188162.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59721475    59721621    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188162.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59904652    59904686    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188162.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59307924    59309726    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000190666.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59317859    59318062    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000190666.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59322449    59324149    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000190666.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59445703    59445766    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59447089    59447134    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59447515    59447637    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59448958    59449003    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59449384    59449506    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59450830    59450875    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59451256    59451378    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59453124    59453239    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188262.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59666434    59667991    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000191191.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59669588    59669742    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000191191.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59673810    59673956    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000191191.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59676338    59676484    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000191191.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59676618    59678601    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000191191.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59676393    59676484    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185815.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59678859    59679005    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185815.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59681404    59681550    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185815.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59683936    59684082    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185815.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59696094    59696230    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185815.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59692438    59692898    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185272.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59693569    59693715    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185272.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59696094    59696211    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185272.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59698710    59698803    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000186345.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59701228    59701374    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000186345.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59703777    59703923    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000186345.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59706316    59706383    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000186345.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59770277    59771948    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000187666.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59859507    59860126    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000187666.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59862644    59862868    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000187666.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59937371    59940989    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59944576    59944714    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59945202    59945305    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59946515    59946640    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59955645    59955724    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59956682    59956800    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59969162    59969314    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59970284    59970543    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59971880    59971944    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59973271    59973650    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59973847    59973899    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59973997    59975759    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000189581.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59946630    59946640    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185890.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59955645    59955724    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185890.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59969162    59969314    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185890.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59970284    59970543    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185890.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59971880    59971944    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185890.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59973271    59973457    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000185890.6";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59969577    59970543    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188976.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59971880    59971944    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188976.1";
chr7    mm10_wgEncodeGencodeBasicVM22   exon    59973271    59974431    0.000000    -   gene_id "Snhg14"; transcript_id "ENSMUST00000188976.1";
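To see how many disjoint loci these transcripts form, the exon lines can be grouped by transcript and the transcript spans merged wherever they overlap. This is a minimal sketch of plain interval merging and makes no claim about how stringtie bundles transcripts internally. It assumes whitespace-separated lines with the chromosome, start, end, and strand in fields 1, 4, 5, and 7, as in the lines above; the function names are hypothetical.

```python
import re
from collections import defaultdict

GENE_RE = re.compile(r'gene_id "([^"]+)"')
TX_RE = re.compile(r'transcript_id "([^"]+)"')


def transcript_spans(gtf_lines):
    """Return {transcript_id: (gene_id, chrom, strand, start, end)} from exon lines."""
    spans = {}
    for line in gtf_lines:
        fields = line.split()
        gene, tx = GENE_RE.search(line), TX_RE.search(line)
        # Skip comments, non-exon features, and lines without both IDs.
        if len(fields) < 7 or fields[2] != "exon" or not gene or not tx:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if tx.group(1) in spans:
            _, _, _, old_start, old_end = spans[tx.group(1)]
            start, end = min(start, old_start), max(end, old_end)
        spans[tx.group(1)] = (gene.group(1), chrom, strand, start, end)
    return spans


def loci_by_overlap(spans):
    """Merge overlapping transcript spans per (gene_id, chrom, strand)."""
    grouped = defaultdict(list)
    for gene, chrom, strand, start, end in spans.values():
        grouped[(gene, chrom, strand)].append((start, end))
    loci = []
    for key, intervals in grouped.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps the current locus: extend it
                cur_end = max(cur_end, end)
            else:  # gap: close the current locus and start a new one
                loci.append(key + (cur_start, cur_end))
                cur_start, cur_end = start, end
        loci.append(key + (cur_start, cur_end))
    return sorted(loci)
```

For the Snhg14 lines above, merging by hand in this way gives four intervals (59307924-59324149, 59445703-59453239, 59619158-59904686, and 59937371-59975759), which are the four records reported for the Mg1 sample.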

Thanks.

Regards, Xi