gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

FPKM values of 0 in single samples in single genes #141

Open MassaBob opened 7 years ago

MassaBob commented 7 years ago

I used stringtie to analyse the output of my hisat alignment in >30 human samples in parallel and detected single samples with FPKM values of 0 for several genes although the regions are well covered by reads.

This seems to resemble https://github.com/gpertea/stringtie/issues/102. Using cufflinks I get FPKM values as expected.

For troubleshooting I picked two of the samples showing 0PKM in my favourite gene SOX2 and two samples showing high FPKMs and repeated the analysis using only reads mapping to the gene. I could recapitulate the error. I changed my pipeline using STAR aligner for mapping followed by stringtie analysis and this led to FPKM values >0 in three of the four samples, i.e. recovering one sample. Applying Cufflinks instead of stringtie showed high FPKM values for all 4 samples.

Attached you will find sam files created with hisat and STAR covering the SOX2 region for one "good" sample (Sample 81) and the sample that could be rescued using STAR aligner (sample 163).
SAM_ARCHIVE.zip

SOX2 is well covered in both samples: slide1 slide2

Furthermore, I added gtf output for the two samples of all four different analysis combinations (hisat vs STAR + Stringtie vs cufflinks) with the SOX2 gene extracted. SOX2_gtfs.zip

I appreciate any help or any hint what might have happened here and how to deal with these (recurring) issues in the future.

gpertea commented 7 years ago

Thank you for providing the .sam files but it seems that I am not able to reproduce the issues with just the provided data. I simply ran stringtie -o s163.gtf Sample163_hisat.sam and the output shows large FPKM and coverage values:

# stringtie -o s163.gtf Sample163_hisat.sam
# StringTie version 1.3.3b
chr3    StringTie   transcript  181712403   181715282   1000    +   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "242.097504"; FPKM "382242.812500"; TPM "497675.312500";
chr3    StringTie   exon    181712403   181714359   1000    +   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "254.133179";
chr3    StringTie   exon    181715185   181715282   1000    +   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "2"; cov "1.752461";
chr3    StringTie   transcript  181711860   181712281   1000    .   .   gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "236.135071"; FPKM "372828.843750"; TPM "485418.468750";
chr3    StringTie   exon    181711860   181712281   1000    .   .   gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "236.135071";

So the problem might be context-related. What was the stringtie command line you used? Did you run with the "-e" option (it seems to) ? In that case I would also need the reference transcripts you used with the -G option.

It would be helpful if you could follow the instructions provided here to extract the read alignments and the reference transcripts for a region of interest and also provide the exact command line you used. I hope the version of stringtie you used is also the latest available (v1.3.3b).

MassaBob commented 7 years ago

Hi Geo, thanks for the fast reply. Here are the commands I used to extract and to stringtie read data for the SOX2 region (reference hg38): $ samtools view -h sample81.bam chr3:181711924-181714436 > sample81_sox2.sam $ stringtie sample81_sox2.sam \ -o ./02_STRINGTIE/sample81_SOX2_stringtie_output.gtf \ -p $threads \ -G /Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf \ -b ./03_BALLGOWN/sample81_SOX2 \ -e \ -v

Indeed, I used version 1.2 for the calculations since previous data of the project was also analysed with this version one year ago.

I turned to version 1.3 for the samples - and it seems that for this gene as well as for others the 0FPKM issue seems to be solved. Although novel genes with 0FPKM appear that showed high FPKMs in version 1.2. I have not checked, if this is the same issue or if they are truly uncovered. Attached you will find a pdf with a comparison between v1.2 and v1.3 (the current version) exhibiting tremendous differences. The red star marks SOX2. STAR_Stringtie2_vsStringtie3_unlog.pdf

Yunqing-Yu commented 6 years ago

I have the same problem. There are genes that are highly expressed in one sample, but the FPKM is 0 in another sample. When I visualize the reads in IGV, there are reads in both samples. I am very new in the RNAseq analysis field, and I am not sure how to troubleshoot it. The stringtie version I am using is v1.2.3. Does it suggest that I should use cufflink instead?

gpertea commented 6 years ago

You should use the current version of stringtie (v1.3.3b) which most likely fixed this problem you were having with the older version v1.2.3. If you are still experiencing the same problem with the last (current) version, then post here some example data and I'll be glad to take a look.

The original poster who needlessly opened this issue admitted:

I turned to version 1.3 for the samples - and it seems that for this gene as well as for others the 0FPKM issue seems to be solved.

yqy151, is that the case for you too?

There has always been a reasonable assumption in the developer-user communication when it comes to "bug reports" like this: there is really no point in complaining about problems with an older version of a software unless one can show that the problem still occurs in the current version. Otherwise it is just a waste of everybody's time, isn't it?

SimonJYoung commented 6 years ago

@gpertea I also have the same problem even if I use the current (v1.3.4d) and above refered (v1.3.3b) versions of stringtie. I have 5 samples, named Ct1, Ct2, X30, X43 and X49. and my 2nd pass CMD of stringtie is like: v1.3.3b/stringtie -e -B -p 8 -G stringtie_merged.gtf -o $sample.gtf $sample.hisat2.aln.sort.bam

when troubleshooting with gene BDP1 as an example, I find that FPKM of all transcripts in sample X30, X43 and X49 are 0, which is conflict with what i see in IGV.

the following is meaningful context of gtf files generated by above CMD, and I guess that FPKM will equal 0 if the SCORE field is not 1000 (but I don't know why).

$ grep -w BDP1 *.gtf |awk '$3=="transcript"' |cut -f 1,6,9 |column
Ct1.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000358731"; ref_gene_name "BDP1"; cov "20.334961"; FPKM "2.731615"; TPM "6.655488";
Ct1.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000508917"; ref_gene_name "BDP1"; cov "3.211188"; FPKM "0.431362"; TPM "1.050999";
Ct1.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000380675"; ref_gene_name "BDP1"; cov "0.100848"; FPKM "0.013547"; TPM "0.033007";
Ct1.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000514903"; ref_gene_name "BDP1"; cov "2.167251"; FPKM "0.291129"; TPM "0.709326";
Ct1.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000525844"; ref_gene_name "BDP1"; cov "0.348723"; FPKM "0.046844"; TPM "0.114135";
Ct1.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000508157"; ref_gene_name "BDP1"; cov "0.781526"; FPKM "0.104983"; TPM "0.255788";
Ct2.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000525844"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
Ct2.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000358731"; ref_gene_name "BDP1"; cov "7.624768"; FPKM "1.180269"; TPM "2.880205";
Ct2.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000508917"; ref_gene_name "BDP1"; cov "20.757805"; FPKM "3.213186"; TPM "7.841123";
Ct2.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000380675"; ref_gene_name "BDP1"; cov "0.041717"; FPKM "0.006458"; TPM "0.015758";
Ct2.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000514903"; ref_gene_name "BDP1"; cov "6.087821"; FPKM "0.942359"; TPM "2.299634";
Ct2.gtf:chr5  1000  gene_id "MSTRG.21901"; transcript_id "ENST00000508157"; ref_gene_name "BDP1"; cov "0.867580"; FPKM "0.134296"; TPM "0.327723";
X30.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000358731"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X30.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000380675"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X30.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000508917"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X30.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000508157"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X30.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000514903"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X30.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000525844"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X43.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000358731"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X43.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000380675"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X43.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000508917"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X43.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000508157"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X43.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000514903"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X43.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000525844"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X49.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000358731"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X49.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000380675"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X49.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000508917"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X49.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000508157"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X49.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000514903"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
X49.gtf:chr5  .     gene_id "MSTRG.21901"; transcript_id "ENST00000525844"; ref_gene_name "BDP1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";

When I visualize the read alignments in IGV, there are many reads aliged with BDP1 in all 5 samples. BDP1.IGV

gpertea commented 6 years ago

Likely a different cause than before, but still something to explore. The score column is only set to 1000 by stringtie for transcripts found to be [comprehensively] covered by reads. I am not sure why in some of your samples stringtie disregarded the reads covering that gene.. so let's find out. Is there any difference between the way the mappings (BAM files) were generated for, say, the X42 vs Ct2 samples? I hope the same program, command line options and the same genome target file was used in both cases, right?

There is one situation which comes to mind right now, where such transcripts' coverage might be disregarded: if the percentage of multi-mapped reads covering such transcripts is above certain value (default is quite high: 95%, which would still be a problem for duplicated genes). It could explain what you observed if, for example, for the X42 sample, the reads which aligned to the BDP1 gene were largely multi-mapped (e.g. if you used a different genome target which included alternate chromosomes and thus some reads would appear multi-mapped for some genes). If it is not that.. I would be really curious to take a closer look at the alignments from that gene region in those two BAM files (X42 vs Ct2) as to find out what are the differences which prevented stringtie from considering any transcripts in that gene in the X42 sample.. In IGV, could you also load the assembled transcripts as inferred by StringTie for that BDP1 region, for these 2 samples (for the first run of stringtie on those samples, before the merge, i.e. without the -e option) ? Perhaps there was some unwanted merge/extension of that BDP1 bundle within some neighboring genes, due to much more noisy alignments in the X42 sample maybe?

zhanxiaoyu commented 6 years ago

I meet the same problem with my data too, with Stringtie v1.3.3b and v1.3.4d. I found several strange things.

  1. When will the second column change to StringTie? Once the tool found the RNA-Seq assembled transcript is the same one as that of in reference? But I found a case that no read covered transcript but with its second column name 'StringTie'

  2. What is the definition of fully-covered transcript in the output file ? I found a case that a transcript with all its exon high coverage in file but not in .

  3. I have two set of result, the only difference between two version of result is with parameter '-e' and the other isn't. The number of transcripts of fully-covered is different. It shouldn't be different, should it?

freedog8 commented 5 years ago

Hi, I have encountered the same problem using cuffdiff. @MassaBob some expression-rich gene have FPKM value 0

OwenDonohoe commented 4 months ago

I am having the exact same issue using v2.2.1. Like the examples above I am using the merged gtf, I can see reads mapping to a specific gene in all samples in IGV. But for some samples, the stringtie output registers no reads mapping to these genes, even when using lowest possible coverage cut-off (-c 0.001). As per above, this only happens when the corresponding sample gtf (i.e individual stringtie output gtf for that sample) lacks the standard "1000" value in the confidence score column for this locus. Has this issue been looked at in v 2.2.2 (latest in as of April 2024 from what I can see)?

zhanxiaoyu commented 4 months ago

您好,我是詹小瑜,您的邮件我已收到。我将尽快给您回复。