gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
Majority of FPKMs are zero #6

Closed: srithegreat closed this issue 9 years ago

srithegreat commented 9 years ago

Hi,

I am using the NCBI GRCh38 reference, in which the chromosomes are named with NC_.xxx accessions, and I aligned my RNA-Seq data (101 bp paired-end) to it. I used the annotation GFF with the same chromosome IDs, but with the -b option all my FPKM values are zero. Could this have anything to do with the annotation file?

Srikanth

gpertea commented 9 years ago

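Could you please confirm and clarify:

  • Are you running at least v1.0.1? I think only prior to the v1.0 release we had a bug that would zero the FPKMs when Ballgown output was enabled.
  • In which output file do you see these zero FPKM values (t_data.ctab, the output transcripts GTF, or both)?
  • When you say "my FPKM values", are you referring to some specific target transcripts that you know are expressed in the sample but have their FPKM reported as zero only when you use the -b option? In other words, without the -b option, are the FPKM values non-zero for the same transcripts? (That would eliminate your doubts about the annotation file as the cause of this anomaly.)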

When either the -b or the -B option is used, all the transcripts given in the reference annotation file are reported in the *.ctab files, not just the "expressed" ones. Since the majority of those reference transcripts are not expressed, their FPKMs are written as 0.000000, so the t_data.ctab file will have a lot of these zero FPKMs, but not all of them should be zero.
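One quick way to tell "many zeros" from "all zeros" is to count both cases in t_data.ctab. The following is only a minimal sketch: it assumes the file is tab-delimited with a header row that includes a column named FPKM, and the function name `count_fpkm` and the sample rows are made up for illustration.

```python
import csv
import io


def count_fpkm(handle, column="FPKM"):
    """Return (zero, nonzero) transcript counts for a tab-delimited table."""
    zero = nonzero = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[column]) > 0.0:
            nonzero += 1
        else:
            zero += 1
    return zero, nonzero


# Made-up rows for illustration; a real check would pass open("t_data.ctab").
sample = io.StringIO(
    "t_id\tchr\tt_name\tFPKM\n"
    "1\tNC_000001.11\ttx_a\t0.000000\n"
    "2\tNC_000001.11\ttx_b\t12.345678\n"
    "3\tNC_000002.12\ttx_c\t0.000000\n"
)
zero, nonzero = count_fpkm(sample)
print(f"zero FPKM: {zero}, non-zero FPKM: {nonzero}")
```

If the non-zero count comes back as 0 for a sample that is known to express at least some annotated transcripts, the zeros are probably not just unexpressed reference transcripts.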

It's rather unusual to have genome indexes and annotation using the NC_* accessions instead of the more meaningful chromosome numbers/names. That should not be a problem for StringTie; I am just saying that it may be worth double-checking that the chromosome names in the BAM header do indeed match the ones in the annotation file.
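The name check suggested above could be sketched roughly as follows. It assumes the header has already been dumped to text (for example with `samtools view -H`) and that the annotation is a tab-delimited GFF/GTF whose first column is the sequence name. The helper names and the sample data are hypothetical.

```python
def sam_header_names(header_text):
    """Collect the SN: values from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def annotation_names(gff_text):
    """Collect the first column of every non-empty, non-comment GFF/GTF line."""
    names = set()
    for line in gff_text.splitlines():
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


# Made-up inputs for illustration: the second annotation record uses "chr2",
# which does not appear in the alignment header.
header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:NC_000001.11\tLN:248956422\n"
    "@SQ\tSN:NC_000002.12\tLN:242193529\n"
)
gff = (
    "##gff-version 3\n"
    "NC_000001.11\tRefSeq\tgene\t100\t200\t.\t+\t.\tID=gene0\n"
    "chr2\tRefSeq\tgene\t100\t200\t.\t-\t.\tID=gene1\n"
)

only_in_annotation = annotation_names(gff) - sam_header_names(header)
print("names found only in the annotation:", sorted(only_in_annotation))
```

Any name listed here belongs to annotation records that cannot overlap an aligned read, so transcripts on those sequences would be expected to show zero coverage.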

srithegreat commented 9 years ago

Thanks for the reply. I am using the latest version. I figured out the cause of the zero FPKMs: the library I was using was stranded, and I had not initially aligned it with the proper strandedness. I see that StringTie no longer has a library type argument. After I re-aligned the data with the correct strandedness using HISAT and re-ran StringTie, I now see non-zero FPKMs.

Regards, Srikanth

On Thu, Mar 26, 2015 at 10:37 PM, Geo Pertea notifications@github.com wrote:

Could you please confirm and clarify:

  • Are you running at least v1.0.1? I think only prior to the v1.0 release we had a bug that would zero the FPKMs when Ballgown output was enabled.
  • In which output file do you see these zero FPKM values (t_data.ctab, the output transcripts GTF, or both)?
  • When you say "my FPKM values", are you referring to some specific target transcripts that you know are expressed in the sample but have their FPKM reported as zero only when you use the -b option? In other words, without the -b option, are the FPKM values non-zero for the same transcripts? (That would eliminate your doubts about the annotation file as the cause of this anomaly.)

Srikanth S. Manda Research Scholar Pandey Lab McKusick-Nathans Institute of Genetic Medicine Johns Hopkins University School of Medicine Miller Research Building, Room 560 733 North Broadway Baltimore, Maryland 21205