gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF

MIT License

198 stars 32 forks source link

I am trying to find novel transcripts from an RNA-seq database. I tried to follow the protocol provided in the Nature Protocols paper that described the use of Stringtie; the paper also suggested using GFFcompare for comparing the assembled transcripts with the reference, to quantify and find the novel transcripts.

I followed the entire protocol, and this was the output of GFFcompare when comparing the reference GTF with that of one of the samples after mapping -

#= Summary for dataset: ./hisat2/ERR188044_chrX.gtf 
#-----------------| Sensitivity | Precision  |
        Base level:    51.7     |    79.5    |
        Exon level:    46.7     |    85.2    |
      Intron level:    47.2     |    93.9    |
Intron chain level:    31.2     |    64.4    |
  Transcript level:    31.6     |    52.3    |
       Locus level:    36.6     |    50.1    |

     Matching intron chains:     580
       Matching transcripts:     664
              Matching loci:     397

          Missed exons:    4395/8804    ( 49.9%)
           Novel exons:     426/4874    (  8.7%)
        Missed introns:    3832/7946    ( 48.2%)
         Novel introns:      83/3992    (  2.1%)
           Missed loci:     610/1086    ( 56.2%)
            Novel loci:     273/795 ( 34.3%)

 Total union super-loci across all input datasets: 749 
1270 out of 1270 consensus transcripts written in 188044_only.annotated.gtf (0 discarded as redundant)

In the paper however, the number of novel genes and transcripts is clearly mentioned -

So, I was wondering how they were able to calculate the number of novel transcripts and genes, from the data that GFFcompare provides (they even mention in the paper that they got it from GFFcompare). I see the numbers related to novel exons and novel introns, but how do I translate it to the number and identities of the novel transcripts/genes?

Hi, I just use this StringTie and also follow their protocol and sample data, but I cannot get the exactly identical result listed in the article. I am not sure if you can get the same result as shown in the article.

We can get that information from .tmap file with the following codes. (I found in other website and with my own notes on there), I hope this will help!

n. assembled genes: $ cat ..tmap | sed "1d" | cut -f4 | sort | uniq | wc -l

sed”1d” = delete first line

cut -f4 = take column 4 only = as it is gene id stated in the documentation

sort = sort the data

uniq = take only unique data

wc -l = count the number of lines in the data

n. novel genes: $ cat ..tmap | awk '$3=="u"{print $0}' | cut -f4 | sort | uniq | wc -l

awk ‘$3==”u”{print $0}’ or awk '$3=="u"' = select and print the line that contain “u” in the column 3, as “u” class code indicate novel, not matching any exon/intron

cut -f4 = take column 4 only = as it is gene id stated in the documentation

n. novel transcripts: $ cat ..tmap | awk '$3=="u"{print $0}' | cut -f5 | sort | uniq | wc -l

cut -f5 = take column 5 only = as it is transcript id stated in the documentation

n. transcripts matching annotation: $ cat ..tmap | awk '$3=="="{print $0}' | cut -f5 | sort | uniq | wc -l

cut -f5 = take column 5 only = as it is transcript id

awk '$3=="="{print $0}' = select and print the line that contain “=” in the column 3, as “=” class code indicate matching

gpertea / gffcompare

Question: How do you find novel transcripts using GFFcompare? #42

sed”1d” = delete first line

cut -f4 = take column 4 only = as it is gene id stated in the documentation

sort = sort the data

uniq = take only unique data

wc -l = count the number of lines in the data

awk ‘$3==”u”{print $0}’ or awk '$3=="u"' = select and print the line that contain “u” in the column 3, as “u” class code indicate novel, not matching any exon/intron

cut -f4 = take column 4 only = as it is gene id stated in the documentation

cut -f5 = take column 5 only = as it is transcript id stated in the documentation

cut -f5 = take column 5 only = as it is transcript id

awk '$3=="="{print $0}' = select and print the line that contain “=” in the column 3, as “=” class code indicate matching