gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License
353 stars 39 forks source link

Gffread to extract transcript sequences from transcript assembly GTF file #42

Open Aishah17 opened 4 years ago

Aishah17 commented 4 years ago

I have a transcript assembly gtf file (filename.gtf) generated by StringTie. To find the number of assembled sequences, I used the command: cut -f9 filename.gtf | cut -d ' ' -f4 | uniq | sort -u | wc -l This gives me 85371.

Now I want to extract all the sequences from filename.gtf into a fasta file (filename.fa) with gffread, and I used the command: gffread filename.gtf -w filename.fa -g my_reference_genome.fa The run completes & I get my filename.fa file. I check the number of sequences in filename.fa, using the command: grep '>' filename.fa | wc -l This gives me 85369. I was expecting to get 85371 as above. Is it bad news that I am getting two different counts or is that normal as gffread discards some sequences?

gpertea commented 4 years ago

The GTF generated by StringTie usually has two header lines that might account for the difference of 2 that you noticed here. This would be easy to verify if you take a look at the last few entries you get by using your cut-pipes there :), something like this:

cut -f9 filename.gtf | cut -d ' ' -f4 | uniq | sort -u | tail

As a side note, I hope you realize that's not the proper way to get the list of transcripts in a GTF, because the order of the attributes in a GTF is not guaranteed to be the same. It's better to actually parse the transcript_id attributes there -- for example, using a perl regex like in this gtfcount script: https://github.com/gpertea/gscripts/blob/master/gtfcount (there is also a "gffcount" script there doing the same for GFF3 formatted files).

Aishah17 commented 4 years ago

Thanks a lot for your prompt and very helpful response gpertea. The two header lines in the GTF file are actually what I didn't think about. Great to have highlighted your additional scripts :) Issue solved!