gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License

Output issues for the -w, -x, and -y options #76

Open sjfleck opened 3 years ago

sjfleck commented 3 years ago

My goal is to create a genome-guided transcriptome assembly using StringTie and use gffread to convert the output GTF into a GFF3. I seem to be able to create the .gff3 file without a problem, but I want to see how complete it is using BUSCO's transcriptome or protein mode. It seems like -y might be the best option for that, but I'm having a difficult time getting it to work. I also tried the -w and -x options, but only -w worked. Here are my commands:

hisat2-build -p 16 $REF $SAMPLE
hisat2 --max-intronlen 20000 -p 16 --dta -x $SAMPLE -1 $READS1 -2 $READS2 -S $SAMPLE.sam
samtools sort -@ 16 -o $SAMPLE.bam $SAMPLE.sam
stringtie $BAM -o $OUT -p 16
gffread $OUT > $SAMPLE.gff3

At this point, I have a .gff3 that seems to be fine, but when I run:

gffread $SAMPLE.gff3 -g $FASTA -w exons.fa -x cds.fa -y tr_cds.fa

I get a FASTA file with spliced exons for each transcript, but cds.fa and tr_cds.fa are both empty. If you have any guidance for getting this to work, I would appreciate it. Thank you, and thank you for creating all these tools.

gpertea commented 3 years ago

StringTie does not output any CDS features (only exon features), and CDS features are required by the -x and -y options of gffread. You might want to run an ORF finder program (e.g. TransDecoder) to predict and assign likely CDS features to the StringTie output.
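One way to confirm this on a given annotation file is to tally the feature types in column 3. The following is a minimal sketch, assuming a tab-delimited GFF/GTF file with the usual nine columns; `count_features` and `has_cds` are hypothetical helper names for illustration, not part of gffread or StringTie.

```shell
# Print each feature type (column 3) with the number of lines that carry it.
# Assumes tab-delimited GFF/GTF; comment lines starting with '#' are skipped.
count_features() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Return success (exit status 0) only if the file has at least one CDS feature.
has_cds() {
    awk -F'\t' '!/^#/ && $3 == "CDS" { found = 1; exit } END { exit !found }' "$1"
}
```

For example, `has_cds "$SAMPLE.gff3" || echo "no CDS features found"` would report the condition described above. For unmodified StringTie output, `count_features` should list only transcript and exon rows, which is consistent with -w producing sequences while -x and -y come out empty.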

sjfleck commented 3 years ago

Thank you for your quick feedback!