gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License

gffread, Error: no genomic sequence provided! #34

Closed Quicken-up closed 5 years ago

Quicken-up commented 5 years ago

Hello, I am getting this error when I run gffread on GeneMark-ES output with a command like this:
gffread genemark.gtf -g test.fna -x proteins_test

The .fna file is a multi-sequence FASTA, and the .fai index is created during the gffread run, before the error. Full command output:

No fasta index found for test.fna. Rebuilding, please wait..
Fasta index rebuilt.
Warning: couldn't find fasta record for 'KI925007.1|kraken:taxid|1036723 Plasmodium falciparum Vietnam Oak-Knoll (FVO) unplaced genomic scaffold supercont1.1, whole genome shotgun sequence'!
Error: no genomic sequence provided!

head -n 3 test.fna

>KI925007.1|kraken:taxid|1036723 Plasmodium falciparum Vietnam Oak-Knoll (FVO) unplaced genomic scaffold supercont1.1, whole genome shotgun sequence
GACGACGACGACGACGAAGACGAAGATGACGAAGGCAAAGTCGAGGCGGCGAAGGAAGAC
CAGGTGGACAGGAAGGGGGAAACGGCAAAGGAGGAGGAACCACCGGCATCACAAAACGAT

head -n 3 genemark.gtf

KI925302.1|kraken:taxid|1036723 Plasmodium falciparum Vietnam Oak-Knoll (FVO) unplaced genomic scaffold supercont1.296, whole genome shotgun sequence GeneMark.hmm exon 2031 2915 0 + . gene_id "1_g"; transcript_id "1_t";
KI925302.1|kraken:taxid|1036723 Plasmodium falciparum Vietnam Oak-Knoll (FVO) unplaced genomic scaffold supercont1.296, whole genome shotgun sequence GeneMark.hmm start_codon 2031 2033 . + 0 gene_id "1_g"; transcript_id "1_t";
KI925302.1|kraken:taxid|1036723 Plasmodium falciparum Vietnam Oak-Knoll (FVO) unplaced genomic scaffold supercont1.296, whole genome shotgun sequence GeneMark.hmm CDS 2031 2915 . + 0 gene_id "1_g"; transcript_id "1_t";

head -n 3 test.fna.fai

KI925007.1|kraken:taxid|1036723 424099 149 60 61
KI925008.1|kraken:taxid|1036723 100393 431466 60 61
KI925009.1|kraken:taxid|1036723 22759 533682 60 61

Any help appreciated

Quicken-up commented 5 years ago

It appears gffread cannot handle spaces in the headers. Fixed.

gpertea commented 5 years ago

It's not that it cannot; it just doesn't have or want to handle spaces within sequence IDs. The basic assumption has always been that sequence IDs do not contain spaces. For example, when a FASTA header is parsed, the first space is supposed to mark the end of the ID, and the rest of the line is just additional info/data about that particular sequence ID.
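To illustrate the convention described above, here is a minimal Python sketch of how a FASTA header line is commonly split into an ID and a description. It is not gffread's actual code; the function name `split_fasta_header` is made up for this example.

```python
def split_fasta_header(header_line):
    """Split a FASTA header into (sequence ID, description).

    Assumes the usual convention: the ID ends at the first
    whitespace character, and everything after it is free text.
    """
    text = header_line.strip()
    # Drop the leading '>' that marks a FASTA header, if present.
    if text.startswith(">"):
        text = text[1:]
    # Split once, on the first run of whitespace.
    parts = text.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description
```

Applied to the first header of test.fna, this would yield the ID `KI925007.1|kraken:taxid|1036723`, which is the name that ends up in the first column of test.fna.fai, while the GTF file uses the whole header line as the sequence name.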

And when it comes to GFF files, it does seem rather wasteful to have reference sequence IDs containing full worded descriptions (!) repeated over and over in the 1st column. I have never seen anything like that before, and I suspect gffread is not the first program that will refuse to handle them.
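One possible workaround for a GTF like the one in this issue, sketched in Python: trim the first column of each record to its first whitespace-delimited token, so that it matches the IDs in the .fai index. This assumes the file is tab-separated, as GTF requires, and that the descriptions only ever appear in the first column. The function name `shorten_seqid` is hypothetical and this is not part of gffread.

```python
def shorten_seqid(gtf_line):
    """Return the GTF line with column 1 cut at the first whitespace.

    Comment lines (starting with '#') and blank lines are returned
    unchanged. Columns are assumed to be tab-separated.
    """
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line
    columns = gtf_line.split("\t")
    # Keep only the first token of the sequence name, e.g.
    # "KI925302.1|kraken:taxid|1036723 Plasmodium ..." becomes
    # "KI925302.1|kraken:taxid|1036723".
    tokens = columns[0].split()
    if tokens:
        columns[0] = tokens[0]
    # The trailing newline, if any, stays in the last column.
    return "\t".join(columns)
```

Running each line of genemark.gtf through this function and writing the result to a new file should give gffread sequence names it can look up in test.fna. Alternatively, the FASTA headers could be shortened before running GeneMark-ES so that the long names never reach the GTF.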