virus-evolution / gofasta

Unsupported chars in reference id? and accepting GFF3 filename extension #44

Open mmokrejs opened 7 months ago

mmokrejs commented 7 months ago

Hi, I wonder why `-` characters are not allowed in my reference ID. I had to use `_` instead to work around this.

$ ./gofasta sam variants -a 7-WU-FF1.gff -s foo.sorted.sam
Error: Error parsing gff SeqID: 7-WU-FF
$ ./gofasta variants -a 7-WU-FF1.gff -r 7-WU-FF --msa 7-WU-FF.fa
Error: Error parsing gff SeqID: 7-WU-FF

Please improve the error message to make it clear what the source of the error is. The reference used to create the SAM file was named 7-WU-FF, and the gff3 likewise contains 7-WU-FF.

The 7-WU-FF1.gff file contains:

7-WU-FF ignored_field   gene    979 1440    .   +   .   ID=gene:S;biotype=protein_coding;Name=S
7-WU-FF ignored_field   transcript  979 1440    .   +   .   ID=transcript:43740568;Parent=gene:S;biotype=protein_coding
7-WU-FF ignored_field   exon    979 1440    .   +   .   Parent=transcript:43740568
7-WU-FF ignored_field   CDS 979 1440    .   +   0   Parent=transcript:43740568
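
For reference, the GFF3 specification allows the characters `[a-zA-Z0-9.:^*$@!+_?-|]` unescaped in the seqid column, so `7-WU-FF` is a valid seqid by the spec. The sketch below is not gofasta's parser; `checkSeqID` is a hypothetical helper that only illustrates a check whose error message names the column and the reason for the failure.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Characters the GFF3 specification permits unescaped in column 1 (seqid).
var seqIDPattern = regexp.MustCompile(`^[a-zA-Z0-9.:^*$@!+_?|-]+$`)

// checkSeqID is a hypothetical helper (not part of gofasta). It returns the
// seqid of one tab-separated GFF3 feature line, or an error that states
// which check failed.
func checkSeqID(line string) (string, error) {
	fields := strings.Split(line, "\t")
	if len(fields) != 9 {
		return "", fmt.Errorf("gff line has %d tab-separated columns, expected 9", len(fields))
	}
	id := fields[0]
	if strings.HasPrefix(id, ">") {
		return "", fmt.Errorf("gff seqid %q must not begin with an unescaped '>'", id)
	}
	if !seqIDPattern.MatchString(id) {
		return "", fmt.Errorf("gff seqid %q is empty or contains characters that must be percent-escaped", id)
	}
	return id, nil
}

func main() {
	line := "7-WU-FF\tignored_field\tgene\t979\t1440\t.\t+\t.\tID=gene:S;biotype=protein_coding;Name=S"
	id, err := checkSeqID(line)
	fmt.Println(id, err)
}
```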

BTW, I had to rename my file from .gff3 to .gff to get rid of:

./gofasta sam variants -a 7-WU-FF1.gff3 -s foo.sorted.sam
Error: couldn't tell if --annotation was a .gb or a .gff file

Please relax the check if this is just about the filename extension match. Thank you.
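
A minimal sketch of what a relaxed extension check might look like, assuming the decision really is based only on the filename extension. `annotationFormat` is a hypothetical function, not gofasta's actual code, and the returned format labels are made up for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// annotationFormat is a hypothetical helper (not gofasta's implementation).
// It guesses the annotation format from the filename extension and treats
// .gff3 the same as .gff.
func annotationFormat(path string) (string, error) {
	switch ext := strings.ToLower(filepath.Ext(path)); ext {
	case ".gff", ".gff3":
		return "gff", nil
	case ".gb":
		return "genbank", nil
	default:
		return "", fmt.Errorf("cannot tell the format of --annotation from extension %q: expected .gb, .gff or .gff3", ext)
	}
}

func main() {
	for _, name := range []string{"7-WU-FF1.gff", "7-WU-FF1.gff3", "7-WU-FF1.txt"} {
		format, err := annotationFormat(name)
		fmt.Println(name, format, err)
	}
}
```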

mmokrejs commented 7 months ago

BTW, I had to add `Name=S` to the 9th column of the CDS line to make gofasta happy. Here is the relevant passage from your docs:

For the purposes of annotating amino acids, `CDS` or `mature_protein_region_of_CDS` feature lines that have a `Name=something` tag,value pair in the attributes column (column 9) will be represented in the output.

Why doesn't it infer the Name= from the Parent=?
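
To make the question concrete, here is a sketch of what inferring `Name` through the `Parent` chain could look like for the four feature lines above. This is not how gofasta behaves. `parseAttributes` and `inheritedName` are hypothetical helpers, and the sketch assumes each feature has at most one `Parent` value, although GFF3 allows a comma-separated list.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAttributes splits a GFF3 attributes column (column 9) into tag/value pairs.
func parseAttributes(col9 string) map[string]string {
	attrs := make(map[string]string)
	for _, pair := range strings.Split(col9, ";") {
		if tag, value, ok := strings.Cut(strings.TrimSpace(pair), "="); ok {
			attrs[tag] = value
		}
	}
	return attrs
}

// inheritedName returns the feature's own Name if present, otherwise the
// Name of its nearest ancestor reached by following Parent. The depth limit
// guards against cyclic Parent references.
func inheritedName(attrs map[string]string, byID map[string]map[string]string) (string, bool) {
	for depth := 0; depth < 10 && attrs != nil; depth++ {
		if name, ok := attrs["Name"]; ok {
			return name, true
		}
		attrs = byID[attrs["Parent"]] // nil when there is no (known) parent
	}
	return "", false
}

func main() {
	gene := parseAttributes("ID=gene:S;biotype=protein_coding;Name=S")
	transcript := parseAttributes("ID=transcript:43740568;Parent=gene:S;biotype=protein_coding")
	cds := parseAttributes("Parent=transcript:43740568")

	// Index the features that carry an ID so that Parent values can be resolved.
	byID := map[string]map[string]string{
		gene["ID"]:       gene,
		transcript["ID"]: transcript,
	}
	name, ok := inheritedName(cds, byID)
	fmt.Println(name, ok)
}
```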

benjamincjackson commented 7 months ago
mmokrejs commented 7 months ago
  • Can you provide input files for the first comment in this issue as attachments please, so that it is possible to replicate the error?

testcases.aln.txt, testcases.sam.txt, 7-WU-FF1.fasta.txt (or replace Ns with, say, a for https://github.com/virus-evolution/gofasta/issues/46), 7-WU-FF.gff.txt

  I'll think about changing the file extension check, but it's more parsimonious for the user just to change their filename.

Other tools happily accept .gff3, which is exactly why I wanted to keep the filename as it is and avoid an extra symlink or renaming.

* I'm not sure what you mean by "to make gofasta happy", but the purpose of the `Name=something` parsing convention was so that things with a "Name=x" tag would be reported in the output, whereas things without wouldn't be. In this way it is possible to annotate protein-coding features in your gff for the purposes of defining non-protein-coding (intronic, intergenic, synonymous) nucleotide changes, without also having to have every amino acid change in every gene in the output.

I am writing this from memory, but I think gofasta skipped reporting the output if `Name=` was unset. So I had to edit my .gff file to get it working.

  This usage was intended to be somewhat coherent with the [gff version 3 specifications](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md).
  Does this also make it clearer why Name is not inherited from a Parent feature? (which isn't coherent with the gff spec, I don't think).

I will study that later; I cannot tell straight away.