Closed milesroberts-123 closed 1 year ago
So I can replicate this exactly, though one thing I will point out is that degenotate looks for either "transcript" or "mRNA" and "CDS" (not "exon"). But this problem is happening when the transcripts are read so we just haven't gotten to any issues with your switch from "CDS" to "exon".
Basically, when we just swap the "gene" string for the "transcript" or "mRNA" string in column 3 of the gff, this leads to a host of problems when parsing the parent ID of the transcript which doesn't exist since it was annotated as a "gene" feature -- normally the parent of a transcript is a gene. degenotate uses this mainly to determine longest transcript per gene, I think, but this check for the parent ID is kind of baked in to the gff parsing code so it will take a bit of effort to get it out such that this can be disabled.
The real solution is for GFFs to be formatted in a more standard way. I have no idea why this one doesn't have transcript features. Of course this isn't a helpful solution for your problem at hand. Unless @tsackton has any other suggestions, this will take a bit of re-factoring this code to solve. I could see us either having an option to disable the parent look-up for transcripts if the user doesn't care about longest transcripts or, probably less ideally, another read-through of the gff to determine if it has "transcript" features at all, and if not to fall back to "gene" features as transcripts with the parent look-up disabled.
Hi Miles,
Looking into this a bit more, as far as I can tell this is an invalid GFF file, as I don't think CDS features can technically have the gene feature as their parent.
I think the solution here is to fix the GFF before running degenotate. There are several possible ways one could probably do this. One option might be to remove the gene features, and try running gffread to rebuild the transcript/gene structure from just CDS inputs. You might also look at agat which has some GFF cleanup utilities, or pygtftk.
Hope this helps!
Tim
Hello, I also encountered this problem. And I don't know how to solve it. I also tried using gffread to convert gff3 to gff/gtf format, but it did not work. Could you please provide an example demonstrating the format of gff files?
Here is a standard definition of GFF files. The key for degenotate is that you have gene, transcript and CDS features annotated.
This is the script: python /home/tiawu/miniconda3/envs/degenotate/bin/degenotate.py --overwrite -a sp0013_hap2.gff3 -g sp0013_hap2.fasta -o ./sp0013.
However, it comes with errors:
06.26.2024 06:06:18 Detecting compression of annotation file Success: No compression detected 0.94491 0.00122
06.26.2024 06:06:18 Reading transcripts Success: 18543 transcripts read 2.04724 1.10094
06.26.2024 06:06:20 Reading coding exons Success: 104185 coding exons read 2.79886 0.74911
06.26.2024 06:06:20 Detecting compression of genome FASTA file Success: No compression detected 2.81253 0.0111
06.26.2024 06:06:20 Reading genome FASTA file Success: 76 seqs read 7.19428 4.37993
06.26.2024 06:06:25 Checking headers Success 7.2036 0.00755
06.26.2024 06:06:25 Extracting CDS In progress... Traceback (most recent call last):
File "/home/tiawu/miniconda3/envs/degenotate/bin/degenotate.py", line 68, in
I attached some examples of my gff3 files here. Could you help me to check what the problem is it? Thanks!
When I used gffread to transform the gff3 format to the gff format, it also didn't work, it came with the same errors as above. the format of gff is list below:
Script: python /home/tiawu/miniconda3/envs/degenotate/bin/degenotate.py --overwrite -a sp0013_hap2.gff -g sp0013_hap2.fasta -o ./sp0013
Errors: 06.26.2024 06:24:44 Detecting compression of annotation file Success: No compression detected 0.05874 0.00942 06.26.2024 06:24:44 Reading transcripts In progress... 0 [] ['ID=model.evm.Chr01.1', 'geneID=gene.evm.Chr01.1', 'gene_name=Augustus%20prediction'] ['Chr01', 'Augustus', 'mRNA', '7180', '7784', '.', '-', '.', 'ID=model.evm.Chr01.1;geneID=gene.evm.Chr01.1;gene_name=Augustus%20prediction']
It looks like your problem is that you don’t have frame annotated. Degenotate requires CDS features with properly set reading frame, otherwise it is impossible to tell what is a 1st, 2nd, or 3rd position in a codon.
Hello,
I got interesting errors when testing degenotate v1.1.2 on the assembly and annotation for Eleusine coracana, which you can download here.
The command:
The output:
I think the problem is that this GFF variant has only "gene" and "CDS" features, whereas I think degenotate looks for "mRNA" and "exon" features, correct? I tried to fix this error by changing genes and CDS to mRNA and exons, respectively
sed 's/\tgene/\tmRNA/' data/annotations/eleusine_coracana.gff3 | sed 's/\tCDS/\texon/' > tmp.gff3
But I then got a different error about ID parsing
The new command:
python degenotate.py --overwrite -d " " -a tmp.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana
The new output:
I'm not quite sure what this error refers to because degenotate should be able to properly construct transcripts by looking at the matching ID and Parent tags of the mRNA and exon features, respectively. But maybe I'm missing something about how degenotate behaves.
Anyway, I wanted to post here because I didn't see either of these errors come up in the issue section yet.
Thanks in advance for your help! And let me know if you need more info!