harvardinformatics / degenotate

MIT License
41 stars 4 forks source link

Error GXF1: Invalid number of IDs found during transcript parent id parsing AND Error GXF2: No CDS exons found in input annotation file! Cannot calculate degeneracy without coding sequences #33

Closed milesroberts-123 closed 1 year ago

milesroberts-123 commented 1 year ago

Hello,

I got interesting errors when testing degenotate v1.1.2 on the assembly and annotation for Eleusine coracana, which you can download here.

The command:

python degenotate.py --overwrite -d " " -a data/annotations/eleusine_coracana.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana

The output:

/mnt/home/robe1195/anaconda3/envs/degenotate/bin/degenotate.py --overwrite -d   -a data/annotations/eleusine_coracana.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana

#
# =============================================================================================================================

    |                               |           |
,---| ,---. ,---. ,---. ,---. ,---. |---  ,---. |---  ,---.
|   | |---' |   | |---' |   | |   | |     ,---| |     |---'
`---' `---' `---| `---' `   ' `---' `---' `---^ `---' `---'
            `---'
            Degeneracy annotation of transcripts

#
# Welcome to degenotate -- Degeneracy annotation for coding transcripts.
# Version 1.1.2 released on January 2023
# degenotate was developed by Timothy Sackton, Gregg Thomas
# Website:       https://github.com/harvardinformatics/degenotate
#
# The date and time at the start is: 04.20.2023 | 11:04:48
# Using Python version:              3.10.8
#
# The program was called as:         /mnt/home/robe1195/anaconda3/envs/degenotate/bin/degenotate.py --overwrite -d   -a data/annotations/eleusine_coracana.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana
#
# -----------------------------------------------------------------------------------------------------------------------------
# INPUT/OUTPUT INFO:
# Annotation file:                      /mnt/ufs18/home-010/robe1195/Josephs_Lab_Projects/tajimasDacrossSpecies/workflow/data/annotations/eleusine_coracana.gff3
# Annotation file type:                 gff
# Genome file:                          /mnt/ufs18/home-010/robe1195/Josephs_Lab_Projects/tajimasDacrossSpecies/workflow/data/assemblies/eleusine_coracana.fa
# Output directory:                     data/degenotateOutput/eleusine_coracana
# Per-site degeneracy output:           data/degenotateOutput/eleusine_coracana/degeneracy-all-sites.bed
# Transcript count output:              data/degenotateOutput/eleusine_coracana/transcript-counts.tsv
# Log file:                             eleusine_coracana.log
# -----------------------------------------------------------------------------------------------------------------------------
# OPTIONS INFO:
# Option                                Current setting               Current action
# -d                                    " "                           degenotate will split FASTA headers at this character.
# --overwrite                           True                          degenotate will OVERWRITE the existing files in the specified output directory.
# --quiet                               False                         Time, memory, and status info will be printed to the screen while degenotate is running.
# ------------------------------------------------------------------------------------------------------------------------------------------------------
# Date        Time      Current step                                      Status                                  Elapsed time (s)    Step time (s)
# ------------------------------------------------------------------------------------------------------------------------------------------------------
# 04.20.2023  11:04:48  Detecting compression of annotation file          Success: No compression detected        0.07164             0.04884
# 04.20.2023  11:04:48  Reading transcripts                               Success: 0 transcripts read             0.84133             0.76865
# 04.20.2023  11:04:49  Reading coding exons                              Success: 0 coding exons read            2.52587             1.6836

----------------------------------------------------------------------------------------------------------------
**Error GXF2: No CDS exons found in input annotation file! Cannot calculate degeneracy without coding sequences.
----------------------------------------------------------------------------------------------------------------

Script call: /mnt/home/robe1195/anaconda3/envs/degenotate/bin/degenotate.py --overwrite -d   -a data/annotations/eleusine_coracana.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana

I think the problem is that this GFF variant has only "gene" and "CDS" features, whereas I think degenotate looks for "mRNA" and "exon" features, correct? I tried to fix this error by changing genes and CDS to mRNA and exons, respectively

sed 's/\tgene/\tmRNA/' data/annotations/eleusine_coracana.gff3 | sed 's/\tCDS/\texon/' > tmp.gff3

But I then got a different error about ID parsing

The new command:

python degenotate.py --overwrite -d " " -a tmp.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana

The new output:

/mnt/home/robe1195/anaconda3/envs/degenotate/bin/degenotate.py --overwrite -d   -a tmp.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana

#
# =============================================================================================================================

    |                               |           |
,---| ,---. ,---. ,---. ,---. ,---. |---  ,---. |---  ,---.
|   | |---' |   | |---' |   | |   | |     ,---| |     |---'
`---' `---' `---| `---' `   ' `---' `---' `---^ `---' `---'
            `---'
            Degeneracy annotation of transcripts

#
# Welcome to degenotate -- Degeneracy annotation for coding transcripts.
# Version 1.1.2 released on January 2023
# degenotate was developed by Timothy Sackton, Gregg Thomas
# Website:       https://github.com/harvardinformatics/degenotate
#
# The date and time at the start is: 04.20.2023 | 11:11:44
# Using Python version:              3.10.8
#
# The program was called as:         /mnt/home/robe1195/anaconda3/envs/degenotate/bin/degenotate.py --overwrite -d   -a tmp.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana
#
# -----------------------------------------------------------------------------------------------------------------------------
# INPUT/OUTPUT INFO:
# Annotation file:                      /mnt/ufs18/home-010/robe1195/Josephs_Lab_Projects/tajimasDacrossSpecies/workflow/tmp.gff3
# Annotation file type:                 gff
# Genome file:                          /mnt/ufs18/home-010/robe1195/Josephs_Lab_Projects/tajimasDacrossSpecies/workflow/data/assemblies/eleusine_coracana.fa
# Output directory:                     data/degenotateOutput/eleusine_coracana
# Per-site degeneracy output:           data/degenotateOutput/eleusine_coracana/degeneracy-all-sites.bed
# Transcript count output:              data/degenotateOutput/eleusine_coracana/transcript-counts.tsv
# Log file:                             eleusine_coracana.log
# -----------------------------------------------------------------------------------------------------------------------------
# OPTIONS INFO:
# Option                                Current setting               Current action
# -d                                    " "                           degenotate will split FASTA headers at this character.
# --overwrite                           True                          degenotate will OVERWRITE the existing files in the specified output directory.
# --quiet                               False                         Time, memory, and status info will be printed to the screen while degenotate is running.
# ------------------------------------------------------------------------------------------------------------------------------------------------------
# Date        Time      Current step                                      Status                                  Elapsed time (s)    Step time (s)
# ------------------------------------------------------------------------------------------------------------------------------------------------------
# 04.20.2023  11:11:44  Detecting compression of annotation file          Success: No compression detected        0.02601             0.00363
# 04.20.2023  11:11:44  Reading transcripts                               In progress...

0
[]
['ID=gene-PR202ga00001', 'Name=ga00001', 'gbkey=Gene', 'gene=ga00001', 'genebiotype=proteincoding', 'locustag=PR202ga00001']
['BQKI01000001.1', 'DDBJ', 'mRNA', '84805', '85236', '.', '+', '.', 'ID=gene-PR202ga00001;Name=ga00001;gbkey=Gene;gene=ga00001;genebiotype=proteincoding;locustag=PR202ga00001']

-----------------------------------------------------------------------------
**Error GXF1: Invalid number of IDs found during transcript parent id parsing
-----------------------------------------------------------------------------

Script call: /mnt/home/robe1195/anaconda3/envs/degenotate/bin/degenotate.py --overwrite -d   -a tmp.gff3 -g data/assemblies/eleusine_coracana.fa -o data/degenotateOutput/eleusine_coracana

I'm not quite sure what this error refers to because degenotate should be able to properly construct transcripts by looking at the matching ID and Parent tags of the mRNA and exon features, respectively. But maybe I'm missing something about how degenotate behaves.

Anyway, I wanted to post here because I didn't see either of these errors come up in the issue section yet.

Thanks in advance for your help! And let me know if you need more info!

gwct commented 1 year ago

So I can replicate this exactly, though one thing I will point out is that degenotate looks for either "transcript" or "mRNA" and "CDS" (not "exon"). But this problem is happening when the transcripts are read so we just haven't gotten to any issues with your switch from "CDS" to "exon".

Basically, when we just swap the "gene" string for the "transcript" or "mRNA" string in column 3 of the gff, this leads to a host of problems when parsing the parent ID of the transcript which doesn't exist since it was annotated as a "gene" feature -- normally the parent of a transcript is a gene. degenotate uses this mainly to determine longest transcript per gene, I think, but this check for the parent ID is kind of baked in to the gff parsing code so it will take a bit of effort to get it out such that this can be disabled.

The real solution is for GFFs to be formatted in a more standard way. I have no idea why this one doesn't have transcript features. Of course this isn't a helpful solution for your problem at hand. Unless @tsackton has any other suggestions, this will take a bit of re-factoring this code to solve. I could see us either having an option to disable the parent look-up for transcripts if the user doesn't care about longest transcripts or, probably less ideally, another read-through of the gff to determine if it has "transcript" features at all, and if not to fall back to "gene" features as transcripts with the parent look-up disabled.

tsackton commented 1 year ago

Hi Miles,

Looking into this a bit more, as far as I can tell this is an invalid GFF file, as I don't think CDS features can technically have the gene feature as their parent.

I think the solution here is to fix the GFF before running degenotate. There are several possible ways one could probably do this. One option might be to remove the gene features, and try running gffread to rebuild the transcript/gene structure from just CDS inputs. You might also look at agat which has some GFF cleanup utilities, or pygtftk.

Hope this helps!

Tim

wutian1217 commented 4 months ago

Hello, I also encountered this problem. And I don't know how to solve it. I also tried using gffread to convert gff3 to gff/gtf format, but it did not work. Could you please provide an example demonstrating the format of gff files?

tsackton commented 4 months ago

Here is a standard definition of GFF files. The key for degenotate is that you have gene, transcript and CDS features annotated.

wutian1217 commented 4 months ago

This is the script: python /home/tiawu/miniconda3/envs/degenotate/bin/degenotate.py --overwrite -a sp0013_hap2.gff3 -g sp0013_hap2.fasta -o ./sp0013.
However, it comes with errors:
06.26.2024 06:06:18 Detecting compression of annotation file Success: No compression detected 0.94491 0.00122 06.26.2024 06:06:18 Reading transcripts Success: 18543 transcripts read 2.04724 1.10094 06.26.2024 06:06:20 Reading coding exons Success: 104185 coding exons read 2.79886 0.74911 06.26.2024 06:06:20 Detecting compression of genome FASTA file Success: No compression detected 2.81253 0.0111 06.26.2024 06:06:20 Reading genome FASTA file Success: 76 seqs read 7.19428 4.37993 06.26.2024 06:06:25 Checking headers Success 7.2036 0.00755 06.26.2024 06:06:25 Extracting CDS In progress... Traceback (most recent call last): File "/home/tiawu/miniconda3/envs/degenotate/bin/degenotate.py", line 68, in globs = SEQ.extractCDS(globs); ^^^^^^^^^^^^^^^^^^^^^ File "/home/tiawu/miniconda3/envs/degenotate/lib/python3.12/site-packages/degenotate_lib/seq.py", line 183, in extractCDS globs['annotation'][transcript]['start-frame'] = int(exon_phase[first_exon_genome_start]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: invalid literal for int() with base 10: '.'

I attached some examples of my gff3 files here. Could you help me to check what the problem is it? Thanks! gff3

When I used gffread to transform the gff3 format to the gff format, it also didn't work, it came with the same errors as above. the format of gff is list below: gff

Script: python /home/tiawu/miniconda3/envs/degenotate/bin/degenotate.py --overwrite -a sp0013_hap2.gff -g sp0013_hap2.fasta -o ./sp0013

Errors: 06.26.2024 06:24:44 Detecting compression of annotation file Success: No compression detected 0.05874 0.00942 06.26.2024 06:24:44 Reading transcripts In progress... 0 [] ['ID=model.evm.Chr01.1', 'geneID=gene.evm.Chr01.1', 'gene_name=Augustus%20prediction'] ['Chr01', 'Augustus', 'mRNA', '7180', '7784', '.', '-', '.', 'ID=model.evm.Chr01.1;geneID=gene.evm.Chr01.1;gene_name=Augustus%20prediction']


Error GXF1: Invalid number of IDs found during transcript parent id parsing

tsackton commented 4 months ago

It looks like your problem is that you don’t have frame annotated. Degenotate requires CDS features with properly set reading frame, otherwise it is impossible to tell what is a 1st, 2nd, or 3rd position in a codon.