Open dleopold opened 1 year ago
Hi dleopold,
Thank you very much for reporting the issue. I will check it when I have more time.
I agree with your point "It would seem more appropriate to stop at the same position as the reference if possible, particularly if the reference is highly similar" if the annotation of reference is correct (but in most cases, we don't know the truth).
MitoZ checks the initial CDS ranges from the file DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position
(https://github.com/linzhi2013/MitoZ/wiki/Tutorial#3-the-annotation-step), and tries to find the stop codons or T/TA endings.
So, we can look at the file DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position
to confirm that the annotation hits the reference mitogenome (NC_030597). If it hits another reference, we can understand why it happened like now.
Secondly, if it hits the reference genome (NC_030597), the gene length information might start to have an effect on the annotation process (the COX2 on NC_030597 might be too short). You can compare the COX2 on NC_030597 with the COX2 of other closely related species, do a multiple sequence alignment (codon-based version), and see if all species end at the same position.
In this case, the "...position.cds" file shows that the sequence being annotated is hitting itself in the reference database for COX2. Here is the relevant part of the file:
>gi_NC_030597_COX2_Sargocentron_spiniferum_230_aa_Ref1:230aa [mRNA] locus=NC_030597.1:7206:7895:+
ATGGCACATCCCTCTCAACTAGGATTCCAAGATGCGGCTTCACCCGTTAT
AGAAGAGCTCCTTCACTTCCACGACCACGCTTTAATAATCGTCTTTCTAA
TTAGCACACTAGTTCTTTACATTATTGTGGCGATAGTCTCCACTAAACTA
ACCAACAAATATATCCTCGACTCCCAAGAAATCGAAATTATCTGAACAGT
ACTCCCTGCAGTAATTCTTATCCTAATTGCCCTCCCCTCACTACGAATTC
TTTATCTTATGGATGAAATTAATGACCCACACCTAACTATTAAAGCAATA
GGACACCAATGATACTGAAGCTACGAATATACTGATTACGAGGATCTTGG
CTTCGACTCTTATATAATTCCTACCCAAGACCTTACCCCAGGACAATTCC
GCCTCCTAGAAGCAGACCATCGAATAGTTATCCCAATTGAATCCCCTATT
CGTGTTCTAGTCTCAGCCGAAGACGTCCTACACTCATGAGCAGTTCCAGC
ACTAGGCGTTAAAATAGACGCAGTGCCTGGCCGACTAAACCAAACAGCCT
TTATTACATCCCGCCCAGGTGTATTCTACGGTCAATGCTCCGAAATCTGC
GGCGCAAACCACAGCTTTATACCCATCGTCGTTGAAGCTGTCCCACTAGA
ACACTTTGAAAACTGATCCTCTATAATACTTGAAGACGCT
However, although the alignment occurs at positions 7206:7895, the final annotation of the gene is 7206:7916, extending 19bp into the tRNA-Lys despite a T/TA in the 'correct' stop position. As far as I can tell this is not due to an issue with length - I downloaded mitogenomes for 4 similar taxa and they all align very well (codon or nucleotide based alignment) and end at the same T or TA. They are also all the same length, only differing by 1 bp if the stop codon was identified as T or TA.
I do realize that we often do not know the "correct" position when working with novel taxa, but I chose this Ref Seq mitogenome as a reproducible example of something that I am seeing quite frequently in my data (~800 fish mitogenomes). To me, this seems like a clear case where the stop position in the original annotation should be properly re-identified by MitoZ. Perhaps I am missing something about how length enters into the delineation of the start/stop positions? I did not see anything about that in the original paper or documentation, so I am just trying to understand why the annotation is incorrect.
Could you please tell me the accessions for the "4 similar taxa"? thanks! I would like to check them further.
When I check this COX2 of NC_030597.1, I do find a standard stop codon after the extension (the first line below). We can check (at the protein level) if this is also true for the other 4 similar taxa. If all taxa show a similar pattern (conserved proteins at the extended region), then we can say probably that the NCBI Ref annotation for this gene is wrong.
Perhaps in the future, we can add an option to ask MitoZ not to do such a greedy extension if the users don't want, to make the annotation result more similar to the reference gene (although it's not necessary to be correct).
It looks like you compared the results with a similar set of reference sequences as the ones I used. Thank you for looking into it. I don't have any additional data to determine which is the 'correct' annotation. However, it appears that for this gene, in the Holocentridae, the NCBI annotations are fairly consistent, all ending before the neighboring tRNA. This is also consistent with annotations produced by other annotation software (e.g. Mitos2, MitoAnnotator). I have also had many of my NCBI submissions rejected when the annotation differs from all of the annotations of currently accepted submissions in this way (for this and other taxonomic groups / genes). So, whether or not there is uncertainty about which is correct, there does seem to be a prevailing consensus.
Thanks for your information.
Maybe we can try to first determine the boundaries of tRNA genes and then set some constraints on the boundaries of PCGs, to avoid the overlapping of PCGs and tRNAs. (However, I remember that at least of some PCGs of some clades like ATP6 and ATP8 genes, the overlapping of genes happens, right?)
But anyway, if the goal of studies is for phylogenetic analysis using PCGs, or gene order rearrangement analysis, such kind of issues don't matter at all as far as I can see.
MitoZ version?
3.4
How did you install MitoZ? (e.g. Docker, Udocker, Singularity, Conda-Pack, Conda, or source code)
Conda
Did you run a test after your installation, and was the test run okay?
Yes.
The command you used?
e.g., mitoz annotate --outprefix NC_030597 --clade Chordata --fastafiles NC_030597.fasta
Problem description
I have been using MitoZ to annotate a large number of fish mitogenomes. I have noticed that the stop codons are sometimes not properly identified and the gene annotation improperly extends into the next gene/feature. This primarily happens when the stop codons are incomplete (T/TA), and an alternative potnential stop codon could be identified further upstream. It also happens even when there is a highly similar, or even identical reference. For example, I have downloaded the complete fasta for NC_030597 (Sargocentron spiniferum) from genbank, which is also included in the Chordata PCG reference database for MitoZ. I then run MitoZ to re-annotate it. The resulting annotation for COX2 extends 20 bp beyond the (identical) reference and into the neighboring tRNA-Lys. It would seem more appropriate to stop at the same position as the reference if possible, particularly if the reference is highly similar and the annotation would otherwise extend into another element on the same strand.
I have also noticed similar behavior for start codons, though less often.
Log messages from MitoZ (stdout and stderr)
NA