nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

FUN gene ID numbering seems inconsistent #898

Open DaRinker opened 1 year ago

DaRinker commented 1 year ago

I'm seeing inconsistent behavior in the default gene numbering in my annotations. For example:

`$ grep FUN_ DTO15.proteins.fa | head`

DTO15_FUN_000001-T1 locus=DTO15_FUN_000001
DTO15_FUN_000002-T1 locus=DTO15_FUN_000002
DTO15_FUN_000003-T1 locus=DTO15_FUN_000003
DTO15_FUN_000004-T1 locus=DTO15_FUN_000004
DTO15_FUN_000005-T1 locus=DTO15_FUN_000005
DTO15_FUN_000006-T1 locus=DTO15_FUN_000006
DTO15_FUN_000007-T1 locus=DTO15_FUN_000007
DTO15_FUN_000008-T1 locus=DTO15_FUN_000008
DTO15_FUN_000009-T1 locus=DTO15_FUN_000009
DTO15_FUN_000010-T1 locus=DTO15_FUN_000010

`$ grep FUN_ DTO15.proteins.fa | tail`

DTO15_FUN_12106-T1 locus=DTO15_FUN_12106
DTO15_FUN_12107-T1 locus=DTO15_FUN_12107
DTO15_FUN_12108-T1 locus=DTO15_FUN_12108
DTO15_FUN_12109-T1 locus=DTO15_FUN_12109
DTO15_FUN_12110-T1 locus=DTO15_FUN_12110
DTO15_FUN_12111-T1 locus=DTO15_FUN_12111
DTO15_FUN_12112-T1 locus=DTO15_FUN_12112
DTO15_FUN_12113-T1 locus=DTO15_FUN_12113
DTO15_FUN_12114-T1 locus=DTO15_FUN_12114
DTO15_FUN_12115-T1 locus=DTO15_FUN_12115

To me, the 5-digit gene IDs (without the leading zero) seem strange, especially given that I ALSO have some gene IDs with numbers >10,000 in the format "DTO15_FUN_011977-T1".

This makes the output sometimes hard to parse. Is this some sort of bug/error or is it meaningful? Will I be messing anything up if I attempt to add back in any "missing" zeros using a custom script?

This was all done with funannotate 1.8.14. The commands I used in my workflow (in order) were:

`funannotate clean`

`funannotate sort`

`funannotate mask`

`funannotate train`

`funannotate predict`

`funannotate update`

`funannotate annotate`

nextgenusfs commented 1 year ago

Not sure how this is possible, except that you should use only a single underscore; then it should pad to the same number of digits every time. If you pass --name DTO15, it will generate gene names with the locus tag DTO15, followed by an underscore and sequential numbering.

DaRinker commented 1 year ago

I just checked my output directories. This phenomenon is NOT present in the predict_results outputs, but it IS present in all the annotate outputs (_results and _misc). I can trace the same gene going from having the zero-padded value to losing the zero. Does that narrow down the possibilities at all?

nextgenusfs commented 1 year ago

Did you run funannotate update?

The names do not/shouldn't change. What happens when the PASA mediated update script is run is that it tries to determine the last numeric gene model in your existing annotation -- it does this by splitting at underscore and then resumes counting from there if additional gene models are predicted/added by PASA. But it is slightly more complicated because PASA adds another locus_tag naming issue to deal with in the update script. There are various different parts of the code base where it is necessary to be able to parse gene model names reliably so it could be happening in a number of places. As I said, just run predict without an extra underscore and re-run the rest of the scripts (you'll need to remove results downstream of predict and re-run) and it should all be fixed.
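As a rough illustration of the "split at underscore and resume counting" idea described above, here is a minimal sketch. It is not funannotate's actual code: the function names `last_gene_number` and `next_gene_id`, the `-T<n>` transcript suffix handling, and the default width of 6 are assumptions made for this example.

```python
import re


def last_gene_number(gene_ids):
    """Return the highest numeric suffix found after the final underscore."""
    numbers = []
    for gene_id in gene_ids:
        # Drop a trailing transcript suffix such as "-T1" (assumed format).
        locus = re.sub(r"-T\d+$", "", gene_id)
        # Split on the LAST underscore so the tag itself may contain underscores.
        suffix = locus.rsplit("_", 1)[-1]
        if suffix.isdigit():
            numbers.append(int(suffix))
    return max(numbers, default=0)


def next_gene_id(locus_tag, gene_ids, width=6):
    """Build the next sequential ID, zero-padded to a fixed width."""
    return f"{locus_tag}_{last_gene_number(gene_ids) + 1:0{width}d}"
```

In this sketch the number is taken from the text after the last underscore, so a locus tag that itself contains an underscore still parses; code that splits at the first underscore would instead pick up part of the tag.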

I used to have strict requirements on the -n,--name option but then people complained. I probably should not have given in.....

DaRinker commented 1 year ago

Yes, I did run funannotate update. (My ordered list of commands is above.)

I ran the same analysis over 16 genomes and they all show some of these oddball numberings.

I'm using a Singularity image (if that could be relevant).

nextgenusfs commented 1 year ago

Ah yes, sorry. I was responding on my phone with a toddler yelling for my attention and didn't see the whole post.

DaRinker commented 1 year ago

What I'm hearing is that this is not intended behavior and that the missing padding is not meant to signify anything.

If that's correct, I will go back and add in the missing zeros post-annotation.

nextgenusfs commented 1 year ago

Sure. I pushed a change to the update script yesterday that might fix the PASA-related numbering/naming. But funannotate should be producing sequential, "justified" numerical gene names; there should not be differences in padding. The default is to produce a 6-character integer, which should be enough gene models for nearly every genome.

It's not the easiest thing to rename the gene models in the outputs, as there are numerous files that this would affect. You'd perhaps save yourself time in the long run by just re-running it....
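If re-running is not an option, the post-hoc padding fix discussed in this thread could look roughly like the sketch below. The `DTO15_FUN_` prefix is taken from the output posted above; the function name `pad_gene_ids`, the regular expression, and the width of 6 are assumptions for illustration, and the same rewrite would have to be applied consistently to every output file that carries gene IDs.

```python
import re

# Matches IDs such as DTO15_FUN_12106 or DTO15_FUN_000001 (prefix from this thread).
GENE_ID = re.compile(r"(DTO15_FUN_)(\d+)")


def pad_gene_ids(text, width=6):
    """Rewrite every matching gene ID so its number is zero-padded to `width`."""
    def repad(match):
        prefix, number = match.group(1), int(match.group(2))
        return f"{prefix}{number:0{width}d}"

    return GENE_ID.sub(repad, text)
```

Before rewriting anything, it would be worth checking that no two distinct IDs collapse to the same padded ID (for example, if both `DTO15_FUN_12106` and `DTO15_FUN_012106` exist as different gene models).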