nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Final GFF is listing human gene names with random underscores at the end #964

Open jdee3 opened 9 months ago

jdee3 commented 9 months ago

Are you using the latest release? Yes.

Describe the bug

I've obtained the final GFF after running 'funannotate annotate'. However, I've noticed that many of the gene names have an underscore and a number appended when they shouldn't. For example, "PNRC2_1", "PNRC2_2", "ACO1_1", "ACO1_2", when the actual gene names should be listed as "PNRC2" and "ACO1". These are human genes, for reference. Can someone please explain what's happening, and how I can get it to print out the gene names properly? Thank you.

What command did you issue?

funannotate annotate -i ./Output --busco_db mammalia --cpus 16 --eggnog ./test.emapper.annotations

OS/Install Information

Checking dependencies for 1.8.15

You are running Python v 3.8.15. Now checking python packages...
biopython: 1.76
goatools: 1.3.1
matplotlib: 3.4.3
natsort: 8.4.0
numpy: 1.24.2
pandas: 2.0.0
psutil: 5.9.5
requests: 2.31.0
scikit-learn: 1.3.0
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/xxx/funannotate_db
$PASAHOME=/home/xxx/miniconda3/envs/funannotate/opt/pasa-2.5.3
$TRINITY_HOME=/home/xxx/miniconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/xxx/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/xxx/miniconda3/envs/funannotate/config/
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies...
PASA: 2.5.3
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.0
blat: BLAT v37x1
diamond: 2.1.8
emapper.py: 2.1.10
ete3: 3.1.3
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-07-20
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.1+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.3.0
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.17
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.12 (Nov 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.14.1+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed

nextgenusfs commented 8 months ago

The scripts will do this when two homologs are present, i.e., do you have a haploid assembly, or are there two copies of some genes?

jdee3 commented 8 months ago

Hi Jon, thanks for the reply!

This is a previously uncharacterized diploid mammal species. The transcript support came from a de novo Trinity transcriptome reconstruction, and the protein support was a concatenated, non-redundant FASTA of Swiss-Prot protein sequences from other mammals. I'm not sure what you mean by two copies of some genes. How would this arise? From the genome assembly?

I greatly appreciate your help, thank you.

nextgenusfs commented 8 months ago

All I was asking was whether your assembly is diploid or haploid. If it's diploid (two copies of each gene), that would explain the behavior.
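
To make the behavior concrete, here is a minimal sketch of how duplicate names can end up with numeric suffixes. This is not funannotate's actual code: the function name `suffix_duplicates` is hypothetical, and the rule of numbering every copy from 1 is an assumption inferred from the names reported in this thread ("PNRC2_1", "PNRC2_2"). "TP53" is only an illustrative single-copy name and does not come from the thread.

```python
from collections import Counter, defaultdict


def suffix_duplicates(names):
    """Append _1, _2, ... to every name that occurs more than once.

    Hypothetical helper for illustration only; the numbering rule is an
    assumption based on the output described in this thread.
    """
    totals = Counter(names)      # how many times each name occurs overall
    seen = defaultdict(int)      # how many copies of each name seen so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            # Names that occur once are left unchanged.
            result.append(name)
    return result


print(suffix_duplicates(["PNRC2", "ACO1", "PNRC2", "ACO1", "TP53"]))
# prints ['PNRC2_1', 'ACO1_1', 'PNRC2_2', 'ACO1_2', 'TP53']
```

Under this assumption, a name only gets a suffix when a second gene model receives the same name, which is what you would expect if both haplotypes are present in the assembly.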

jdee3 commented 7 months ago

Yes, diploid. How can I circumvent this issue though? Can I just rename the genes, removing the underscores?

nextgenusfs commented 7 months ago

I wouldn't call it an "issue": I don't think you can have two identical gene names under NCBI rules. Assemblies were traditionally haploidized, but since I don't work on diploids, I don't know what the current rules are now that technology allows for phased assemblies.
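
Before deciding whether to rename anything, it may help to see which names were suffixed and how many copies each has. The sketch below only inspects the file and does not rename anything. It assumes the gene name is stored in the `Name=` attribute of `gene` rows in the GFF3; the function name `group_suffixed_names`, the `FUN_00000x` IDs, and the example coordinates are hypothetical. A genuine gene name that happens to end in an underscore and digits would also match the pattern, so treat the result as a list of candidates.

```python
import re
from collections import defaultdict

# Matches names such as "PNRC2_1": a base name, an underscore, then digits.
SUFFIX = re.compile(r"^(?P<base>.+)_(?P<copy>\d+)$")


def group_suffixed_names(gff_lines):
    """Group suffixed gene names by base name (hypothetical helper).

    Assumes the name lives in the Name= attribute of 'gene' features.
    """
    groups = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only look at well-formed gene rows
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        match = SUFFIX.match(attrs.get("Name", ""))
        if match:
            groups[match.group("base")].append(attrs["Name"])
    return dict(groups)


# Made-up example rows; IDs and coordinates are illustrative only.
example = [
    "##gff-version 3",
    "scaffold_1\tfunannotate\tgene\t100\t900\t.\t+\t.\tID=FUN_000001;Name=PNRC2_1;",
    "scaffold_7\tfunannotate\tgene\t500\t1300\t.\t-\t.\tID=FUN_000002;Name=PNRC2_2;",
    "scaffold_2\tfunannotate\tgene\t50\t400\t.\t+\t.\tID=FUN_000003;",
]
print(group_suffixed_names(example))
# prints {'PNRC2': ['PNRC2_1', 'PNRC2_2']}
```

To run it on a real file, pass an open file handle instead of the example list. If the copies of a name sit on different scaffolds or haplotigs, that is consistent with both haplotypes being annotated.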