nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

ERROR 30: creating acceptor sites. Not enough data or check input files for wrong format. #864

Closed fereyj closed 1 year ago

fereyj commented 1 year ago

Are you using the latest release?
I just upgraded to the latest version yesterday (02/09/2023).

Describe the bug
When I try to run funannotate predict, it fails because it cannot find the "full_table_aspergillus_nidulans.tsv" file. The file is present in the predict_misc/busco path but not in the predict_misc/busco_proteins path.

If I copy "full_table_aspergillus_nidulans.tsv" into the busco_proteins path, I receive another error.

What command did you issue?
First command:
funannotate predict -i /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/DF014B0002_masked -o /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/ -s "Aspergillus nidulans"

Second command, after copying the full_table file:
funannotate predict -i /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/DF014B0002_masked -o /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/ -s "Aspergillus nidulans"

Logfiles
After first command:

[Feb 09 10:15 AM]: OS: Ubuntu 22.04, 8 cores, ~ 33 GB RAM. Python: 3.8.15
[Feb 09 10:15 AM]: Running funannotate v1.8.14
[Feb 09 10:15 AM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[Feb 09 10:15 AM]: Skipping CodingQuarry as no --rna_bam passed
[Feb 09 10:15 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     pretrained
  glimmerhmm   busco
  snap         busco
[Feb 09 10:15 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Feb 09 10:15 AM]: Genome loaded: 1 scaffolds; 77,675 bp; 1.36% repeats masked
[Feb 09 10:15 AM]: Mapping 555,918 proteins to genome using diamond and exonerate
[Feb 09 10:17 AM]: Found 2,528 preliminary alignments with diamond in 0:01:07 --> generated FASTA files for exonerate in 0:00:07
Progress: 2528 complete, 0 failed, 0 remaining
[Feb 09 10:18 AM]: Exonerate finished in 0:01:33: found 11 alignments
[Feb 09 10:18 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Feb 09 10:18 AM]: 0 valid BUSCO predictions found, validating protein sequences
Traceback (most recent call last):
  File "/home/igseq/miniconda3/envs/funannotate/bin/funannotate", line 8, in <module>
    sys.exit(main())
  File "/home/igseq/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/home/igseq/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/predict.py", line 1359, in main
    buscoProtComplete = lib.getCompleteBuscos(buscoProtOutput,
  File "/home/igseq/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/library.py", line 5401, in getCompleteBuscos
    with open(input, 'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/busco_proteins/run_aspergillus_nidulans/full_table_aspergillus_nidulans.tsv'

After second command:

[Feb 09 10:35 AM]: OS: Ubuntu 22.04, 8 cores, ~ 33 GB RAM. Python: 3.8.15
[Feb 09 10:35 AM]: Running funannotate v1.8.14
[Feb 09 10:35 AM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[Feb 09 10:35 AM]: Skipping CodingQuarry as no --rna_bam passed
[Feb 09 10:35 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     pretrained
  glimmerhmm   busco
  snap         busco
[Feb 09 10:35 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Feb 09 10:35 AM]: Genome loaded: 1 scaffolds; 77,675 bp; 1.36% repeats masked
[Feb 09 10:35 AM]: Existing protein alignments found: /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/protein_alignments.gff3
[Feb 09 10:35 AM]: Existing BUSCO results found: /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/busco.final.gff3 containing 0 predictions
[Feb 09 10:35 AM]: Existing Augustus annotations found: /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/augustus.gff3
[Feb 09 10:35 AM]: Pulling out high quality Augustus predictions
[Feb 09 10:35 AM]: Found 0 high quality predictions from Augustus (>90% exon evidence)
[Feb 09 10:35 AM]: Existing snap predictions found /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/snap-predictions.gff3
[Feb 09 10:35 AM]: 0 predictions from SNAP
[Feb 09 10:35 AM]: SNAP prediction failed, moving on without result
[Feb 09 10:35 AM]: Running GlimmerHMM gene prediction, using training data: /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/busco.final.gff3
[Feb 09 10:35 AM]: CMD ERROR: trainGlimmerHMM /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/genome.softmasked.fa /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/glimmer.exons -d /home/igseq/Documents/Corteva_projects/Corteva_fungi/Original_fasta/funannotate/predict_misc/glimmerhmm
[Feb 09 10:35 AM]: ERROR 30: creating acceptor sites. Not enough data or check input files for wrong format.

OS/Install Information

Checking dependencies for 1.8.14

You are running Python v 3.8.15. Now checking python packages...
biopython: 1.80
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.2.0
numpy: 1.24.1
pandas: 1.5.3
psutil: 5.9.4
requests: 2.28.2
scikit-learn: 1.2.1
scipy: 1.10.0
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.050
DBI: 1.643
DB_File: 1.855
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.12
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/igseq/Documents/funannotate_db
$PASAHOME=/home/igseq/miniconda3/envs/funannotate/opt/pasa-2.5.2
$TRINITY_HOME=/home/igseq/miniconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/igseq/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/igseq/miniconda3/envs/funannotate/config/
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies...
ERROR: pslDnaFiler found but error running: pslCDnaFilter: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory

PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v35
diamond: 2.0.15
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2021-08-25
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.515 (2023/Jan/15)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.24-r1122
pigz: 2.6
proteinortho: 6.1.7
salmon: salmon 0.14.1
samtools: samtools 1.16.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.11 (Oct 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: emapper.py not installed
ERROR: gmes_petap.pl not installed
ERROR: pslCDnaFilter not installed
ERROR: signalp not installed

nextgenusfs commented 1 year ago

Did you re-build the database after upgrading funannotate?

fereyj commented 1 year ago

I had not before posting the question. I tried updating the database with "funannotate setup -i all --update" and when that didn't work, I deleted the "funannotate_db" folder and reinstalled the database with "funannotate setup -i all" but that did not work either.

nextgenusfs commented 1 year ago

And did you then try a clean output directory? The error here is that BUSCO failed; this might be related to the Augustus install and the internal BUSCO code. I assume you get the same error in the test suite, i.e. funannotate test -t busco --cpus 12?

hyphaltip commented 1 year ago

Yeah, this looks like a classic BUSCO failure. Your config messages indicate augustus=3.5, but I think you need to downgrade to augustus=3.4.

The main logfiles or the logfiles inside the busco dir might give you some clues, but I think this is probably the issue.

nextgenusfs commented 1 year ago

@hyphaltip this has been my outstanding question, whether the current master code is functional with Augustus v3.5 -- I tried to make it work but I've not been able to build a local way to test it on my Mac.

nextgenusfs commented 1 year ago

Eventually we will move to this codebase for BUSCO https://github.com/nextgenusfs/buscolite (I re-wrote it for simplification to use only the parts we care about), but it is not integrated yet.

nextgenusfs commented 1 year ago

But basically since the pptx module was patched in Augustus v3.5 -- we need to make it work with current version of funannotate so we can pin a functional version in the conda recipe.

fereyj commented 1 year ago

I tried "funannotate test -t busco --cpus 12" and that worked. However, I still can't get my own fasta file to work. I have attached two log files, in case either of those are helpful.

I will downgrade to 3.4 and see if it works now.

augustus.log busco.log

nextgenusfs commented 1 year ago

Oh, is this your "genome", 77 kb and 1 contig? It's failing because it isn't finding any of the BUSCO models?

[Feb 09 10:35 AM]: Genome loaded: 1 scaffolds; 77,675 bp; 1.36% repeats masked

nextgenusfs commented 1 year ago

Admittedly, it should fail/error out better than that. But if you are trying to annotate just a single contig, you need to have a pre-trained species in your database. For example, you can type funannotate species to see what is available. BUSCO is needed to generate a training set for automatic training of GlimmerHMM and SNAP, so that's why it is trying to run. The code could probably be improved with some more error checks for this case.
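As a rough illustration of the kind of pre-flight check mentioned above, here is a minimal sketch that counts the records and bases in a FASTA stream and flags very small inputs. It is not funannotate code; the function names and the size cutoff are hypothetical and chosen only for the example.

```python
from io import StringIO


def fasta_summary(handle):
    """Return (number_of_records, total_bases) for a FASTA stream."""
    n_records = 0
    total_bp = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1          # header line starts a new record
        else:
            total_bp += len(line)   # sequence line, count its characters
    return n_records, total_bp


# Hypothetical cutoff for illustration only; not a funannotate setting.
MIN_BP_FOR_TRAINING = 1_000_000


def looks_too_small(total_bp, min_bp=MIN_BP_FOR_TRAINING):
    """True when the input is probably too small to train predictors from."""
    return total_bp < min_bp


# Tiny in-memory example so the sketch runs without any files.
demo = StringIO(">contig1\nACGTACGT\nACGT\n>contig2\nNNNN\n")
records, bases = fasta_summary(demo)
print(records, bases, looks_too_small(bases))
```

A 77,675 bp single-contig input like the one in this thread would be flagged by such a check, whatever sensible cutoff is chosen.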

fereyj commented 1 year ago

Oh yes, sorry, I should have clarified I am not trying to annotate a full genome (although I plan to do so in the future). I thought I had included a pre-trained species with the '-s "Aspergillus nidulans"' option, but is that not the case?

nextgenusfs commented 1 year ago

So it would only be able to run Augustus because you haven't trained a model for the other gene callers. That sort of defeats the purpose of funannotate utilizing EVM and multiple sources of ab initio predictions. The workaround would be to run it on a closely related full genome. When it's finished, it will output a parameters.json file, which you can then either install into the database to be used again (i.e. for how you are trying to annotate a single contig) or pass at runtime, which will then use those training settings.

fereyj commented 1 year ago

If I understood this correctly, I need to run predict on a genome with the command $ funannotate predict -i fungal genome. I then run predict on the contig again with $ funannotate predict -i single contig -s "fungal species" -p parameters.json. Is that right?

And this might be beyond the scope of the initial question, but I have 7 contigs (all with BGCs identified by a collaborator). As I mentioned, I am getting an error when I submit the first one for prediction. However, if I combine the 7 contigs into one masked fasta file, then funannotate will annotate all 7 contigs, including the one that caused an error when submitted alone. Is that because it is recognized as a multi-fasta file, which the instructions state should be the input?

nextgenusfs commented 1 year ago

It's not an issue with a single contig; it's an issue that, in order to generate a training set of gene models for the ab initio predictors, funannotate uses BUSCO. BUSCO finds conserved gene models found across the fungal tree of life, and it needs to find >= 200 gene models in order to train the gene predictors. You will likely not have any of those models if you are working with contigs containing gene clusters, so even if you combined 5000 BGCs it will likely fail at the training stage because it is unable to find those general/conserved BUSCO gene models.

For example, if I run funannotate species on my dev instance it will show me the training sets available. Note that by default it parses the ones that are shipped with Augustus.
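To see how far a run is from that threshold, one could count the complete BUSCOs in a full table file. The sketch below assumes a tab-separated table with "#" comment lines, the BUSCO id in the first column and the status in the second; that layout is an assumption about the BUSCO output, and the function is illustrative, not part of funannotate.

```python
import csv
from io import StringIO

MIN_MODELS = 200  # threshold quoted in the comment above


def count_complete_buscos(handle):
    """Count distinct BUSCO ids whose status column reads 'Complete'."""
    complete = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                      # skip blank and comment lines
        if len(row) > 1 and row[1] == "Complete":
            complete.add(row[0])
    return len(complete)


# Made-up table contents, used only so the sketch runs without a file.
demo = StringIO(
    "# Busco id\tStatus\tContig\tStart\tEnd\tScore\tLength\n"
    "EOG0001\tComplete\tcontig1\t10\t900\t512.3\t297\n"
    "EOG0002\tMissing\n"
    "EOG0003\tFragmented\tcontig1\t1200\t1500\t88.1\t100\n"
)
n_complete = count_complete_buscos(demo)
print(n_complete, "complete;", "enough" if n_complete >= MIN_MODELS else "too few to train")
```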

$ funannotate species
  Species                                    Augustus               GeneMark          Snap             GlimmerHMM       CodingQuarry   Date      
  test_genome                                augustus pre-trained   None              None             None             None           2019-10-24
  test_species                               augustus pre-trained   selftraining ES   BUCSCO dikarya   BUCSCO dikarya   None           2019-10-24
  yeast                                      augustus pre-trained   None              None             None             None           2019-10-24
  zebrafish                                  augustus pre-trained   None              None             None             None           2022-10-18

Options for this script:
 To print a parameter file to terminal:
   funannotate species -p myparameters.json
 To print the parameters details from a species in the database:
   funannotate species -s aspergillus_fumigatus
 To add a new species to database:
   funannotate species -s new_species_name -a new_species_name.parameters.json

So since I have a training set for "test_species", I could call funannotate predict like this and it would just run without running busco/training:

funannotate predict -i genome.fasta -s "Genus species" --augustus_species test_species -o annotate

If it's not installed in your database, that's fine; you can then use the parameters.json file directly by passing its full path to the -p argument.
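The two ways of supplying training data described above can be sketched as a small command builder. Only flags that appear in this thread are used (-i, -o, -s, --augustus_species, -p); the helper name and the example paths are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def build_predict_cmd(genome, outdir, species, augustus_species=None, parameters=None):
    """Assemble a funannotate predict command line as a list of arguments."""
    cmd = ["funannotate", "predict", "-i", genome, "-o", outdir, "-s", species]
    if augustus_species:
        # Use a training set already installed in the funannotate database.
        cmd += ["--augustus_species", augustus_species]
    if parameters:
        # Or point directly at a parameters.json file from an earlier run.
        cmd += ["-p", parameters]
    return cmd


# Option 1: training set installed in the database.
print(shlex.join(build_predict_cmd(
    "genome.fasta", "annotate", "Genus species", augustus_species="test_species")))

# Option 2: parameters file passed at runtime (placeholder path).
print(shlex.join(build_predict_cmd(
    "genome.fasta", "annotate", "Genus species", parameters="/full/path/to/parameters.json")))
```

Building the command as a list keeps the quoted species name as a single argument, which avoids shell quoting mistakes if the list is later handed to subprocess.run.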

fereyj commented 1 year ago

I had an entire fungal genome training overnight and it finally finished. I added it to species and it shows it is trained, like with "BUCSCO dikarya", the same as the test_species listed above. I then used it to try the command you suggested. It seemed like it was going to work and then it returned that "/predict_misc/hints.all.tmp" could not be found. I found that error in a previous question here and was able to get a successful prediction after using the "touch /predict_misc/hints.all.tmp" method. It's an easy fix, but should I be concerned that it is compromising my results?

And I'm a little confused as to what the difference is between the -s and --augustus_species options, since funannotate species seems to list the augustus_species....

nextgenusfs commented 1 year ago

https://funannotate.readthedocs.io/en/latest/predict.html#explanation-of-inputs-and-options

Without more context I have no idea what the "predict_misc/hints.all.tmp" error is -- it's related to Augustus....

fereyj commented 1 year ago

Oh okay, so the -s option doesn't affect the function, it only affects the naming of the output files.

The full context of the error is below. I used the solution suggested here to "force it." https://github.com/nextgenusfs/funannotate/issues/482

~$ funannotate predict -i Masked_fasta/DF936Z0027-0008_masked.fasta --augustus_species aspergillus_nidulans -s "Aspergillus nidulans" -o Masked_fasta/DF936Z0027-0008_annotate

[Feb 10 03:46 PM]: OS: Ubuntu 22.04, 8 cores, ~ 33 GB RAM. Python: 3.8.15
[Feb 10 03:46 PM]: Running funannotate v1.8.14
[Feb 10 03:46 PM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[Feb 10 03:46 PM]: Skipping CodingQuarry as no --rna_bam passed
[Feb 10 03:46 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     pretrained
  glimmerhmm   busco
  snap         busco
[Feb 10 03:46 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Feb 10 03:46 PM]: Genome loaded: 1 scaffolds; 106,338 bp; 2.61% repeats masked
[Feb 10 03:46 PM]: Mapping 555,918 proteins to genome using diamond and exonerate
[Feb 10 03:47 PM]: Found 1,436 preliminary alignments with diamond in 0:01:01 --> generated FASTA files for exonerate in 0:00:01
Progress: 1436 complete, 0 failed, 0 remaining
[Feb 10 03:48 PM]: Exonerate finished in 0:00:50: found 0 alignments
Traceback (most recent call last):
  File "/home/igseq/miniconda3/envs/funannotate/bin/funannotate", line 8, in <module>
    sys.exit(main())
  File "/home/igseq/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/home/igseq/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/predict.py", line 1083, in main
    lib.sortHints(allhintstmp, allhintstmp_sort)
  File "/home/igseq/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/library.py", line 8825, in sortHints
    with open(input, 'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'Masked_fasta/DF936Z0027-0008_annotate/predict_misc/hints.all.tmp'
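For reference, the touch workaround mentioned earlier in this thread only creates an empty file at the path that sortHints tries to open. A Python equivalent might look like the sketch below; the helper name is hypothetical and the demo writes into a temporary directory rather than a real output folder. Since exonerate reported 0 alignments in the log above, an empty hints file may simply reflect that there was no protein evidence to write, but that reading is an assumption and is not confirmed in this thread.

```python
import tempfile
from pathlib import Path


def ensure_empty_file(path):
    """Create parent directories and an empty file if it does not already exist."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch(exist_ok=True)   # leaves an existing file's contents untouched
    return path


# Demo in a throwaway directory; a real run would use <output>/predict_misc/.
with tempfile.TemporaryDirectory() as tmp:
    hints = ensure_empty_file(Path(tmp) / "predict_misc" / "hints.all.tmp")
    print(hints.name, hints.stat().st_size)
```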

nextgenusfs commented 1 year ago

I think the above error should be fixed in latest, try to update and see if this hints error is gone:

python -m pip install git+https://github.com/nextgenusfs/funannotate.git --upgrade --force --no-deps

Next I'll try to fix the initial BUSCO 0 models error to exit properly.

fereyj commented 1 year ago

Yes, this update works at preventing the hints error. For what I am trying to do, everything seems to be working now. I can't thank you enough for all of your help, thank you so much.

abdo3a commented 3 weeks ago

Hello, I had the same error with GlimmerHMM gene prediction; it happened after using nanopore RNA long-read transcripts as evidence. Here is my log file. funannotate-predict.log