nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

BUSCO not completing #236

Closed: PlantDr430 closed this issue 4 years ago

PlantDr430 commented 5 years ago

Using Funannotate version 1.5.1

I continue to get this warning in my BUSCO logs when I try to run Funannotate.

INFO ** Phase 1 of 2, initial predictions **
INFO ** Step 1/3, current time: 11/22/2018 06:32:47 **
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 11/22/2018 06:32:47
INFO [makeblastdb] New DB name: /data/wyka/funannotate/LM470/busco/tmp/claviceps_purpurea_lm470_1288123762
INFO [makeblastdb] New DB title: /data/wyka/funannotate/LM470/LM470_fun_output/predict_misc/genome.softmasked.fa
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 1000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 1797 sequences in 0.41661 seconds.
INFO Running tblastn, writing output to /data/wyka/funannotate/LM470/busco/run_claviceps_purpurea_lm470/blast_output/tblastn_claviceps_purpurea_lm470.tsv...
WARNING tblastn might have ended prematurely (the result file lacks the expected final line), which could produce incomplete results in the next steps !
INFO ** Step 2/3, current time: 11/22/2018 06:33:15 **
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using Claviceps_purpurea as species:
INFO [augustus] Please find all logs related to Augustus here: /data/wyka/funannotate/LM470/busco/run_claviceps_purpurea_lm470/augustus_output/augustus.log
INFO 11/22/2018 06:33:16 => 0% of predictions performed (959 to be done)

This is the command I am using. Don't mind the %s placeholders; the command is part of a batch script.

nice -n 19 /data/wyka/funannotate-master/funannotate predict -i %s_masked.fasta -o %s_fun_output -s "%s" --isolate %s --protein_evidence /data/wyka/Pconf.fasta --cpus 24 --busco_seed_species Claviceps_purpurea --busco_db sordariomyceta_odb9 --name %s_FUN --optimize_augustus --soft_mask 1000 --min_protlen 100 --other_gff %s_snap.gff3:1 %s_cegma.gff3:1

I never used to get this problem, but I read online somewhere that tblastn has difficulty with multi-threading. That said, I've always multi-threaded in the past without any problems.

Have you come across this problem before?

Or do you know of a way I can run BUSCO separately and then pass the results to the Funannotate command to bypass the BUSCO call, but still train AUGUSTUS through Funannotate?

nextgenusfs commented 5 years ago

Yes, it's the tblastn multithreading error. As far as I know, this has been a problem in BLAST+ going all the way back to around 2.2. It doesn't raise an error; it just silently dies. I usually downgrade tblastn to 2.2.31 in my funannotate environment.

PlantDr430 commented 5 years ago

Hmm okay, I'll have to do that. You would think they would have fixed it by 2.7+. Oh well, thanks for the information.

nextgenusfs commented 5 years ago

Yeah, I know. I assume they know it's a problem, but it isn't easy or transparent to open issues with NCBI. The failure clearly isn't normal behavior, since it seems to happen somewhat randomly; I thought I remembered it running to completion occasionally. I did notice the problem in 2.7.

PlantDr430 commented 5 years ago

Yea, it is random. Sometimes it goes to completion and other times it doesn't. I can't find any method to the madness, except that when I ran your modified BUSCO2 script prior to Funannotate for training SNAP, I had better results using 10 threads. About 75% of the runs went to completion, and I had to redo the ones that failed until all eventually passed.
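
For anyone scripting the same redo-until-it-passes approach, here is a minimal Python sketch. It assumes the tblastn tabular output includes comment lines, so that a finished run ends with a line starting with "# BLAST processed"; that marker, the function names, and the max_attempts limit are assumptions for illustration and are not taken from BUSCO or Funannotate.

```python
from pathlib import Path
from typing import Callable

# Assumption: the result file was written with comment lines (e.g. -outfmt 7),
# so a finished run ends with a "# BLAST processed N queries" trailer line.
FINAL_LINE_PREFIX = "# BLAST processed"


def tblastn_finished(result_file: Path) -> bool:
    """Return True if the last non-empty line looks like the BLAST trailer."""
    if not result_file.is_file():
        return False
    last = ""
    with result_file.open() as handle:
        for line in handle:
            if line.strip():
                last = line.strip()
    return last.startswith(FINAL_LINE_PREFIX)


def run_until_complete(
    run_once: Callable[[], None],
    result_file: Path,
    max_attempts: int = 5,
) -> int:
    """Call run_once() until result_file is complete; return attempts used.

    run_once is a hypothetical callable that launches tblastn (or the whole
    BUSCO step) and writes result_file.
    """
    for attempt in range(1, max_attempts + 1):
        run_once()
        if tblastn_finished(result_file):
            return attempt
    raise RuntimeError(
        f"{result_file} still incomplete after {max_attempts} attempts"
    )
```

The check only looks at the final line, so it cannot tell whether hits in the middle of the file are missing; it mirrors the condition described in the warning and nothing more.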

nextgenusfs commented 5 years ago

I tried a while ago to replace tblastn with diamond, but I could never get the same level of accuracy. BUSCO only uses tblastn as a preliminary screen to find regions to run Augustus on.

PlantDr430 commented 5 years ago

That's a bummer. Maybe they will eventually fix it, but we will see. On another note, have you thought about adding SNAP as another ab initio gene prediction program in the Funannotate pipeline? Or are you more of a proponent of using specific or fewer ab initio programs?

nextgenusfs commented 5 years ago

I suppose I could detect the tblastn version and run it single-threaded if it is greater than version 2.2.31. The downside would obviously be speed, but at least it would be accurate.

Yeah, SNAP is something I could add, and/or GlimmerHMM. If you want to open a new issue with a request I can look into it. I'm not sure when I will have time, but I can get it done. I didn't include it because I don't like how SNAP produces a lot of gene models with introns less than 10 bp, which aren't considered “real” by GenBank, so they need to be filtered out. It also tended to produce split or fragmented predictions on my test data. SNAP is actually one of the main reasons that many MAKER predictions fail NCBI submission checks. Having said that, I now have a bunch of filtering in place, so I don't think it will be an issue anymore. EvidenceModeler works best when given as many ab initio models as possible, so I think it could help.
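
To make the short-intron problem concrete, here is a small Python sketch of that kind of check. It is not the filtering code in Funannotate; the function names are hypothetical, the exon coordinates are assumed to be 1-based inclusive (start, end) pairs as in GFF3, and the 10 bp threshold is the one mentioned above.

```python
from typing import List, Tuple

# Threshold taken from the comment above: introns under 10 bp are rejected.
MIN_INTRON_LENGTH = 10

Exon = Tuple[int, int]  # assumed 1-based inclusive (start, end), as in GFF3


def intron_lengths(exons: List[Exon]) -> List[int]:
    """Return the length of each gap between consecutive exons."""
    ordered = sorted(exons)
    return [
        nxt[0] - prev[1] - 1  # bases strictly between the two exons
        for prev, nxt in zip(ordered, ordered[1:])
    ]


def has_short_intron(exons: List[Exon], min_len: int = MIN_INTRON_LENGTH) -> bool:
    """Return True if any intron in the model is shorter than min_len."""
    return any(length < min_len for length in intron_lengths(exons))


# Hypothetical gene models: one with a 5 bp intron, one with a 50 bp intron.
models = {
    "model_a": [(1, 100), (106, 200)],
    "model_b": [(1, 100), (151, 200)],
}
kept = {name: ex for name, ex in models.items() if not has_short_intron(ex)}
```

Single-exon models have no introns and therefore always pass this particular check.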

PlantDr430 commented 5 years ago

Yea, that is an option. A single thread would take a while for me with 54 genomes, but I guess if you only use a single thread for the BUSCO run and then revert to multi-threading afterwards, it wouldn't be so bad.

Your use of EvidenceModeler is what had me try to pass as many ab initio models as possible. SNAP, though, seems to be difficult to get into the correct file formats. After running BUSCO on my genomes I had to write a script to parse the AUGUSTUS GFFs from BUSCO into CEGMA GFF format, then use the cegma2zff converter from MAKER to run SNAP, and then use two different scripts that I found online to turn the SNAP output into GFF3 format so I could pass it into Funannotate without a problem. So in the end it is possible, but it might be a lot of work. Then again, I only started programming this year, so you might be able to do it in fewer steps. GlimmerHMM might be a better option.

nextgenusfs commented 5 years ago

Yeah, it would only run tblastn for BUSCO single-threaded; BLAST isn't used anywhere else in the current pipeline. I could modify the internal BUSCO script to detect the tblastn version and set the threads for that step appropriately.

I have some converters for the zff format somewhere, so most of that code likely exists. Basically it would just use the BUSCO results to train SNAP/Glimmer; this would be superseded if RNA-seq data exists, as it would be better to use PASA models for training.

PlantDr430 commented 5 years ago

Yea, I agree: if RNA-seq data is used, the PASA models would be better for training. I just thought that since BUSCO is already being run and SNAP is fairly quick, it might improve the pipeline overall. It is already a great pipeline, a lot easier to run than MAKER, and it produces a nice .gbk file at the end for submission! I'll open a ticket as a reminder.

nextgenusfs commented 5 years ago

This commit https://github.com/nextgenusfs/funannotate/commit/1e7a0906b2015b673db8d9125f96dcb35dec7f51 checks the tblastn version and runs tblastn single-threaded in BUSCO if the version is greater than 2.2.31.
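
For readers who want the general idea without opening the commit, below is a rough Python sketch of a version check of this kind. It is not the code from the commit: the function names are hypothetical, it assumes that `tblastn -version` prints a line such as "tblastn: 2.7.1+", and falling back to one thread when the version cannot be parsed is a choice made for this sketch.

```python
import re
import subprocess
from typing import Optional, Tuple

# Last release treated as safe for multithreaded tblastn in this thread.
LAST_SAFE_VERSION = (2, 2, 31)


def parse_blast_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the first major.minor.patch triple from version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def tblastn_threads(version_text: str, requested_cpus: int) -> int:
    """Return the thread count to pass to tblastn for this version string."""
    version = parse_blast_version(version_text)
    # Unknown or newer-than-2.2.31 versions fall back to a single thread.
    if version is None or version > LAST_SAFE_VERSION:
        return 1
    return requested_cpus


def installed_tblastn_version_text() -> str:
    """Return the raw output of `tblastn -version` (needs tblastn on PATH)."""
    result = subprocess.run(
        ["tblastn", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

A caller would do something like `threads = tblastn_threads(installed_tblastn_version_text(), 24)` and pass the result to the tblastn `-num_threads` option. Comparing versions as integer tuples avoids the string-comparison trap where "2.10.0" would sort before "2.2.31".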