nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Running PASA annotation comparison step 1 using only one core #243

EarlyEvol closed this issue 5 years ago

EarlyEvol commented 5 years ago

Hi Jon,

I am running funannotate update following predict on three different species. For one species with 19K gene models, it finished in a total of 8 hours. For the others, which have more gene models (37K), it ran for 24 hours and never got past "Running PASA annotation comparison step 1", and when I looked at the CPU usage for the node, it was 0%.

I killed it and restarted; now it is still on the same comparison step and using just half of a CPU for Launch_PASA_pipeline.pl. Is this just an inherent bottleneck in the pipeline, or has something gone wrong?

funannotate-update.log shows that Launch_PASA_pipeline.pl was called with --CPU 27.

I just rechecked the CPU usage and it is down to 6-15%. I'm kind of at a loss here; it seems like PASA was called with the correct parameters, so I don't know what could be slowing it down.

Thanks, Earl

nextgenusfs commented 5 years ago

Yes, this is an issue with PASA and the sqlite3 database; I think the developers have turned off multithreading in the newer versions. Using MySQL might be faster? I’m seeing the same thing, with update taking a day or two to complete.

EarlyEvol commented 5 years ago

Dang. Have you tried both MySQL and SQLite? If it is faster, it might be worth it to just run PASA again with MySQL. The only reason I used SQLite was because it looked like lots of people had trouble getting PASA to work with MySQL.

nextgenusfs commented 5 years ago

I have not tried MySQL on our cluster as I didn’t want to go through the pain of getting it set up. It was hard enough to set up on my Mac, where I had control over everything. You could search the issues on the PASA page and see if there is a solution or not.

EarlyEvol commented 5 years ago

Looking at their site now. I'll let you know what I find.

EarlyEvol commented 5 years ago

Here is some info from Brian. It looks like there aren't really any faster options for sqlite.

https://github.com/PASApipeline/PASApipeline/issues/94

nextgenusfs commented 5 years ago

Thanks for looking into it. Going to leave this open so it's easier for others to find.

lixx3627 commented 5 years ago

Hi Jon,

I'm also trying to run the funannotate update step after funannotate predict finished. I had ~37k gene models from the prediction step. In my first trial I ran funannotate update -i fun --cpus 20; it ran for ~20 hours and exited with the error "Cannot find program fasta", as follows:

* Running CMD: /panfs/roc/msisoft/pasa/2.3.3//scripts/cDNA_annotation_comparer.dbi -G /path/to/update_misc/genome.fa --CPU 20 -M '/path/to/training/pasa/Pgt_21_0'  > /path/to/pasa_run.log.dir/Pgt_21_0.annotation_compare.21871.out
Use of uninitialized value $prog in scalar chomp at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/fasta.ph line 70.
Thread 32 terminated abnormally: Cannot find program fasta
ERROR, thread 32 exited with error Cannot find program fasta

Error, there were 1 threads (contig jobs) that failed...  See error messages above in order to troubleshoot furtherFailed thread (32) info:
Thread(32)      FAILED  CMD: unknown    Time to complete: 14356 seconds
Error, cmd: /panfs/roc/msisoft/pasa/2.3.3//scripts/cDNA_annotation_comparer.dbi -G /path/to/whole_assembly_only/update_misc/genome.fa --CPU 20 -M '/path/to/training/pasa/Pgt_21_0'  > pasa_run.log.dir/Pgt.annotation_compare.21871.out died with ret 256 No such file or directory at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/Pipeliner.pm line 186.
        Pipeliner::run(Pipeliner=HASH(0xe89218)) called at /panfs/roc/msisoft/pasa/2.3.3/Launch_PASA_pipeline.pl line 1044
[12/20/18 16:07:23]: PASA failed, check log, exiting

But we do have fasta36 installed within the funannotate pipeline, and the update step seems to have started already, because I can see in pasa_run.log.dir that some of the contigs were analyzed for UTRs; the program stopped after ~20 hours. So I think this run started "Running PASA annotation comparison step 1" but was killed in the middle of the analysis.

When I restarted the funannotate update step to troubleshoot, the job was killed immediately when it started "Running PASA annotation comparison step 1", unlike my first trial. This time the error is "database is locked":

DBD::SQLite::db do failed: database is locked at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/DB_connect.pm line 221.
failed query: <insert into annotation_admin (date) values (CURRENT_TIMESTAMP)>  values: 
Errors: database is locked
 at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/DB_connect.pm line 233.
        DB_connect::RunMod(DB_connect=HASH(0x23902f8), "insert into annotation_admin (date) values (CURRENT_TIMESTAMP)") called at /panfs/roc/msisoft/pasa/2.3.3//scripts/Annotation_store_preloader.dbi line 52
Sorry, couldn't retrieve an ID for the annotation version. :( 

There's a PASA issue about this (https://github.com/PASApipeline/PASApipeline/issues/74) that suggests using a single CPU, so I tried "funannotate update -i fun --cpus 1". But it seems the CPU count was not actually set to 1, because the log file shows the program ran with --CPU 2:

[12/21/18 16:42:20]: /panfs/roc/msisoft/pasa/2.3.3/Launch_PASA_pipeline.pl -c /path/whole_assembly_only/update_misc/pasa/annotCompare.txt -g /path/whole_assembly_only/update_misc/genome.fa -t /path/whole_assembly_only/update_misc/trinity.fasta.clean -A -L --CPU 2 --annots /path/whole_assembly_only/update_misc/genome.gff3

Do you have any thoughts on how to troubleshoot this?

Thank you for your help!

Feng
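
To confirm which --CPU value actually reached PASA, the funannotate log can be searched for the Launch_PASA_pipeline.pl lines. A minimal sketch, assuming the log has the same line layout as the excerpt above; the function name pasa_cpu_values is made up for illustration and is not part of funannotate or PASA:

```shell
# Print each distinct --CPU value found on Launch_PASA_pipeline.pl lines of a
# log file. "pasa_cpu_values" is a hypothetical helper, not a funannotate command.
pasa_cpu_values() {
    local log="$1"
    grep 'Launch_PASA_pipeline.pl' "$log" \
        | grep -o -- '--CPU [0-9][0-9]*' \
        | awk '{print $2}' \
        | sort -un
}

# Usage (path is an example):
# pasa_cpu_values funannotate-update.log
```

This only reports what was passed on the command line; it does not explain why --cpus 1 became --CPU 2, which the thread leaves open.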

hyphaltip commented 5 years ago

One thing: PASA requires a symlink named fasta that points to fasta36; make sure it is in your PATH.
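
The symlink step can be sketched as follows. This is an illustration under assumptions: the helper name link_fasta36 and the $HOME/bin location are invented here, and the real location of fasta36 depends on how it was installed on your system.

```shell
# Create a "fasta" symlink in a chosen directory, pointing at an existing
# fasta36 executable. "link_fasta36" is a hypothetical helper name.
link_fasta36() {
    local target="$1"   # full path to the fasta36 executable
    local bindir="$2"   # directory (on PATH) where the "fasta" link should live
    [ -x "$target" ] || { echo "not executable: $target" >&2; return 1; }
    mkdir -p "$bindir" || return 1
    ln -sfn "$target" "$bindir/fasta"
}

# Usage ($HOME/bin is only an example location):
# link_fasta36 "$(command -v fasta36)" "$HOME/bin"
# export PATH="$HOME/bin:$PATH"
# command -v fasta    # should now resolve to the new link
```

Putting the link in a personal directory avoids needing write access to the shared install, which may matter on a cluster.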

Jason Stajich, PhD

lixx3627 commented 5 years ago

Hi Jason,

Thank you for your suggestion. I followed your advice and made a symbolic link named fasta that points to fasta36. But now I'm hitting the second error I mentioned above, "database is locked".

[12/21/18 21:11:56]: PASA database is SQLite: ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_fun_evidence/training/pasa/21_0
[12/21/18 21:11:56]: /panfs/roc/msisoft/pasa/2.3.3/bin/cdbfasta xxx_predict/whole_assembly_only/update_misc/genome.fa
[12/21/18 21:11:56]: 410 entries from file xxx_predict/whole_assembly_only/update_misc/genome.fa were indexed in file xxx_predict/whole_assembly_only/update_misc/genome.fa.cidx

[12/21/18 21:11:56]: Running PASA annotation comparison step 1
[12/21/18 21:11:56]: /panfs/roc/msisoft/pasa/2.3.3/Launch_PASA_pipeline.pl -c ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/pasa/annotCompare.txt -g ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/genome.fa -t ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/trinity.fasta.clean -A -L --CPU 2 --annots ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/genome.gff3
[12/21/18 21:12:27]: -connecting to SQLite db: ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_fun_evidence/training/pasa/21_0
-*** Running PASA pipeine:
* Running CMD: /panfs/roc/msisoft/pasa/2.3.3//scripts/Load_Current_Gene_Annotations.dbi -c ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/pasa/annotCompare.txt -g ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/genome.fa -P ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/genome.gff3  > pasa_run.log.dir/output.annot_loading.17822.out
DBD::SQLite::db do failed: database is locked at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/DB_connect.pm line 221.
failed query: <insert into annotation_admin (date) values (CURRENT_TIMESTAMP)>  values: 
**Errors: database is locked**
 at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/DB_connect.pm line 233.
        DB_connect::RunMod(DB_connect=HASH(0x215f4b8), "insert into annotation_admin (date) values (CURRENT_TIMESTAMP)") called at /panfs/roc/msisoft/pasa/2.3.3//scripts/Annotation_store_preloader.dbi line 52
**Sorry, couldn't retrieve an ID for the annotation version.** :( 

Error, cmd: /panfs/roc/msisoft/pasa/2.3.3//scripts/Load_Current_Gene_Annotations.dbi -c ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/pasa/annotCompare.txt -g ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/genome.fa -P ~/Analysis/xxx_Jan/xxx_annotation/xxx_annotation/xxx_predict/whole_assembly_only/update_misc/genome.gff3  > pasa_run.log.dir/output.annot_loading.17822.out died with ret 2816 No such file or directory at /panfs/roc/msisoft/pasa/2.3.3//PerlLib/Pipeliner.pm line 186.
        Pipeliner::run(Pipeliner=HASH(0x188faf0)) called at /panfs/roc/msisoft/pasa/2.3.3/Launch_PASA_pipeline.pl line 1044

[12/21/18 21:12:27]: PASA failed, check log, exiting

It seems something is wrong with the SQLite database: it is trying to use the pasa_db generated by the PASA step when I ran funannotate train.

Do you have suggestions on this kind of error?

Thanks a lot for your help!

Feng

nextgenusfs commented 5 years ago

Try to run the PASA command in the terminal directly and give it a single CPU. I don’t think that will fix it, but it’s worth a try. I think the SQLite database might have an incomplete entry or something related to the previous run. Did you run funannotate train previously? And did that run without error?
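
A rough sketch of what running the command directly with a single CPU could look like. The flags are copied from the Launch_PASA_pipeline.pl line in the log earlier in this thread, with only --CPU changed to 1; the helper name pasa_compare_cmd is invented here, and it only prints the command so it can be checked before running.

```shell
# Print the annotation-comparison command with --CPU forced to 1.
# Flags mirror the log line quoted in this thread; all paths are arguments.
# "pasa_compare_cmd" is a hypothetical helper; paths containing spaces would
# need extra quoting before the printed command is pasted into a shell.
pasa_compare_cmd() {
    local pasahome="$1" config="$2" genome="$3" transcripts="$4" annots="$5"
    printf '%s -c %s -g %s -t %s -A -L --CPU 1 --annots %s\n' \
        "$pasahome/Launch_PASA_pipeline.pl" \
        "$config" "$genome" "$transcripts" "$annots"
}

# Usage (paths are placeholders in the style of the log):
# pasa_compare_cmd "$PASAHOME" update_misc/pasa/annotCompare.txt \
#     update_misc/genome.fa update_misc/trinity.fasta.clean update_misc/genome.gff3
```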

lixx3627 commented 5 years ago

Hi Jon, yes, I ran funannotate train previously, but the train step would always get stuck at PASA, so I usually restart train from the PASA step with 2 CPUs and it finishes successfully. You are right that running the PASA command alone with a single CPU didn't solve the problem; I got the same "Errors: database is locked" error.

nextgenusfs commented 5 years ago

You guys can’t run MySQL, correct? Not sure if there is something else wrong or not; can you try to run the PASA sample data and see if it completes?

lixx3627 commented 5 years ago

Hi Jon, no, I was just told we only have sqlite-PASA, not mysql-PASA, on our system.

I still don't have all the details of how it was run, but I wonder if the PASA config file could have caused the issue? I've been looking at the code for the PASA annotation comparison and the funannotate update script. Based on the log file, the PASA config file that the update script uses is always annotCompare.txt. I wonder if it should be using alignAssembly.txt for Load_Current_Gene_Annotations.dbi?

I also tried to run the PASA sample data and it failed:

module load funannotate/1.5.0
export PASAHOME=/panfs/roc/msisoft/pasa/2.3.3/
bash ./runMe.SQLite.sh

The sample data run seems to have finished "Comparing Annotations to Alignment Assemblies" and "Running Analysis of Alternative Splicing", but failed in "Finding ORFs in PASA assemblies". The end of the stderr output is:

* Running CMD: ../scripts/pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta sample_mydb_pasa.sqlite.assemblies.fasta --pasa_transcripts_gff3 sample_mydb_pasa.sqlite.pasa_assemblies.gff3
Error, cmd: /panfs/roc/groups/7/figueroa/lixx3627/src/PASApipeline/sample_data/PASApipeline/scripts/../pasa-plugins/transdecoder/TransDecoder.LongOrfs -t sample_mydb_pasa.sqlite.assemblies.fasta  died with ret -1 at ../scripts/pasa_asmbls_to_training_set.dbi line 148.
Error, cmd: ../scripts/pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta sample_mydb_pasa.sqlite.assemblies.fasta --pasa_transcripts_gff3 sample_mydb_pasa.sqlite.pasa_assemblies.gff3 died with ret 512 No such file or directory at /panfs/roc/groups/7/figueroa/lixx3627/src/PASApipeline/sample_data/PASApipeline/sample_data/../PerlLib/Pipeliner.pm line 186.
        Pipeliner::run(Pipeliner=HASH(0x1f266e8)) called at ./__run_sample_pipeline.pl line 217 

Do you have suggestions to troubleshoot this? Thank you so much for your help!

Feng

lixx3627 commented 5 years ago

Jon, never mind the previous note I left. I solved the issue by copying the original SQLite database to another directory, such as /tmp. The "Errors: database is locked" problem occurred because when I restarted the PASA annotation comparison, it picked up the "updated but interrupted" SQLite database, whose folder also included a SQLite rollback journal from the earlier run, so the program reported the database as locked when it tried to use this unfinished database. It has now passed the first round of PASA and is on "Running PASA annotation comparison step 2". Hopefully it will finish successfully! Thank you for your help!
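
The workaround described above might be scripted roughly like this. It is a sketch under assumptions: the helper name stage_sqlite_db is invented, and the only SQLite detail relied on is that a rollback journal sits next to the database file with "-journal" appended to its name. The thread does not say exactly which files were copied or how PASA was pointed at the copy, so treat this as one possible reading.

```shell
# Copy a SQLite database file into a fresh working directory, leaving any
# leftover "<db>-journal" file behind, and print the path of the copy.
# "stage_sqlite_db" is a hypothetical helper, not part of PASA or funannotate.
stage_sqlite_db() {
    local db="$1" workdir="$2"
    [ -f "$db" ] || { echo "no such database file: $db" >&2; return 1; }
    if [ -e "${db}-journal" ]; then
        # A leftover journal suggests an earlier run was interrupted mid-write.
        echo "warning: leftover rollback journal at ${db}-journal" >&2
    fi
    mkdir -p "$workdir" || return 1
    cp -p "$db" "$workdir/" || return 1
    printf '%s\n' "$workdir/$(basename "$db")"
}

# Usage (paths are examples):
# copy=$(stage_sqlite_db training/pasa/mydb.sqlite /tmp/pasa_work)
# sqlite3 "$copy" 'PRAGMA integrity_check;'   # optional sanity check on the copy
```

A database copied without its journal after an interrupted write may be inconsistent, so if a copy from before the interrupted run exists, or the database can be regenerated, that is the safer starting point.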

EarlyEvol commented 5 years ago

In the end I believe this was just an HPC I/O issue. Not sure, though. I regenerated the PASA SQLite DB and reran funannotate update on a copy of the database. This was still super slow. Then I moved the DB to the node's /tmp, and this also ran super slow. That kind of argues against a storage I/O problem, but maybe read requests from SQLite get bogged down somehow.

Anyway, a colleague got the dockerfile running on his workstation, and that churned through update on his dataset very quickly. When I need to run update again, I'll just do it on his server. I guess whatever my issue is, it's probably specific to my HPC. Thanks for all the advice!

bioinfouser123 commented 3 years ago

Hi! I am trying to use SQLite for PASA because I am on a cluster that does not allow me to use MySQL. I set DATABASE=<__SQLite__> in the pasaalignAssembly.txt file after creating a SQLite.db database using sqlite3. I found that with a SQLite database, the program does not need the config.txt file. But I keep getting an error that I can't interpret. I am fairly new to MySQL terminology. The current error is: install_driver(mysql) failed: Can't locate DBD/mysql.pm in @INC (you may need to install the DBD::mysql module). Any help or suggestion is much appreciated. Many thanks in advance.
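
Two quick checks that could help narrow down an error like this one. The error message shows Perl trying to load DBD::mysql, so it may help to see which database drivers the Perl on PATH can load and what the DATABASE line in the config actually contains. Both helper names below are invented for illustration. Whether PASA expects DATABASE to be a path to the SQLite file rather than a placeholder like <__SQLite__> is an assumption that this thread does not confirm; check the PASA documentation.

```shell
# Succeed if the named Perl module can be loaded by the perl found on PATH.
# "has_perl_module" is a hypothetical helper name.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Print the value of the first DATABASE= line in a config file.
# "pasa_db_setting" is a hypothetical helper name.
pasa_db_setting() {
    sed -n 's/^DATABASE=//p' "$1" | head -n 1
}

# Usage (config path is an example):
# has_perl_module DBD::SQLite && echo "DBD::SQLite is available"
# has_perl_module DBD::mysql  || echo "DBD::mysql is missing"
# pasa_db_setting alignAssembly.txt
```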

nextgenusfs commented 3 years ago

Can you please open a new issue with the funannotate commands you have tried to run? It's not easy to see issues when you add to ones that are already closed.