Computational requirements when picking up funannotate train after timeout during PASA step

janniksven commented 1 year ago

Dear team behind funannotate,

thanks a lot for doing a great job of simplifying de-novo genome annotations with this pipeline.

I ran funannotate train without major problems (using Singularity). However, as other users have experienced before, my cluster does not allow the use of MySQL. With the genomes that I am planning to annotate (quite fragmented, > 30,000 scaffolds, > 200 Mbases), I quickly run into timeouts in my cluster environment during the single-threaded SQLite step.

Will picking up the job (which failed during the SQLite step due to a timeout) reduce the computational requirements to a level where I can run the rest of the job outside a cluster environment (e.g. 4 cores, 16 GB RAM)?

It does not really matter here if it takes a day or two longer because of running it outside a cluster environment, but is it likely that the script will then run out of memory or take too long (weeks) after the SQLite step if provided with fewer computational resources?

Kind regards!

hyphaltip commented 1 year ago

Just to check: we run MySQL as a Singularity instance on our cluster. So you can start up a MySQL instance on the node you are running funannotate on (or on another node as a long-running job) and connect to it from the funannotate job.
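As a rough illustration of the "connect to it from the job" part: before launching funannotate, the job script could first confirm that the MySQL port is reachable from the worker node. The sketch below is an assumption-laden helper, not part of funannotate; the function name `wait_for_port` is hypothetical, and 3306 is simply MySQL's default port, which your instance may not use. It only tests TCP reachability, not credentials or database permissions.

```python
import socket
import time


def wait_for_port(host: str, port: int = 3306,
                  timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout`
    seconds have elapsed. The default port 3306 is MySQL's default
    and is an assumption about how the instance was configured.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a TCP connection to the server.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, host unreachable, or attempt timed out.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("mysql-node")` at the top of the job (where `mysql-node` is a placeholder for whichever host runs the MySQL instance) would let the job fail early with a clear message if the cluster's network setup blocks the connection, rather than failing partway through the PASA step.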

My impression has been that some of the funannotate train steps don’t work as well with transcripts alone, because it wants to do the abundance filtering with kallisto. Maybe we should test that part of the workflow better. But I think you can still give it prebuilt clustered transcripts and see.

janniksven commented 1 year ago

Thanks for the quick reply!

This sounds like a great idea, leveraging the power of containers even more. I will certainly try this! Any caveats to be aware of? Would you reckon that setting up the MySQL users in the MySQL container and creating a simple alias inside the funannotate container, such as mysql="singularity exec mysql.sif mysql", would be enough?

The idea of giving it prebuilt clustered transcripts came up before I became aware that funannotate is able to pick up at the PASA step using previously computed outputs. The plan would then be to move the output (everything up to the PASA step) to the smaller server, start the MySQL backend there and run the rest using fewer resources. Since I prefer your suggestion and would rather run everything on the cluster, this remains just the back-up plan.

janniksven commented 1 year ago

Accessing a containerised, running MySQL instance from a worker node proved too difficult on my cluster, so I ran the remaining parts of funannotate train on a smaller server (12 CPUs, 120 GB RAM).

nextgenusfs / funannotate

Computational requirements when picking up funannotate train after timeout during PASA step #860