nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Need help running it on SLURM #942

Open Nitin123-4 opened 11 months ago

Nitin123-4 commented 11 months ago

Hi team,

As we know, funannotate runs with multiple CPUs, and a lot of the steps in between run in parallel. It often fails with memory-related errors.

Can someone recommend the correct SLURM/SBATCH settings for running the train/predict/annotate commands?

Thanks.

hyphaltip commented 11 months ago

The honest answer is that it depends: on the size of the genome, proteome, and transcript files.

The training step will be memory intensive if you have large amounts of RNA-seq; that is driven by Trinity, and by PASA if you have a lot of transcript complexity.

The predict step needs less memory, but the DIAMOND alignments or the different tools may be more or less efficient. The annotate step can take a lot of memory to process the InterProScan XML because of the way the Python XML parser is used.

Generally I run with 24-32 GB for regular jobs and 128-256 GB for training. It's pretty empirically driven and depends on those factors and the input data size.
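
If it helps, here is a minimal SBATCH sketch along those lines -- the resource numbers are just the starting points above, and the file names are placeholders for your own data:

```bash
#!/bin/bash
#SBATCH --job-name=funannotate-train
#SBATCH --cpus-per-task=24
#SBATCH --mem=128G              # drop to ~32G for predict/annotate jobs
#SBATCH --time=72:00:00

# placeholder inputs -- substitute your own genome/reads/species
funannotate train -i genome.fasta -o fun_out \
    --left rnaseq_R1.fq.gz --right rnaseq_R2.fq.gz \
    --species "Genus species" \
    --cpus $SLURM_CPUS_PER_TASK
```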

Nitin123-4 commented 11 months ago

Thanks for your response.

I can see that in the annotate step phobius uses all the available CPUs, and even though I am giving it 150 GB RAM it still breaks with a memory error. It's a bit confusing how exactly we should set up the SLURM job for an 80-CPU, 256 GB RAM machine.

nextgenusfs commented 11 months ago

Lower the number of CPUs; it's unlikely to run much faster than with about 24 CPUs anyway. The multiprocessing steps in Python do not share memory, so the memory footprint is effectively multiplied by the number of CPUs you run with. If you have a large genome and it's loaded into memory, that could be asking for a lot of RAM. I don't have an HPC, but I've never had an issue with 24 CPUs / 256 GB of RAM. Granted, that is almost exclusively fungal genomes.
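
Something like this, i.e. request fewer cores than the node has and pass the same number to funannotate rather than letting it grab everything (a sketch, with placeholder names):

```bash
#SBATCH --cpus-per-task=24
#SBATCH --mem=150G

# cap the worker pool explicitly; peak memory scales with --cpus
funannotate annotate -i fun_out --cpus 24
```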

@hyphaltip interesting that the InterProScan parsing is using a lot of RAM. I don't have an issue running it on my MacBook with 16 GB of RAM... at least I can't recall any memory errors.

hyphaltip commented 11 months ago

It's something I can try to put together a test case to reproduce; otherwise, a job that should only need 24-32 GB of RAM ends up needing 128+. I can set up some tests. It may have had to do with the rewriting / look-ahead step for the broken XML header in some versions of iprscan? I think there is also an XML-to-TSV converter in iprscan; I wonder if that would be more reliable?
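
If I remember right, iprscan has a convert mode that can re-emit an existing XML result in another format, something like the following (flags from memory, so worth checking against `interproscan.sh --help` for your version):

```bash
# convert an existing XML result to TSV without re-running the analyses
interproscan.sh --mode convert --input results.xml --formats TSV --outfile results.tsv
```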

nextgenusfs commented 11 months ago

Okay, I don't want to include InterProScan scripts directly, but maybe we just push folks to using the JSON output and show them how to convert XML to JSON -- that would probably be the easiest to parse. But maybe you are right that TSV is lower memory, because you can go line by line through the file, whereas the entire XML object gets loaded at once (the same would probably be true for JSON). But even then it shouldn't occupy that much memory, i.e. not much more than the file size.
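
To illustrate the line-by-line point, a TSV result can be streamed with constant memory no matter how big the file is (a sketch; the column positions are from the iprscan TSV layout as I remember it, so worth verifying):

```bash
# stream annotations one line at a time: protein, analysis, signature, InterPro accession
awk -F'\t' '{print $1, $4, $5, $12}' results.tsv | head
```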