Very large genome polishing with limited CPU time

alelim-bio commented 5 years ago

Hello Nanopolish,

I was wondering if I could get your assistance with a computing dilemma we are having. To give some background, we currently have a very large plant genome assembly totaling 343,222 contigs and covering approximately 13.5 GB. Our basecalled .fastq file is approximately 487 GB in size. We wish to polish our genome; however, even with a test set of just two flow cells (29 GB in total), polishing takes longer than one day to finish 1,000 contigs on a dual Intel “Skylake” 6130 node. At this rate, I feel it will be infeasible once we incorporate the whole dataset.

Could you offer some advice on solving this problem, or do you have any experience with a polishing project this large?

Thank you for your time. Kind Regards,

Alex

jts commented 5 years ago

Sorry, but I don't think it is feasible to use nanopolish for this project. Try medaka, which is much faster.

Jared

alelim-bio commented 5 years ago

Hello Jared,

Thank you for your answer. As a follow-up question: we have access to three different supercomputers, so would it be feasible as a multi-node job? I believe the largest node we can access has 704 threads, or we can use multiple Skylake nodes.

Additionally, if we reduced the number of .fastq files used for polishing, for instance to half the files, would it be feasible to polish with nanopolish? We would like to get the best assembly possible, and I believe medaka doesn't polish as well as nanopolish because it doesn't incorporate signal-level information.

Kind Regards,

Alex

jts commented 5 years ago

How many nodes do you have access to?

alelim-bio commented 5 years ago

Hello Jared,

We have access to around 65 nodes, each with 32 Skylake Xeon cores.

Kind Regards,

Alex
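
For a rough sense of scale, the figures quoted in this thread can be turned into a back-of-envelope estimate. This is only a sketch under loud assumptions: the function name is hypothetical, the rate is taken as exactly 1,000 contigs per node per day (the thread says it took longer than that, so the result is optimistic), every contig is assumed to cost the same, and throughput is assumed to scale linearly with node count. It covers only the 29 GB test read set, not the full 487 GB.

```python
def estimate_days(total_contigs, contigs_per_node_day, nodes):
    """Return (node_days, wall_clock_days) assuming perfect linear scaling."""
    # Total work expressed in node-days at the observed single-node rate.
    node_days = total_contigs / contigs_per_node_day
    # Ideal wall-clock time if the work divides evenly across all nodes.
    wall_days = node_days / nodes
    return node_days, wall_days


# Figures from the thread: 343,222 contigs, ~1,000 contigs/day/node, ~65 nodes.
node_days, wall_days = estimate_days(343_222, 1000, 65)
print(f"{node_days:.0f} node-days, about {wall_days:.1f} days on 65 nodes")
```

Under these assumptions the single-node run would take roughly a year, while 65 nodes would bring it down to under a week for the test read set. Real runs will differ, since contig lengths and coverage vary.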

jts commented 5 years ago

Ok, it is worth trying then. I suggest using 4 threads per process, and as many processes as you are able to run.

Jared
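
The suggestion above (4 threads per process, as many processes as possible) can be sketched as a simple layout calculation. This is an illustration only, not nanopolish tooling: the function names and contig names are hypothetical, the node and core counts are the ones quoted in the thread, and batching contigs round-robin by count is an assumption that ignores differences in contig length.

```python
def plan_processes(nodes, cores_per_node, threads_per_process=4):
    """Return (processes_per_node, total_processes) for a fixed thread count."""
    per_node = cores_per_node // threads_per_process
    return per_node, per_node * nodes


def split_round_robin(items, n_batches):
    """Deal items into n_batches lists in turn; empty batches are dropped."""
    batches = [[] for _ in range(n_batches)]
    for i, item in enumerate(items):
        batches[i % n_batches].append(item)
    return [b for b in batches if b]


# 65 nodes with 32 cores each, 4 threads per process (figures from the thread).
per_node, total = plan_processes(nodes=65, cores_per_node=32)

# Hypothetical contig names standing in for the 343,222 contigs of the assembly.
contigs = [f"contig_{i}" for i in range(343_222)]
batches = split_round_robin(contigs, total)

print(f"{per_node} processes per node, {total} processes in total")
print(f"{len(batches)} batches, largest has {max(len(b) for b in batches)} contigs")
```

Each batch would then be handed to one 4-thread polishing process. Balancing batches by total sequence length rather than contig count would likely give more even run times, but that refinement is left out here.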

jts/nanopolish, issue #589