jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Why is STEP4 taking longer to run after stricter QC and host read removal? #901

Open SamBrutySci opened 2 weeks ago

SamBrutySci commented 2 weeks ago

Hi,

I ran the pipeline a little while ago using pretty light QC on my reads and without removing host reads. I have a few different sample types, each of which I'm running as a separate coassembly. Each sample type (6 metagenomes per sample type) took about 10 days to run through the full pipeline.

Now, after switching to a more stringent QC approach and filtering out host reads before running SqueezeMeta (which reduces reads by about 30-40%), STEP4 is taking far longer to run. So far 6 out of 11 sample types have finished the full pipeline within 20 days, and some are still running (stuck on STEP4 for ~17-21 days).

I was wondering if you had any suggestions as to why more stringent QC and host read removal would increase the STEP4 run time so much?

In case it has any impact: these samples were started at the same time as the one in issue #893, where the multiple starts/stops were likely the issue. However, I have tried re-running these samples in a completely new project/directory and the hanging on STEP4 persists.

Thanks so much for all your help with my issues!

SamBrutySci commented 2 weeks ago

Attaching the diamond.nr.log for one sample that finished in a reasonable time and for one that is stuck in STEP4:

Sample that finished -- diamond.nr.log
Sample stuck on STEP4 for ~18 days -- diamond.nr.log

fpusan commented 2 weeks ago

Hard to tell, but the second one took 4 times longer to load the query sequences, so maybe it just has more ORFs?

What's the number of contigs/ORFs before and after QC / host removal for the same samples? You can find those fasta files in `project/results/01.*.fasta` and `project/results/03.*.faa`.
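
If it helps, a quick way to get those counts is to count FASTA headers; a minimal sketch, assuming a project directory called `myproject` (substitute your own project name):

```bash
# Count contigs (assembly) and predicted ORFs by counting FASTA headers
grep -c ">" myproject/results/01.myproject.fasta
grep -c ">" myproject/results/03.myproject.faa
```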

jtamames commented 2 weeks ago

A more stringent QC can improve the assembly by removing low-quality sequences that would otherwise hinder it. Hence more contigs, more ORFs, and longer processing times. Nevertheless, 18 days is too much. What kind of computer are you using? Best, J

SamBrutySci commented 2 weeks ago

This is all running on an HPC with 350 GB RAM and 32 cores, which is exactly the same allocation as previously. They're pretty big samples, around 12.5 gigabases per sample (6 samples per coassembly).

For the sample still running:

Before strict QC:
01.Cameor.fasta contains 6745493 sequences
03.Cameor.faa contains 8117038 sequences

After strict QC:
01.Cameor.fasta contains 3634824 sequences
03.Cameor.faa contains 5170450 sequences

For the one that finished:

Before strict QC:
01.0015.fasta contains 7662580 sequences
03.0015.faa contains 8008910 sequences

After strict QC:
01.0015.fasta contains 1903809 sequences
03.0015.faa contains 2592522 sequences

fpusan commented 2 weeks ago

So you indeed have fewer sequences now, even though it's taking much longer. Any chance your HPC filesystem is more strained now than it was before? If there is latency when accessing the database, DIAMOND can take significantly more time to run. An easy way to tell whether you are IO-limited is to look at CPU usage: when it is not loading data, DIAMOND will use all the CPUs you throw at it at 100%, but if it starts having to wait for database reads, the average CPU usage will drop.
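
As a rough illustration of what to look for (the exact tools depend on what the cluster exposes; iostat ships with the sysstat package and may not be installed):

```bash
# Snapshot DIAMOND's CPU usage: with 32 threads it should sit near 3200% when it is not IO-bound
top -b -n 1 | grep -i diamond

# Extended disk stats every 5 seconds; sustained high %iowait or await points to a filesystem bottleneck
iostat -x 5
```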

SamBrutySci commented 2 weeks ago

I've been keeping an eye on CPU usage and it's pretty stable at 100% throughout the day. I've contacted the HPC admins to see if they have any ideas/solutions!

Do you think I can speed the runtime up by chucking more CPUs at it? Not exactly an elegant fix, but I want the jobs done! Would I have to restart from STEP1 if I change the number of CPUs?

In the meantime, do you guys have any other ideas/things to check? No worries if not; we can wait and see what the admins come back with on my end.

fpusan commented 2 weeks ago

It's hard to tell, and it seems strange that it takes longer with a smaller dataset. Maybe the DIAMOND developer will have some insight. If you have 350 GB of RAM you can probably increase the DIAMOND block size a lot (the -b parameter when calling SqueezeMeta). Increasing this will speed things up at the cost of memory usage. By default we calculate it based on the available RAM, but we cap it at 15 IIRC because we've had issues with DIAMOND running out of memory before.
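
For illustration, a call along these lines would raise the block size manually; this is only a sketch with placeholder project/sample names, and a larger -b will increase DIAMOND's memory footprint, so keep an eye on RAM:

```bash
# Hypothetical coassembly run with 32 threads and a manually raised DIAMOND block size (-b)
SqueezeMeta.pl -m coassembly -p myproject -s samples.tsv -f raw_reads/ -t 32 -b 24
```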