jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

STEP10 - frozen at sqm_counter #695

Closed dgarrs closed 1 year ago

dgarrs commented 1 year ago

Hello,

I'm seeing some weird behavior from SQM in step 10. The first time, it stopped with a “Program finished abnormally” while executing sqm_counter on the first sample (around line 661 in the syslog). After that I managed to restart from step 10, but each time the terminal and the subprocesses freeze, and I have to kill the process and start again. I thought it might be due to RAM and threads, so I have been playing around reducing the number of threads, but any number between 10 and 46 produces the same result: the terminal freezes at some point. Maybe I'm too impatient? I kill the process because the CPU usage suddenly drops to 0 while reads are being counted, none of the disks are being read or written (not even after 1 day), and the RAM usage remains high but nowhere near the maximum. In the attached image you can see that the samtools view processes have a status of “D” (uninterruptible sleep, usually IO).

[Picture1: screenshot of the process list, showing the samtools view processes in “D” state]
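
For anyone hitting something similar, a quick way to confirm that the hang is IO-bound rather than CPU- or RAM-bound would be something like this (a sketch using standard Linux tools; iostat comes with the sysstat package):

# List processes stuck in uninterruptible sleep ("D" state)
ps -eo state,pid,ppid,etime,cmd | awk '$1 == "D"'
# Watch per-device throughput and %iowait; near-zero throughput while
# processes sit in D state points at a blocked or very slow filesystem
iostat -xz 5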

In addition, I commented out lines 73 to 100 of the 10.mapsamples.pl script to avoid having to rebuild the mapping reference every time. I don't know if that could be the issue? Also, every time I restart step 10 I have to manually delete the intermediate files 10.project.contigcov and 10.project.mapcount; otherwise the step is skipped as if it had already been completed.
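
For reference, the manual cleanup before a restart looks roughly like this (a sketch; where the intermediate files live depends on your SqueezeMeta version and project layout, here assumed to be the project's intermediate/ directory, with "project" standing in for the actual project name):

# Remove the step-10 outputs so the restart does not skip the step
# (paths and project name are placeholders; adjust to your own setup)
rm /path/to/project/intermediate/10.project.contigcov
rm /path/to/project/intermediate/10.project.mapcount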

I'm thinking it might have to do with samtools view struggling to retrieve information from very heavy BAM files, since those are the frozen processes. So I've also been playing around adding more threads to samtools view (adding the option “-@ 4” to lines 318 and 421 of 10.mapsamples.pl) while reducing the global threads for the project to 10. Still the same result.
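
For context on the “-@ 4” tweak: it only adds extra BAM (de)compression threads to samtools view, so it won't help if the processes are blocked on disk IO. A minimal illustration with placeholder file and contig names:

# Count reads overlapping one contig, using 4 extra (de)compression threads
samtools index sample1.bam                      # region queries need an index
samtools view -@ 4 -c sample1.bam contig_000001 # placeholder BAM and contig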

I've been monitoring RAM usage and it doesn't seem to be the problem. However, the samples I'm analyzing are quite complex (soil metagenomes). There are 9 samples in total. The first 3 have around 70 M reads each (no problem processing these; their BAM files are about 3 GB), while the last 6 have close to 1 billion reads each (BAM files of around 40 GB). Reads are 150 bp. There are 3.8 million contigs being processed, spanning 10.6 billion bp in total. The problems usually start with the heavy samples.

I assembled the samples outside SQM using megahit with normalized reads (to a maximum k-mer coverage of 50); the non-normalized reads are the ones used for mapping in SQM. I did two separate coassemblies (one with 6 samples, one with the other 3) and then merged both using CD-HIT-EST (also outside SQM); I could not run minimus2 because of the huge number of resulting contigs. Contigs were filtered at a minimum length of 1000 bp and provided to SQM as an external assembly.
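
In case it is useful, the preparation outside SQM looked roughly like the sketch below (placeholder file names and typical parameters; the cd-hit-est settings and the SqueezeMeta external-assembly option should be checked against the versions you have installed):

# Two separate co-assemblies with megahit on the normalized reads
megahit -1 groupA_R1.fq.gz -2 groupA_R2.fq.gz -o megahit_groupA -t 46
megahit -1 groupB_R1.fq.gz -2 groupB_R2.fq.gz -o megahit_groupB -t 46
# Merge the two contig sets and collapse redundancy with CD-HIT-EST
cat megahit_groupA/final.contigs.fa megahit_groupB/final.contigs.fa > merged.fa
cd-hit-est -i merged.fa -o merged_nr.fa -c 0.95 -n 10 -M 0 -T 46
# Keep contigs of at least 1000 bp (seqkit assumed to be available)
seqkit seq -m 1000 merged_nr.fa > contigs_min1000.fa
# Hand the filtered contigs to SqueezeMeta as an external assembly
# (flag names as in recent SqueezeMeta releases; check the manual for yours)
SqueezeMeta.pl -m coassembly -p project -s samples.tsv -f fastq_dir -extassembly contigs_min1000.fa -t 46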

I am using Ubuntu 18.04 running in WSL2 on Windows 10, with 48 CPUs and 383 GB of RAM (only 300 GB allocated to WSL, but this could be increased if needed).

I attach the syslog, although it is a bit chaotic since I've been restarting step 10 multiple times.

Any advice will be helpful. Thanks!

syslog.zip SqueezeMeta_conf.zip

fpusan commented 1 year ago

I see that the project is hosted on the Windows drive F:\ (not inside your WSL2 virtual disk). Is this an external drive? In any case, even if it is an internal drive, accessing the Windows filesystem from within WSL2 carries a performance penalty (see e.g. https://github.com/microsoft/WSL/issues/4197). Maybe that's also a factor at play here.

dgarrs commented 1 year ago

The F: drive is not an external one. Given the large amount of data, I cannot store the WSL2 virtual disk on the same drive as the one used for data processing.

I just ran a filesystem performance test on both drives, obtaining the following:

keel9000@DESKTOP-TTU69EU:~$ dd if=/dev/zero of=/mnt/f/testfile bs=1M count=1000 status=progress
968884224 bytes (969 MB, 924 MiB) copied, 2 s, 484 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.16161 s, 485 MB/s

keel9000@DESKTOP-TTU69EU:~$ sudo dd if=/dev/zero of=/testfile bs=1M count=1000 status=progress
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.582882 s, 1.8 GB/s

fpusan commented 1 year ago

Still not convinced this is telling us the whole story.

I think the command you used for testing does only one big sequential write, which means you only pay the latency cost once. If that latency is significant when accessing the Windows disk, you will feel a much bigger effect in a real use case with lots of read/write operations happening at the same time (particularly when many threads attempt them in parallel).
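
A test closer to what step 10 actually does (many concurrent, smaller reads and writes) would be something like the sketch below, assuming fio is installed; the idea is to run the same job once against /mnt/f and once against the WSL ext4 filesystem and compare:

# 4 jobs doing 4 KiB random reads against a 1 GiB file for 60 s
# (drop --direct=1 if the target filesystem rejects O_DIRECT)
fio --name=randread-test --filename=/mnt/f/fio_testfile --size=1G --rw=randread --bs=4k --numjobs=4 --direct=1 --time_based --runtime=60 --group_reporting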

The official recommendation from Microsoft is to store the files in the WSL filesystem; doing otherwise "may significantly slow down performance."

https://learn.microsoft.com/en-us/windows/wsl/setup/environment#file-storage

dgarrs commented 1 year ago

Should I give it a try? Move everything to the WSL FS and restart from step 10?

fpusan commented 1 year ago

If you can fit it there, yes, I would try that. I also notice that you are very close to running out of memory, which may also be a reason why things are going slow (though it doesn't look like you are using swap yet in that screenshot).

dgarrs commented 1 year ago

Ok, I will try to move the WSL FS to another disk with more space and then move everything there (it might take me some time).
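
In case anyone needs the same, relocating the WSL2 virtual disk to a bigger drive can apparently be done with wsl.exe's export/import from the Windows side (a sketch; the distro name and target paths are placeholders, check wsl -l -v for the actual name):

# Run from Windows PowerShell
wsl --shutdown
wsl --export Ubuntu-18.04 D:\wsl-backup\ubuntu1804.tar
wsl --unregister Ubuntu-18.04
wsl --import Ubuntu-18.04 D:\WSL\Ubuntu-18.04 D:\wsl-backup\ubuntu1804.tar
# Note: after --import the default user may revert to root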

I realized after the first errors that WSL by default uses only half of the RAM, so I increased it to 300 GB. Since then there has been no swap usage and I'm always below 250 GB of RAM.
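
For reference, that limit is set in %UserProfile%\.wslconfig on the Windows side (the value below mirrors what I used; run wsl --shutdown afterwards for it to take effect):

# %UserProfile%\.wslconfig
[wsl2]
memory=300GB
# swap can also be capped explicitly, e.g. swap=32GB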

fpusan commented 1 year ago

Great, hopefully this improves things, otherwise let us know!

dgarrs commented 1 year ago

Sorry for the late response. Moving everything to the WSL file system worked: this step and the rest of the SQM steps completed without any further problems. Thanks a lot for the suggestion and for developing this amazing tool!

fpusan commented 1 year ago

Glad to hear it! Closing the issue.