Overlap jobs failing at trimming, Canu 1.9 on Linux Slurm

elbourret commented 4 years ago

This issue is continuing issue 1582 and is related to issue 1502. The issue 1582 was closed due to being "idle", but I was actually in contact with our cluster managers to try to pinpoint the source of the problem. There is a development.

The issue: brieftly, two overlap jobs are failing at trimming on Canu 1.9 on a Linux SLURM cluster, with the below error in results/trimming/1-overlapper/overlap.*_*.out:

Thread 05 writes    reads 627128-632382 (32876 overlaps 32876/292173/0 kmer hits with/without overlap/skipped)
writeToFile()-- After writing 1 out of 1 'ovFile::writeBuffer::bl' objects (8 bytes each): Cannot allocate memory

Allocating massive amounts of memory (340G) is not helping, and the job is only using ~6G of RAM. Several things were tried in the previous issue 1582, but the problem was not solved. It was suggested that the problem was due to lack of SWAP. This does not appear to be the case.

One of our cluster computer scientists went over the problem, and thinks the problem is related to multiple threads writing simultaneously to the same file. The problem appears to be solved by forcing the job to use a single thread with the flag -t 1.

He further commented: "I can think of 2 possible general causes -- 1 -- a bug in the overlapInCore routine during this file write. or 2 a filesystem problem, bug, or issue with the interaction between the filesystem (or filesystem server) and canu. I also made a test on our older bulk file system, and it appeared to succeed past the initial write step where it usually fails. This could indicate that there is a particular bug or issue between canu and the ceph filesystem, but it could also be a canu bug that only is manifested when using cephfs. I will ask my colleagues next week who maintain the filesystem whether they have any ideas."

I will try manually running the failing jobs with a single thread to proceed with the assembly, and report my success.

cmaggior commented 4 years ago

Since your job is failing at the trimming step, is this related to the k-mer counting (meryl-count)? If so, I'm receiving a similar error: "Kmer counting (meryl-count) jobs failed, tried 2 times, giving up."

skoren commented 4 years ago

This is unlikely to be related to k-mer counts and I doubt can be fixed within Canu.

Most of the jobs ran successfully except for two so I would expect the bug with multithreading would be exposed on more than two out of several hundred jobs. That coupled with the fact that the error is coming from the system call running out of memory makes me thing this is a cephfs bug. I still think it could be swap memory related as I suggested in #1582. It was showing free memory but no swap memory there as well so allocating massive memory isn't going to fix the swap issue. I'd guess cephfs is using the swap space for something and multi-threading might cause more load on the swap space and thus one thread succeeds while multiple threads fail, when the one thread job runs does the swap ever hit 0?

elbourret commented 4 years ago

Running the jobs with option obtovlThreads=1 solved the issue. I'll report back on what our cluster maintainers say about the possibility of an issue with cephfs.

Now, it is failing again, at unitigging meryl-count step with error:

Used 11.433 GB out of 13.000 GB to store   2977378475 kmers.
Used 11.558 GB out of 13.000 GB to store   3011972373 kmers.

Writing results to './Avthrum_canu2.11.meryl.WORKING', using 4 threads.
writeToFile()-- After writing 1 out of 1 'dataBlockLenMax' objects (8 bytes each): Cannot allocate memory
slurmstepd: Exceeded step memory limit at some point.
slurmstepd: Exceeded job memory limit at some point.

writeToFile()-- After writing 1 out of 1 'dataBlockLenMax' objects (8 bytes each): Cannot allocate memory

I guess this is a completely unrelated issue that should be fixed by giving more memory to the meryl-count job?

elbourret commented 4 years ago

The unitigging meryl-count issue actually appears related to the trimming overlapInCore issue.

I solved the meryl count issue by using a single thread (with option merylThreads=1). See my comment on issue 1603.

skoren commented 4 years ago

Can you capture more details for the failing and successful jobs. First, are you building Canu on your system or using a pre-compiled release? If build locally, which compiler and version? Can you get vmstat -s right after a failing job and right after a successful job with 1 thread?

elbourret commented 4 years ago

I am using Canu release v1.9. Our system is running Linux login0 4.15.0-60-generic 67~16.04.1-Ubuntu. Further information from the computer scientist that was helping me troubleshoot: "None of our compute nodes have swap (except hydra has a little bit for some hydra reason), so lack of swap space should not be an issue. It is not possible to rule out a cephfs bug, but it would be a bug that so far only has affected your jobs. I have not looked into the code enough to really comment on the possibility that maybe the multi-thread simultaneous writes to a single file are not synchronized in such a way that they not interfere with each other on this type of file system (which can behave differently from a single hard drive). That is my guess to the cause, but only a guess."

I submitted the problematic trimming overlapper jobs, using both 8 and 1 thread in OverlapInCore, followed immediately by a call to vmstat -s. I will report the results here when they finish. I will do the same of the failing meryl-count jobs, later.

brianwalenz commented 4 years ago

There's a potential code level fix for this over in issue #1603.

elbourret commented 4 years ago

Thank you @brianwalenz !

For some reason, my test jobs have not gone through the queue on our cluster yet, so I cannot report the requested vmstat -s on failing and successful jobs. Will done when I have the results.

However, my main assembly progressed to unitigging, and out of 952 overlapping jobs at unitigging step, 4 failed with apparently the same error. Here is the tail of one of the failing overlap jobs unitigging/1-overlapper/overlap.2436688_378.out:

Range: 1944528-2709330.  Store has 5675729 reads.
Chunk: 23901 reads/thread -- (G.endRefID=2709330 - G.bgnRefID=1944528) / G.Num_PThreads=4 / 8

Starting 1944528-2709330 with 23901 per thread

Thread 02 processes reads 1992330-2016230
Thread 00 processes reads 1944528-1968428
Thread 01 processes reads 1968429-1992329
Thread 03 processes reads 2016231-2040131
writeToFile()-- After writing 1 out of 1 'ovFile::writeBuffer::bl' objects (8 bytes each): Cannot allocate memory

I will try to recompile canu with the code level fix in issue #1603 and report here the results. Otherwise I guess that I can use utgovlThreads=1.

elbourret commented 4 years ago

The new code seems to have resolved my issues.

With the new code, the failing trimming overlap jobs are successful with 8 threads, and the failing unitigging overlap jobs have already run past the initial stage where they failed before, without having to force a single thread.

skoren commented 4 years ago

Closing this issue and will track with #1603. This should be fixed in the next release.

For the curious, it seems that there was never an issue within Canu or with the output files. The errno value was getting set by another process (likely the system since it had to re-allocate cache memory after the write) but Canu was picking that up as a write error.

marbl / canu

Overlap jobs failing at trimming, Canu 1.9 on Linux Slurm #1600