isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: this was the original repository and is no longer officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

Possible lock on RHEL 7.6 with FUTEX_WAIT_PRIVATE in strace #109

Open TomHarrop opened 5 years ago

TomHarrop commented 5 years ago

Hi,

I have a run of racon using Illumina reads for a third round of polishing on a Nanopore assembly that seems to have loaded the sequences into memory and then stopped. It's been sitting there at 0% CPU and 900 GB of RAM for 5 days.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              
 90090      me   20   0  904.8g 870.3g   1448 S   0.0 86.4   1256:19 racon                

Here's my call to racon:

racon -t 144 path/to/pe_reads.fq path/to/aln.sam path/to/flye_denovo_full.racon.fasta > output/flye_denovo_full.racon.fasta 2> logs/flye_denovo_full_racon.log

Here's the progress in the log:

[racon::Polisher::initialize] loaded target sequences
[racon::Polisher::initialize] loaded sequences
[racon::Polisher::initialize] loaded overlaps

Here's the output from strace:

 strace -p 90090
strace: Process 90090 attached
futex(0x5591deb4cae4, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 90090 detached
 <detached ...>
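If it helps, a fuller picture of where each thread is parked can be pulled with gdb by dumping a backtrace for every thread (a sketch, assuming gdb is available on the server):

    # attach to the stalled process and print a backtrace for every thread
    gdb -p 90090 -batch -ex 'thread apply all bt' > racon_threads.txt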

The SAM file is 481 GB, the reads are 354 GB (uncompressed fastq) and the assembly is 1.2 GB.

I've tried the run a few times and it always does the same thing. Strangely, I have completed a previous round of racon with the same fastq following the exact same alignment steps and the same racon call. I aligned the reads with bwa-mem (default parameters). The assembly is from flye, and has been polished once with the original Nanopore reads and once with the Illumina reads. I have seen the same behaviour with a canu assembly from the same data running racon on the same computer.

The server has 144 threads and 1 TB RAM, and is running RHEL 7.6 (kernel 3.10.0-862.11.6.el7.x86_64). I'm running racon release 1.3.2 in an ubuntu 18.10 singularity container. Previous runs with the same container have completed so I don't think that is the problem. I think the runs that did finish were on another server, with the same specs and same OS (same kernel version).

I'm a bit stumped, so I would appreciate any ideas. And my colleagues want me to give the RAM back :) Thanks!

rvaser commented 5 years ago

Hi Tom, when using Illumina data, memory consumption equals 1.5 × FASTQ size + PAF/SAM size + FASTA size + C, which in your case is around 1 TB, so maybe it's hitting swap? Quite odd that the process sits at ~900 GB of RAM and that you managed to run it once before. Were the file sizes different?
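For reference, a back-of-the-envelope check against the sizes in the OP (treating the constant C as negligible):

    1.5 * 354 GB (FASTQ) + 481 GB (SAM) + 1.2 GB (FASTA)
      = 531 GB + 481 GB + 1.2 GB
      ≈ 1013 GB, i.e. right at the machine's 1 TB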

Best regards, Robert

TomHarrop commented 5 years ago

Hi Robert,

To clarify, this machine has 1 TB total and racon was using ~900 GB. Swap wasn't being used, according to free, so I don't think I'm running out of RAM at this point.

It ran past the "aligned overlap x/y" stage on the other machine using the exact same files, but that machine ran out of memory because someone else was running a big job.

Thanks, Tom

rvaser commented 5 years ago

Hi Tom, I've had similar issues reported before, and a new one appeared yesterday on a different OS. I'll try to find the bug with the data that user provided, though I have no idea what the problem might be.

Best regards, Robert

TomHarrop commented 5 years ago

Thanks, I'll keep an eye on #110. You can close this if you like and I'll reopen it if necessary.

rvaser commented 5 years ago

Well, I installed the same CentOS in VirtualBox and was able to run it successfully. I have no idea why it stalls on your server or the one in the other issue :/

TomHarrop commented 5 years ago

Yes, it's strange that it works on one computer and not the other. BTW I've run racon_test successfully on both machines.

rvaser commented 5 years ago

Does the Illumina run stall at the same spot every run? Do you maybe have a smaller dataset that stalls as well? It is quite hard to debug with files this large.

TomHarrop commented 5 years ago

I can't tell if it's the exact same spot, but the log always stops where shown in the OP.

I ran it again to check for sure if it was hitting swap. Here's the output from free while it's stalled:

              total        used        free      shared  buff/cache   available
Mem:           1.0T        846G        158G         43M        2.7G        158G
Swap:           43G         23G         19G

I guess I should try splitting the sam/fastq into chunks and running racon on each chunk. If I get time to do that I will let you know how it goes.

TomHarrop commented 5 years ago

Hi Robert,

I split the sam into 1,228 chunks, each with 1 million records, and the fastq into corresponding chunks, and ran racon on each chunk individually.
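Roughly along these lines (a sketch rather than my exact commands; seqtk and the filenames here are assumptions):

    # keep the SAM header aside, split the alignment body into 1M-record chunks
    samtools view -H aln.sam > header.sam
    grep -v '^@' aln.sam | split -l 1000000 -d -a 4 - body_

    for b in body_*; do
        cat header.sam "$b" > "${b}.sam"
        # subset the FASTQ to the read names that appear in this chunk
        cut -f1 "$b" | sort -u > "${b}.names"
        seqtk subseq pe_reads.fq "${b}.names" > "${b}.fq"
        racon -t 144 "${b}.fq" "${b}.sam" flye_denovo_full.racon.fasta \
            > "${b}.polished.fasta"
    done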

Some of the runs hung (10 of the 1,228). However, when I re-ran them with the exact same command, they ran successfully.

This seems to be a random, nondeterministic error; I guess with the large amount of data in the full run it gets me every time. Can you think of anything else that's worth trying?
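As a stopgap I may wrap each chunk in a wall-clock timeout and retry it when it hangs (a sketch; the 30-minute limit is a placeholder, not a measured runtime):

    # retry a hung chunk up to 3 times, killing it after 30 minutes of wall time
    for attempt in 1 2 3; do
        timeout 30m racon -t 144 chunk.fq chunk.sam assembly.fasta \
            > chunk.polished.fasta && break
    done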

Not great news, I'm afraid!

Cheers, Tom

rvaser commented 5 years ago

Hi Tom, thanks for the follow-up. I have no idea what could be causing it. Do you mind sharing one of those 10 chunks so I can test locally?

Best regards, Robert

TomHarrop commented 5 years ago

Hi Robert,

Apologies for the long delay on this. It was a bit of work to split the fasta, fastq and SAM into corresponding chunks.

Here's a chunk that hung the first time I ran it, but finished fine the second time: https://github.com/TomHarrop/biopython-index-test/raw/master/chunk.tar.gz

Can you let me know when you have the file so I can remove it from GH?

Thanks, Tom

rvaser commented 5 years ago

Hi Tom, I've downloaded the data; you can remove it.

Sorry for the late response! Best regards, Robert

rvaser commented 5 years ago

Hi Tom, I ran your data on Ubuntu 16.04 ten times in a row and it worked in all cases. No idea what is causing the problem on RHEL 7.6 :/

Best regards, Robert