sequencing / isaac_aligner

Isaac Genome Alignment Software

execution error -- libgomp: Thread creation failed: Resource temporarily unavailable #10

Closed grendon closed 9 years ago

grendon commented 10 years ago

The reference genome is TAIR10 for Arabidopsis thaliana, size 119481543. The Sakata read files come from this URL: http://1001genomes.org/data/JGI/JGIHeazlewood2011/releases/current/TAIR10/strains/

PE reads 60173258

read length 100 avg depth 36.87 std depth 313.73

Several benchmarking jobs were submitted to an SGI UV1000 node with 384 Intel Xeon X7542 cores and 2 TB of memory. I assigned 150 GB of memory to each job. The number of cores assigned to each job varied: 36, 24, 16, 8. All but one job finished successfully; the job with 36 cores failed. Below are the last few lines of the error log.

2014-07-25 14:56:11 [2b53393de700] Opened fastq stream on /home/a-m/grendon/tair-isaac-pipeline_test/index/sakata_reads/lane1_read1.fastq
2014-07-25 14:56:11 [2b53393de700] Opened fastq stream on /home/a-m/grendon/tair-isaac-pipeline_test/index/sakata_reads/lane1_read2.fastq
2014-07-25 14:56:11 [2b53393de700] Resetting Fastq data for 5000000 clusters
2014-07-25 14:56:13 [2b53393de700] Resetting Fastq data done for 5000000 clusters
2014-07-25 14:56:54 [2b53393de700] Loading Fastq data done. Loaded 5000000 clusters for TileMetadata(1101, 1, 1, 5000000, 0)
2014-07-25 14:56:54 [2b53393de700] Sorting matches by barcode for TileMetadata(1101, 1, 1, 5000000, 0)
2014-07-25 14:56:54 [2b53391dd700] Loading matches for TileMetadata(1101, 2, 1, 5000000, 1)

libgomp: Thread creation failed: Resource temporarily unavailable

rpetrovski commented 10 years ago

This failure is typically a result of going over the system-set limits. For example, if the thread stack size times the number of threads iSAAC attempts to create goes over the ulimit -v value, it will fail to create threads. Can you please post the ulimit -a output?
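The arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not something iSAAC does itself: `stacks_fit` and `describe` are hypothetical helper names, the check ignores every allocation other than thread stacks, and it assumes each thread gets a stack equal to the soft `ulimit -s` value (libgomp can be told otherwise through OMP_STACKSIZE).

```python
import resource

def describe(limit):
    """Render an rlimit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)

def stacks_fit(n_threads, stack_bytes, address_space_bytes):
    """Return True if n_threads stacks of stack_bytes fit under the limit.

    Hypothetical helper: it counts thread stacks only, so a True result
    is optimistic; a False result means thread creation is likely to fail.
    """
    if address_space_bytes == resource.RLIM_INFINITY:
        return True
    return n_threads * stack_bytes <= address_space_bytes

# Soft limits of the current process, in bytes (the shell's `ulimit -v`
# and `ulimit -s` print the same limits in KiB).
as_soft, _ = resource.getrlimit(resource.RLIMIT_AS)
stack_soft, _ = resource.getrlimit(resource.RLIMIT_STACK)
print("address space (ulimit -v):", describe(as_soft))
print("stack size    (ulimit -s):", describe(stack_soft))

# RLIMIT_NPROC (ulimit -u) caps processes/threads per user on Linux;
# it is not defined on every platform, hence the guard.
nproc = getattr(resource, "RLIMIT_NPROC", None)
if nproc is not None:
    print("max user processes (ulimit -u):", describe(resource.getrlimit(nproc)[0]))

# Only run the check when the stack limit is a concrete number.
if stack_soft != resource.RLIM_INFINITY:
    print("36 thread stacks fit:", stacks_fit(36, stack_soft, as_soft))
```

If either the address-space limit or the per-user process limit is low on a shared node, "Resource temporarily unavailable" can appear well before physical RAM or core count is exhausted.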

I'm not familiar with SGI UV 1000. From a random specification on the web I gather the maximum number of cores per compute node you can have is 16. Am I correct?

It looks like you've tried to override the default iSAAC threading with the -j option. Is that the case? There is usually no reason to do that unless you are debugging or working around a poorly built system. iSAAC picks up the number of compute threads from the number of hardware threads supported by the system. Going above that will cause more threads to access memory concurrently than the system is designed for. This will result in L1/L2/L3 cache thrashing and therefore suboptimal performance. Setting lower values can be justified in cases when there isn't enough RAM to accommodate the per-thread memory allocations iSAAC does. However, 150G of memory is way more than enough with the rest of the iSAAC options set to their defaults. Do you actually have 150G of RAM physically available on the node?

lsmainzer commented 9 years ago

Rpetrovski:

Sorry about the delay in replying to you. These nodes have 384 Intel Xeon X7542 @ 2.67 GHz CPUs per node, and 2 TB of RAM per node. Thus, we are not exceeding the number of cores on the node by asking iSAAC aligner to use 48 threads. We are also not exceeding the available RAM.

However, I notice that iSAAC does not tend to respect the user-set number of threads specified to it on the command line using option -j. For example, when I ran tests on a different computer, which has 48 dual-threaded cores, I specified -j 48, but from reading the logs I notice that in fact all 96 virtual threads were used. Is this expected behavior, or am I seeing a bug?

The "libgomp: Thread creation failed" error shows up specifically when running on a shared cluster node. I understand iSAAC was designed to run alone on a node, so maybe that is why we are seeing this error. For example, if a node has 384 cores but we want iSAAC to use only 48, we specify 48 via the -j option and also tell the PBS script to submit the iSAAC job with a limit of 48 threads, so that other users can utilize the other threads. However, iSAAC appears to ignore these limitations: it not only ignores the -j option, but also does not comply with the scheduler's limits.
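As an illustration of the gap between what the hardware reports and what a job is actually allowed to use, here is a minimal Python sketch. It assumes a Linux node where the scheduler restricts the job through CPU affinity or cpusets, which depends on how PBS is configured at a given site; `usable_cpus` and `pick_thread_count` are hypothetical names, not part of iSAAC.

```python
import os

def usable_cpus():
    """CPUs this process may run on; falls back to the logical CPU count."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

def pick_thread_count(requested=None):
    """Clamp a user request (like a -j value) to the CPUs actually usable."""
    allowed = usable_cpus()
    if requested is None:
        return allowed
    return max(1, min(requested, allowed))

print("logical CPUs on the node:", os.cpu_count())
print("CPUs usable by this job: ", usable_cpus())
print("threads for a request of 48:", pick_thread_count(48))
```

A program that sizes its thread pools from the first number alone will start one thread per hardware thread on the node, whatever the scheduler allocated to the job.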

I think it is reasonable to expect software to use all resources on a node it is running on. However, I think we would never have seen the "libgomp: Thread creation failed" problem if iSAAC did not try to use more than the 48 threads specified with the "-j 48" setting. This does not seem to be correct behavior, since it negates the entire reason for having the -j option.

Are you aware of this behavior? Have you ever encountered it? We would be happy to run more tests to clarify.

Thank you very much, Luda

rpetrovski commented 9 years ago

True, the iSAAC-01 does not limit the number of threads __gnu_parallel::sort uses. I'll try to come up with some sort of workaround in January.

Roman.

rpetrovski commented 9 years ago

Replaced GNU parallel sort with a home-made one. Please try iSAAC-01.15.01.28.

Roman.

chunhualiao commented 8 years ago

I just saw a similar error with gcc 4.9.2. The input is a simple OpenMP program with nested parallelism. The problem is that the workstation has 72 logical processors, so the program tries to start 72*72 threads by default, which triggers "libgomp: Thread creation failed: Resource temporarily unavailable".

The solution is to limit the number of threads at each level of parallelism, using the num_threads() clause.
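The thread count behind that failure is simple multiplication, sketched here in Python. `total_threads` is a hypothetical helper, and the team sizes other than 72 are example values chosen for illustration; the sketch only does the arithmetic and does not start any threads.

```python
from math import prod

def total_threads(team_sizes):
    """Threads alive at the innermost level of nested parallel regions.

    team_sizes[i] is the team size at nesting level i; every thread of
    one level becomes the master of its own team at the next level.
    """
    return prod(team_sizes)

print(total_threads([72, 72]))  # default on 72 logical processors: 5184
print(total_threads([72, 4]))   # inner level capped at 4 threads: 288
print(total_threads([72, 1]))   # inner level effectively serial: 72
```

Capping the inner level keeps the total within the per-user process limit and the memory needed for thread stacks.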