stevenybw opened 3 years ago
Can you please try with `export UCX_ERROR_SIGNALS=` on the driver and `--conf spark.executorEnv.UCX_ERROR_SIGNALS=` in the Spark conf?
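A minimal sketch of how these two settings might be combined in one submit command. The master URL, class name, and jar name are placeholders, not taken from the thread; an empty `UCX_ERROR_SIGNALS` disables UCX's own signal handlers so they do not interfere with the JVM's signal handling.

```shell
# Driver side: unset UCX error-signal handling (empty value on purpose).
export UCX_ERROR_SIGNALS=

# Executor side: propagate the same empty value via the executor environment.
# Master URL, class, and jar below are illustrative placeholders.
spark-submit \
  --master spark://master:7077 \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --class com.example.WordCount \
  wordcount.jar
```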
After adding `export UCX_ERROR_SIGNALS=`
and `export SPARK_UCX_HOME=$HOME/sparkucx/target`
together with the Spark configuration, the problem still exists. The JVM terminates with SIGSEGV with the following stack trace:
```
--------------- T H R E A D ---------------
Current thread (0x00007fc2fc02e000): GCTaskThread "GC Thread#27" [stack: 0x00007fc2cd2b9000,0x00007fc2cd3b9000] [id=261425]
Stack: [0x00007fc2cd2b9000,0x00007fc2cd3b9000], sp=0x00007fc2cd3b7b70, free space=1018k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x5dbe40] ClassLoaderDataGraph::roots_cld_do(CLDClosure*, CLDClosure*)+0x20
V [libjvm.so+0x7d2a15] G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x65
V [libjvm.so+0x7d319e] G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x9e
V [libjvm.so+0x78285c] G1ParTask::work(unsigned int)+0xec
V [libjvm.so+0xea176d] GangWorker::loop()+0x4d
V [libjvm.so+0xe0acaa] Thread::call_run()+0x13a
V [libjvm.so+0xc5293e] thread_native_entry(Thread*)+0xee
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00000000000001d0
```
We tried constructing a smaller dataset (11 GB), shrinking the size about 4x; the result is correct and there is no SIGSEGV. However, every time we experiment with the larger dataset (44 GB), the problem is stably reproduced. This suggests a buffer overflow somewhere.
Can you please send the hs_err.log file with the whole stack trace?
Does it work with the same parameters and not using ucx? Does it work with `-XX:+UseParallelGC`?
Does it work with the same parameters and not using ucx?
Yes. I've verified that just now, without sparkucx it is fine.
Does it work with -XX:+UseParallelGC?
No. Adding `-XX:+UseParallelGC` to both the driver and the executor doesn't help; the problem still exists.
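For reference, a sketch of how the GC flag is typically passed to both sides; the exact submit command is not shown in the thread, and the class and jar names are placeholders.

```shell
# Switch both driver and executors from G1 to the parallel collector.
# Class and jar are illustrative placeholders.
spark-submit \
  --driver-java-options "-XX:+UseParallelGC" \
  --conf spark.executor.extraJavaOptions=-XX:+UseParallelGC \
  --class com.example.WordCount \
  wordcount.jar
```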
Does it happen at the beginning of the job, in the map phase, the reduce phase, or at the end? Do you see something in `dmesg`? SparkUCX is doing memory mapping of the file, so it requires a lot of virtual memory. Can you try with `--conf spark.shuffle.ucx.memory.useOdp=true`?
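A minimal sketch of the suggested on-demand paging (ODP) setting; ODP lets the RDMA device fault pages in on demand instead of pinning the whole mapping up front. The class and jar names are placeholders.

```shell
# Enable ODP for UCX memory registration in the SparkUCX shuffle manager.
# Class and jar are illustrative placeholders.
spark-submit \
  --conf spark.shuffle.ucx.memory.useOdp=true \
  --class com.example.WordCount \
  wordcount.jar
```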
Does it happen at the beginning of the job, at map phase or reduce phase or at the end?
Some tasks can complete; for example, 447 of 448 tasks in the map stage succeed but the last one fails. Both can happen: sometimes the map phase, sometimes the reduce phase.
Do you see something in dmesg?
Rarely, we observe the following on the driver side, triggered by SparkUCX:
```
[Tue May 11 22:16:34 2021] ib_umem_get: failed to get user pages, nr_pages=512
[Tue May 11 22:16:34 2021] mlx5_0:mr_umem_get:713:(pid 3268577): umem get failed (-512)
```
Can you try with `--conf spark.shuffle.ucx.memory.useOdp=true`?
The problem still exists.
We found an interesting phenomenon: when increasing the number of reducers from 224 to 448, the word count produces the correct result. Moreover, the 224-reducer configuration always ends with SIGSEGV, while the 448-reducer configuration always produces the correct result. Going from 224 to 448 partitions reduces the message size sent from each mapper to each reducer. We guess it is possible that a fixed-size buffer overflows somewhere. Hope this information is useful.
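One way to reproduce the two configurations without changing the application is to steer the shuffle partition count from the submit command; this is a sketch under the assumption that the word count uses the default RDD parallelism (the class and jar names are placeholders).

```shell
# 448 reduce partitions (the configuration reported to work); using 224 here
# reproduces the failing configuration. Class and jar are placeholders.
spark-submit \
  --conf spark.default.parallelism=448 \
  --class com.example.WordCount \
  wordcount.jar
```

Halving the partition count doubles the per-reducer message size, which is what makes this a useful probe for a fixed-size buffer.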
Configuration
```
./contrib/configure-release --with-java
```
Spark launch commandline
Scala application:
Phenomena
In the first stage, 447 of the total 448 tasks finished. After that, the Java runtime is terminated by SIGSEGV as follows:
With the hs_err_pid3253764.log: