Closed: tailinhua closed this issue 5 years ago
Sorry for the delay.
"Number of Threads Per Process" should be equal to or larger than the number of GPU devices on the node the process is running on.
In your case, you were running 3 processes on a single node (I may be wrong), so "Number of Threads Per Process" should be at least the number of GPU devices on this node.
However, we recommend setting "Number of Threads Per Process" to the number of CPU cores of this node, or to an integer multiple of that number.
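As a rough illustration of the rule above, the following Python sketch picks a value that satisfies both conditions: at least the number of GPUs on the node, and preferably the CPU core count or a multiple of it. The function name `suggest_threads_per_process` and its parameters are hypothetical helpers, not part of THUNDER; the GPU count has to be supplied by hand, and `os.cpu_count()` reports logical CPUs, which may differ from physical cores.

```python
import os


def suggest_threads_per_process(gpus_on_node, cpu_cores=None, multiple=1):
    """Suggest a value for "Number of Threads Per Process".

    Hypothetical helper (not a THUNDER API). It returns the CPU core count
    times `multiple`, but never less than the number of GPUs on the node.
    """
    if cpu_cores is None:
        # os.cpu_count() counts logical CPUs and may return None.
        cpu_cores = os.cpu_count() or 1
    if gpus_on_node < 0 or cpu_cores < 1 or multiple < 1:
        raise ValueError("counts must be non-negative and multiple >= 1")
    return max(gpus_on_node, cpu_cores * multiple)


# Example with made-up numbers: a node with 3 GPUs and 20 CPU cores.
print(suggest_threads_per_process(gpus_on_node=3, cpu_cores=20))
```

The resulting number would then be entered by hand as the "Number of Threads Per Process" value in the job's JSON file.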
Regards,
Mingxu
Dear developers, I have run into two problems while running thunder_gpu (attachment: problems.tar.gz). The installation went well, but when I started running refinement jobs I encountered two problems. The command line for both was: mpirun -n 3 thunder_gpu demo_3D.json
The output of problem 1 is:
thunder_gpu: /home/linhua/Programs/THUNDER_self_compile/external/boost/boost/container/vector.hpp:1581: boost::container::vector<T, Allocator>::reference boost::container::vector<T, Allocator>::operator[](boost::container::vector<T, Allocator>::size_type) [with T = Projector; Allocator = boost::container::new_allocator<Projector>; boost::container::vector<T, Allocator>::reference = Projector&; boost::container::vector<T, Allocator>::size_type = long unsigned int]: Assertion `this->m_holder.m_size > n' failed.
[Guinevere:34959] *** Process received signal ***
[Guinevere:34959] Signal: Aborted (6)
[Guinevere:34959] Signal code: (-6)
thunder_gpu: /home/linhua/Programs/THUNDER_self_compile/external/boost/boost/container/vector.hpp:1581: boost::container::vector<T, Allocator>::reference boost::container::vector<T, Allocator>::operator[](boost::container::vector<T, Allocator>::size_type) [with T = Projector; Allocator = boost::container::new_allocator<Projector>; boost::container::vector<T, Allocator>::reference = Projector&; boost::container::vector<T, Allocator>::size_type = long unsigned int]: Assertion `this->m_holder.m_size > n' failed.
[Guinevere:34960] Process received signal
[Guinevere:34960] Signal: Aborted (6)
[Guinevere:34960] Signal code: (-6)
[Guinevere:34959] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7f250dcd36d0]
[Guinevere:34959] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f24ff7fc277]
[Guinevere:34959]
[Guinevere:34960] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7feae6d706d0]
[Guinevere:34960] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fead8899277]
[Guinevere:34960] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fead889a968]
[Guinevere:34960] [ 3] [ 2] /lib64/libc.so.6(+0x2f096)[0x7fead8892096]
[Guinevere:34960] [ 4] /lib64/libc.so.6(+0x2f142)[0x7fead8892142]
[Guinevere:34960] [ 5] /lib64/libc.so.6(abort+0x148)[0x7f24ff7fd968]
[Guinevere:34959] [ 3] /lib64/libc.so.6(+0x2f096)[0x7f24ff7f5096]
[Guinevere:34959] [ 4] /lib64/libc.so.6(+0x2f142)[0x7f24ff7f5142]
[Guinevere:34959] [ 5] thunder_gpu[0x48564d]
[Guinevere:34960] [ 6] thunder_gpu[0x48564d]
[Guinevere:34959] [ 6] thunder_gpu[0x486251]
[Guinevere:34960] [ 7] thunder_gpu[0x486251]
[Guinevere:34959] [ 7] thunder_gpu(_ZN9Optimiser12refreshScaleEbb+0x105a)[0x4a80ba]
[Guinevere:34960] [ 8] thunder_gpu(_ZN9Optimiser12refreshScaleEbb+0x105a)[0x4a80ba]
[Guinevere:34959] [ 8] thunder_gpu(_ZN9Optimiser12correctScaleEbbb+0x40)[0x4a9150]
[Guinevere:34960] [ 9] thunder_gpu(_ZN9Optimiser12correctScaleEbbb+0x40)[0x4a9150]
[Guinevere:34959] [ 9] thunder_gpu(_ZN9Optimiser4initEv+0x1062)[0x4c64f2]
[Guinevere:34960] [10] thunder_gpu(_ZN9Optimiser4initEv+0x1062)[0x4c64f2]
[Guinevere:34959] [10] thunder_gpu(_ZN9Optimiser3runEv+0xbe)[0x4cac7e]
[Guinevere:34960] [11] thunder_gpu(_ZN9Optimiser3runEv+0xbe)[0x4cac7e]
[Guinevere:34959] [11] thunder_gpu(main+0x3aa)[0x45b4ca]
[Guinevere:34960] [12] thunder_gpu(main+0x3aa)[0x45b4ca]
[Guinevere:34959] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fead8885445]
[Guinevere:34960] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f24ff7e8445]
[Guinevere:34959] [13] thunder_gpu[0x45f71a]
[Guinevere:34960] End of error message thunder_gpu[0x45f71a]
[Guinevere:34959] End of error message
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 1 with PID 0 on node Guinevere exited on signal 6 (Aborted).
I have renamed the json file and the output log file to problem1.XXX
The output for problem 2 is:
01/07/2019 10:22:19.717 WARN [LOGGER_SYS] MASTER: _rS is Larger than _r, Set _rS to _r
[Guinevere:13778] Process received signal
[Guinevere:13778] Signal: Floating point exception (8)
[Guinevere:13778] Signal code: Integer divide-by-zero (1)
[Guinevere:13778] Failing at address: 0x4c0e2f
[Guinevere:13778] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7fde05ad36d0]
[Guinevere:13778] [ 1] thunder_gpu(_ZN9Optimiser12expectationGEv+0x74f)[0x4c0e2f]
[Guinevere:13778] [ 2] thunder_gpu(_ZN9Optimiser3runEv+0x194a)[0x4cbfba]
[Guinevere:13778] [ 3] thunder_gpu(main+0x3aa)[0x45b4ba]
[Guinevere:13778] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fddf75e8445]
[Guinevere:13778] [ 5] thunder_gpu[0x45f70a]
[Guinevere:13778] End of error message
[Guinevere:13779] Process received signal
[Guinevere:13779] Signal: Floating point exception (8)
[Guinevere:13779] Signal code: Integer divide-by-zero (1)
[Guinevere:13779] Failing at address: 0x4c0e2f
[Guinevere:13779] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7f6e4b90d6d0]
[Guinevere:13779] [ 1] thunder_gpu(_ZN9Optimiser12expectationGEv+0x74f)[0x4c0e2f]
[Guinevere:13779] [ 2] thunder_gpu(_ZN9Optimiser3runEv+0x194a)[0x4cbfba]
[Guinevere:13779] [ 3] thunder_gpu(main+0x3aa)[0x45b4ba]
[Guinevere:13779] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6e3d422445]
[Guinevere:13779] [ 5] thunder_gpu[0x45f70a]
[Guinevere:13779] End of error message
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec noticed that process rank 2 with PID 0 on node Guinevere exited on signal 8 (Floating point exception).
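The frames in the backtraces above carry mangled C++ names such as `_ZN9Optimiser12expectationGEv`. A tool like `c++filt` is the usual way to demangle them; as a minimal sketch, the Python below decodes only the simplest nested-name form (`_ZN`, then length-prefixed identifiers, then `E`) and ignores argument types, templates, and everything else. The helper names `decode_nested_name` and `decoded_frames` are hypothetical and written only for this illustration.

```python
import re

# Matches "(<mangled>+0x<offset>)" as it appears in the backtrace frames.
FRAME_RE = re.compile(r"\((_ZN\w+)\+0x[0-9a-fA-F]+\)")


def decode_nested_name(mangled):
    """Decode `_ZN<len><id>...E...` into `A::b`; return None for other forms."""
    if not mangled.startswith("_ZN"):
        return None
    pos, parts = 3, []
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        ident = mangled[end:end + length]
        if len(ident) < length:
            return None  # truncated name
        parts.append(ident)
        pos = end + length
    # A simple nested name must be closed by 'E'.
    if not parts or pos >= len(mangled) or mangled[pos] != "E":
        return None
    return "::".join(parts)


def decoded_frames(log_text):
    """Return the decodable function names in order of first appearance."""
    seen = []
    for mangled in FRAME_RE.findall(log_text):
        name = decode_nested_name(mangled)
        if name and name not in seen:
            seen.append(name)
    return seen


sample = (
    "[ 1] thunder_gpu(_ZN9Optimiser12expectationGEv+0x74f)[0x4c0e2f] "
    "[ 2] thunder_gpu(_ZN9Optimiser3runEv+0x194a)[0x4cbfba] "
    "[ 3] thunder_gpu(main+0x3aa)[0x45b4ba]"
)
print(decoded_frames(sample))  # ['Optimiser::expectationG', 'Optimiser::run']
```

Read this way, the problem 2 crash sits in `Optimiser::expectationG`, called from `Optimiser::run`, while the problem 1 abort is reached through `Optimiser::init`, `Optimiser::correctScale`, and `Optimiser::refreshScale`.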
I have renamed the json file and the output log file to problem2.XXX
My workstation has two Intel(R) Xeon(R) E5-2640 v4 CPUs, plus one GTX 1080 Ti and two Titan X (Pascal) GPUs. Thank you very much in advance!