thuem / THUNDER

A particle-filter framework for robust cryoEM 3D reconstruction
GNU General Public License v2.0

Weird behavior when starting to run #15

Closed tailinhua closed 5 years ago

tailinhua commented 5 years ago

Dear developers, I have two problems while running thunder_gpu (attached: problems.tar.gz). The installation process went well, but when I start running refinement jobs I encounter two problems. The command line for both is: mpirun -n 3 thunder_gpu demo_3D.json. The output of problem 1 goes like this (the output of two ranks is interleaved):

    thunder_gpu: /home/linhua/Programs/THUNDER_self_compile/external/boost/boost/container/vector.hpp:1581: boost::container::vector<T, Allocator>::reference boost::container::vector<T, Allocator>::operator[](boost::container::vector<T, Allocator>::size_type) [with T = Projector; Allocator = boost::container::new_allocator<Projector>; boost::container::vector<T, Allocator>::reference = Projector&; boost::container::vector<T, Allocator>::size_type = long unsigned int]: Assertion `this->m_holder.m_size > n' failed.
    [Guinevere:34959] *** Process received signal ***
    [Guinevere:34959] Signal: Aborted (6)
    [Guinevere:34959] Signal code: (-6)
    thunder_gpu: /home/linhua/Programs/THUNDER_self_compile/external/boost/boost/container/vector.hpp:1581: boost::container::vector<T, Allocator>::reference boost::container::vector<T, Allocator>::operator[](boost::container::vector<T, Allocator>::size_type) [with T = Projector; Allocator = boost::container::new_allocator<Projector>; boost::container::vector<T, Allocator>::reference = Projector&; boost::container::vector<T, Allocator>::size_type = long unsigned int]: Assertion `this->m_holder.m_size > n' failed.
    [Guinevere:34960] *** Process received signal ***
    [Guinevere:34960] Signal: Aborted (6)
    [Guinevere:34960] Signal code: (-6)
    [Guinevere:34959] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7f250dcd36d0]
    [Guinevere:34959] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f24ff7fc277]
    [Guinevere:34959]
    [Guinevere:34960] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7feae6d706d0]
    [Guinevere:34960] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fead8899277]
    [Guinevere:34960] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fead889a968]
    [Guinevere:34960] [ 3] [ 2] /lib64/libc.so.6(+0x2f096)[0x7fead8892096]
    [Guinevere:34960] [ 4] /lib64/libc.so.6(+0x2f142)[0x7fead8892142]
    [Guinevere:34960] [ 5] /lib64/libc.so.6(abort+0x148)[0x7f24ff7fd968]
    [Guinevere:34959] [ 3] /lib64/libc.so.6(+0x2f096)[0x7f24ff7f5096]
    [Guinevere:34959] [ 4] /lib64/libc.so.6(+0x2f142)[0x7f24ff7f5142]
    [Guinevere:34959] [ 5] thunder_gpu[0x48564d]
    [Guinevere:34960] [ 6] thunder_gpu[0x48564d]
    [Guinevere:34959] [ 6] thunder_gpu[0x486251]
    [Guinevere:34960] [ 7] thunder_gpu[0x486251]
    [Guinevere:34959] [ 7] thunder_gpu(_ZN9Optimiser12refreshScaleEbb+0x105a)[0x4a80ba]
    [Guinevere:34960] [ 8] thunder_gpu(_ZN9Optimiser12refreshScaleEbb+0x105a)[0x4a80ba]
    [Guinevere:34959] [ 8] thunder_gpu(_ZN9Optimiser12correctScaleEbbb+0x40)[0x4a9150]
    [Guinevere:34960] [ 9] thunder_gpu(_ZN9Optimiser12correctScaleEbbb+0x40)[0x4a9150]
    [Guinevere:34959] [ 9] thunder_gpu(_ZN9Optimiser4initEv+0x1062)[0x4c64f2]
    [Guinevere:34960] [10] thunder_gpu(_ZN9Optimiser4initEv+0x1062)[0x4c64f2]
    [Guinevere:34959] [10] thunder_gpu(_ZN9Optimiser3runEv+0xbe)[0x4cac7e]
    [Guinevere:34960] [11] thunder_gpu(_ZN9Optimiser3runEv+0xbe)[0x4cac7e]
    [Guinevere:34959] [11] thunder_gpu(main+0x3aa)[0x45b4ca]
    [Guinevere:34960] [12] thunder_gpu(main+0x3aa)[0x45b4ca]
    [Guinevere:34959] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fead8885445]
    [Guinevere:34960] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f24ff7e8445]
    [Guinevere:34959] [13] thunder_gpu[0x45f71a]
    [Guinevere:34960] *** End of error message ***
    thunder_gpu[0x45f71a]
    [Guinevere:34959] *** End of error message ***

    Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

    mpirun noticed that process rank 1 with PID 0 on node Guinevere exited on signal 6 (Aborted).

I have renamed the json file and the output log file to problem1.XXX

The output of problem 2 goes like this:

    01/07/2019 10:22:19.717 WARN [LOGGER_SYS] MASTER: _rS is Larger than _r, Set _rS to _r
    [Guinevere:13778] *** Process received signal ***
    [Guinevere:13778] Signal: Floating point exception (8)
    [Guinevere:13778] Signal code: Integer divide-by-zero (1)
    [Guinevere:13778] Failing at address: 0x4c0e2f
    [Guinevere:13778] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7fde05ad36d0]
    [Guinevere:13778] [ 1] thunder_gpu(_ZN9Optimiser12expectationGEv+0x74f)[0x4c0e2f]
    [Guinevere:13778] [ 2] thunder_gpu(_ZN9Optimiser3runEv+0x194a)[0x4cbfba]
    [Guinevere:13778] [ 3] thunder_gpu(main+0x3aa)[0x45b4ba]
    [Guinevere:13778] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fddf75e8445]
    [Guinevere:13778] [ 5] thunder_gpu[0x45f70a]
    [Guinevere:13778] *** End of error message ***
    [Guinevere:13779] *** Process received signal ***
    [Guinevere:13779] Signal: Floating point exception (8)
    [Guinevere:13779] Signal code: Integer divide-by-zero (1)
    [Guinevere:13779] Failing at address: 0x4c0e2f
    [Guinevere:13779] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7f6e4b90d6d0]
    [Guinevere:13779] [ 1] thunder_gpu(_ZN9Optimiser12expectationGEv+0x74f)[0x4c0e2f]
    [Guinevere:13779] [ 2] thunder_gpu(_ZN9Optimiser3runEv+0x194a)[0x4cbfba]
    [Guinevere:13779] [ 3] thunder_gpu(main+0x3aa)[0x45b4ba]
    [Guinevere:13779] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6e3d422445]
    [Guinevere:13779] [ 5] thunder_gpu[0x45f70a]
    [Guinevere:13779] *** End of error message ***

    Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

    mpiexec noticed that process rank 2 with PID 0 on node Guinevere exited on signal 8 (Floating point exception).

I have renamed the json file and the output log file to problem2.XXX

My workstation has two Intel(R) Xeon(R) E5-2640 v4 CPUs, plus one GTX 1080 Ti and two Titan X (Pascal) GPUs. Thank you very much in advance!

Zarrathustra commented 5 years ago

Sorry for the delay.

"Number of Threads Per Process" should be equal or larger than the number of GPU devices of the node the process is running on.

In your case, you were running 3 processes on a single node (maybe I am wrong), so "Number of Threads Per Process" should be at least the number of GPU devices on this node (3 in your case).

However, we recommend setting "Number of Threads Per Process" to the number of CPU cores on this node, or a multiple of that.
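
For example, a minimal sketch of the relevant entry in your demo_3D.json, assuming your node has 20 physical cores (two 10-core E5-2640 v4 CPUs) and 3 GPUs; the rest of the file is omitted here, and the exact position of this key inside your config may differ:

    {
        "Number of Threads Per Process": 20
    }

With a value like 20, each of your 3 MPI processes satisfies the "threads per process >= number of GPUs (3)" requirement, and the launch command stays the same: mpirun -n 3 thunder_gpu demo_3D.json.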

Regards,

Mingxu