kaist-ina / stellatrain

Official Github repository for the SIGCOMM '24 paper "Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs"

Segmentation Fault During Test Script Execution #1

Open kuilz opened 1 week ago

kuilz commented 1 week ago

Hello,

First, I would like to thank you for open-sourcing this amazing project. While trying to reproduce the results following the "Run test script" section of the README, I encountered the error messages below, which ultimately crashed the program.

[gpu2-System-Product-Name:306  :0:306] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:305  :0:305] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
32 0x000000000022aac5 _start()  ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV

Initially, I suspected that insufficient memory might be the cause, so I tried the following:

  1. Reduced the batch size.
  2. Ran Docker with increased memory and shared-memory limits (roughly as sketched below).

Unfortunately, these attempts did not resolve the issue.
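
For reference, the run command was along these lines (the image name is my own and the limits are simply the values I tried, not recommendations):

    # Hypothetical invocation: --shm-size raises /dev/shm, --memory caps container RAM
    docker run --gpus all -it --shm-size=16g --memory=64g my_st_img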


Experimental Setup

The experiment is set up on two nodes, each with two GPUs. Specific details are as follows:


Could you please assist me with this issue?

Thank you very much!

wjuni commented 2 days ago

Have you tried building the container in your own environment? The prebuilt container might not be compatible with every environment.

kuilz commented 2 days ago

Thank you very much for your suggestion.

I have tried building the image in my own environment with the following steps:

  1. Navigate to the root directory of the project (where the Dockerfile is located).
  2. Run the following command:
    docker build -t my_st_img .

However, I still encountered the same error, which seems to occur during backend synchronization. The detailed error output is shown below:

Received from master : key=56@bias
Stashing data for 56@bias
Barrier OK
Barrier OK
 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 127/128 [00:50<00:00,  2.63it/s, loss=1.97e+3]Synchronize Backend, num_futures: 16099
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:50<00:00,  2.52it/s, loss=1.98e+3]Synchronize Backend, num_futures: 17050
Terminating FasterDPEngine
Terminating FasterDPEngine
Writing statistics at /tmp/time_log_20241106054943_1107.json, last event was 127@53@conv2.weight
Writing statistics at /tmp/time_log_20241106054943_1106.json, last event was 127@53@conv2.weight
[gpu2-System-Product-Name:1107 :0:1107] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:1106 :0:1106] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:   1106) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004ac11 c10::cuda::CUDAKernelLaunchRegistry::has_failed()  ???:0
 2 0x000000000004ba9d c10::cuda::c10_cuda_check_implementation()  ???:0
 3 0x0000000000014bdc c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events()  CUDACachingAllocator.cpp:0
 4 0x0000000000017fd8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free()  CUDACachingAllocator.cpp:0
 5 0x000000000001838c c10::cuda::CUDACachingAllocator::Native::local_raw_delete()  :0
 6 0x000000000046eb2a c10::StorageImpl::~StorageImpl()  :0
 7 0x0000000000044ead c10::TensorImpl::~TensorImpl()  TensorImpl.cpp:0
 8 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:291
 9 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:274
10 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:370
11 0x00000000000515aa at::TensorBase::~TensorBase()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBase.h:80
12 0x00000000000515aa at::Tensor::~Tensor()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:90
13 0x00000000000515aa std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>::~pair()  /usr/include/c++/11/bits/stl_pair.h:211
14 0x00000000000515aa __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/ext/new_allocator.h:168
15 0x00000000000515aa std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/bits/alloc_traits.h:535
16 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_node()  /usr/include/c++/11/bits/hashtable_policy.h:1894
17 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_nodes()  /usr/include/c++/11/bits/hashtable_policy.h:1916
18 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()  /usr/include/c++/11/bits/hashtable.h:2320
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()  /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map()  /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine()  /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv()  ???:0
23 0x0000000000045610 exit()  ???:0
24 0x0000000000279d5b Py_Exit()  ???:0
25 0x000000000026750f PyGC_Collect()  ???:0
26 0x000000000026743d PyErr_PrintEx()  ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags()  ???:0
28 0x0000000000254cf5 Py_RunMain()  ???:0
29 0x000000000022abcd Py_BytesMain()  ???:0
30 0x0000000000029d90 __libc_init_first()  ???:0
31 0x0000000000029e40 __libc_start_main()  ???:0
32 0x000000000022aac5 _start()  ???:0
=================================
==== backtrace (tid:   1107) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004ac11 c10::cuda::CUDAKernelLaunchRegistry::has_failed()  ???:0
 2 0x000000000004ba9d c10::cuda::c10_cuda_check_implementation()  ???:0
 3 0x0000000000014bdc c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events()  CUDACachingAllocator.cpp:0
 4 0x0000000000017fd8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free()  CUDACachingAllocator.cpp:0
 5 0x000000000001838c c10::cuda::CUDACachingAllocator::Native::local_raw_delete()  :0
 6 0x000000000046eb2a c10::StorageImpl::~StorageImpl()  :0
 7 0x0000000000044ead c10::TensorImpl::~TensorImpl()  TensorImpl.cpp:0
 8 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:291
 9 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:274
10 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:370
11 0x00000000000515aa at::TensorBase::~TensorBase()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBase.h:80
12 0x00000000000515aa at::Tensor::~Tensor()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:90
13 0x00000000000515aa std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>::~pair()  /usr/include/c++/11/bits/stl_pair.h:211
14 0x00000000000515aa __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/ext/new_allocator.h:168
15 0x00000000000515aa std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/bits/alloc_traits.h:535
16 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_node()  /usr/include/c++/11/bits/hashtable_policy.h:1894
17 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_nodes()  /usr/include/c++/11/bits/hashtable_policy.h:1916
18 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()  /usr/include/c++/11/bits/hashtable.h:2320
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()  /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map()  /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine()  /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv()  ???:0
23 0x0000000000045610 exit()  ???:0
24 0x0000000000279d5b Py_Exit()  ???:0
25 0x000000000026750f PyGC_Collect()  ???:0
26 0x000000000026743d PyErr_PrintEx()  ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags()  ???:0
28 0x0000000000254cf5 Py_RunMain()  ???:0
29 0x000000000022abcd Py_BytesMain()  ???:0
30 0x0000000000029d90 __libc_init_first()  ???:0
31 0x0000000000029e40 __libc_start_main()  ???:0
32 0x000000000022aac5 _start()  ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
stellatrain@gpu2-System-Product-Name:/home/stellatrain/explore-dp# 
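
In case it helps, I plan to rerun with the Python fault handler enabled so that a SIGSEGV also dumps the Python-level stack alongside the native backtrace (this is a standard CPython feature, not something specific to this project; the test-script arguments are whatever the README specifies):

    # PYTHONFAULTHANDLER=1 enables faulthandler, which prints the Python traceback on SIGSEGV
    PYTHONFAULTHANDLER=1 python3 backend/test/test_end_to_end.py <arguments from the README>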

Any insights or suggestions you could provide would be greatly appreciated.

kuilz commented 1 day ago

Here are some additional details about the bug.

  1. Regardless of the configured number of epochs, every epoch finishes.

    • For example, with epoch=1.
  2. After all epochs finish, the synchronization step fails.

    • According to the code, synchronization is executed immediately after the last epoch finishes; the command below shows how to locate that point in the source.
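
For reference, the log message can be used to find that point in the code (run from the repository root; the backend/ path is the one that appears in the traceback):

    # Find where the "Synchronize Backend" message is emitted
    grep -rn "Synchronize Backend" backend/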