kuilz opened this issue 1 month ago
Have you tried building the container in your environment? The prebuilt container might not be compatible with every environment.
Thank you very much for your suggestion.
I have tried building the image in my own environment with the following command:
docker build -t my_st_img .
However, I still encountered the same error, which seems to occur during the backend synchronization step. The detailed error output is shown below:
Received from master : key=56@bias
Stashing data for 56@bias
Barrier OK
Barrier OK
99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 127/128 [00:50<00:00, 2.63it/s, loss=1.97e+3]Synchronize Backend, num_futures: 16099
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:50<00:00, 2.52it/s, loss=1.98e+3]Synchronize Backend, num_futures: 17050
Terminating FasterDPEngine
Terminating FasterDPEngine
Writing statistics at /tmp/time_log_20241106054943_1107.json, last event was 127@53@conv2.weight
Writing statistics at /tmp/time_log_20241106054943_1106.json, last event was 127@53@conv2.weight
[gpu2-System-Product-Name:1107 :0:1107] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:1106 :0:1106] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 1106) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000004ac11 c10::cuda::CUDAKernelLaunchRegistry::has_failed() ???:0
2 0x000000000004ba9d c10::cuda::c10_cuda_check_implementation() ???:0
3 0x0000000000014bdc c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events() CUDACachingAllocator.cpp:0
4 0x0000000000017fd8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free() CUDACachingAllocator.cpp:0
5 0x000000000001838c c10::cuda::CUDACachingAllocator::Native::local_raw_delete() :0
6 0x000000000046eb2a c10::StorageImpl::~StorageImpl() :0
7 0x0000000000044ead c10::TensorImpl::~TensorImpl() TensorImpl.cpp:0
8 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:291
9 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:274
10 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr() /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:370
11 0x00000000000515aa at::TensorBase::~TensorBase() /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBase.h:80
12 0x00000000000515aa at::Tensor::~Tensor() /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:90
13 0x00000000000515aa std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>::~pair() /usr/include/c++/11/bits/stl_pair.h:211
14 0x00000000000515aa __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >() /usr/include/c++/11/ext/new_allocator.h:168
15 0x00000000000515aa std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >() /usr/include/c++/11/bits/alloc_traits.h:535
16 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_node() /usr/include/c++/11/bits/hashtable_policy.h:1894
17 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_nodes() /usr/include/c++/11/bits/hashtable_policy.h:1916
18 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() /usr/include/c++/11/bits/hashtable.h:2320
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map() /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine() /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv() ???:0
23 0x0000000000045610 exit() ???:0
24 0x0000000000279d5b Py_Exit() ???:0
25 0x000000000026750f PyGC_Collect() ???:0
26 0x000000000026743d PyErr_PrintEx() ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags() ???:0
28 0x0000000000254cf5 Py_RunMain() ???:0
29 0x000000000022abcd Py_BytesMain() ???:0
30 0x0000000000029d90 __libc_init_first() ???:0
31 0x0000000000029e40 __libc_start_main() ???:0
32 0x000000000022aac5 _start() ???:0
=================================
==== backtrace (tid: 1107) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000004ac11 c10::cuda::CUDAKernelLaunchRegistry::has_failed() ???:0
2 0x000000000004ba9d c10::cuda::c10_cuda_check_implementation() ???:0
3 0x0000000000014bdc c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events() CUDACachingAllocator.cpp:0
4 0x0000000000017fd8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free() CUDACachingAllocator.cpp:0
5 0x000000000001838c c10::cuda::CUDACachingAllocator::Native::local_raw_delete() :0
6 0x000000000046eb2a c10::StorageImpl::~StorageImpl() :0
7 0x0000000000044ead c10::TensorImpl::~TensorImpl() TensorImpl.cpp:0
8 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:291
9 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:274
10 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr() /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:370
11 0x00000000000515aa at::TensorBase::~TensorBase() /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBase.h:80
12 0x00000000000515aa at::Tensor::~Tensor() /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:90
13 0x00000000000515aa std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>::~pair() /usr/include/c++/11/bits/stl_pair.h:211
14 0x00000000000515aa __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >() /usr/include/c++/11/ext/new_allocator.h:168
15 0x00000000000515aa std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >() /usr/include/c++/11/bits/alloc_traits.h:535
16 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_node() /usr/include/c++/11/bits/hashtable_policy.h:1894
17 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_nodes() /usr/include/c++/11/bits/hashtable_policy.h:1916
18 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() /usr/include/c++/11/bits/hashtable.h:2320
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map() /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine() /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv() ???:0
23 0x0000000000045610 exit() ???:0
24 0x0000000000279d5b Py_Exit() ???:0
25 0x000000000026750f PyGC_Collect() ???:0
26 0x000000000026743d PyErr_PrintEx() ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags() ???:0
28 0x0000000000254cf5 Py_RunMain() ???:0
29 0x000000000022abcd Py_BytesMain() ???:0
30 0x0000000000029d90 __libc_init_first() ???:0
31 0x0000000000029e40 __libc_start_main() ???:0
32 0x000000000022aac5 _start() ???:0
=================================
Traceback (most recent call last):
File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
mp.spawn(method,
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
stellatrain@gpu2-System-Product-Name:/home/stellatrain/explore-dp#
Any insights or suggestions you could provide would be greatly appreciated.
Here are some additional details about the bug.
Regardless of the number of epochs, every epoch finished.
After all epochs finished, the synchronization step failed.
Based on the complete error messages above, I noticed the following portion of the backtrace:
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map() /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine() /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv() ???:0
23 0x0000000000045610 exit() ???:0
24 0x0000000000279d5b Py_Exit() ???:0
25 0x000000000026750f PyGC_Collect() ???:0
26 0x000000000026743d PyErr_PrintEx() ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags() ???:0
28 0x0000000000254cf5 Py_RunMain() ???:0
29 0x000000000022abcd Py_BytesMain() ???:0
30 0x0000000000029d90 __libc_init_first() ???:0
31 0x0000000000029e40 __libc_start_main() ???:0
32 0x000000000022aac5 _start() ???:0
=================================
From this section of the backtrace, it appears that the error occurs in the destructor of FasterDpEngine, specifically during the cleanup of the Tensor objects held in an unordered_map, where a segmentation fault is triggered.
Possible causes:
Attempted solutions:
- Called `reset()` on the smart pointers after stopping their threads, to release them.
- Called `cudaDeviceSynchronize()` to ensure all CUDA operations are completed.
- Cleared the tensor-holding maps in order, with mutex protection.
Updated code:
FasterDpEngine::~FasterDpEngine() {
    std::cout << "Terminating FasterDPEngine" << std::endl;
    finished_ = true;

    // Signal, join, and release each worker thread before touching shared state.
    if (barrier_manager_thread_ != nullptr) {
        pthread_cond_broadcast(&shared_props_->barrier_ipc_cond_);
        barrier_manager_thread_->join();
        barrier_manager_thread_.reset();
    }
    if (chore_manager_thread_ != nullptr) {
        finished_cond_.notify_all();
        chore_manager_thread_->join();
        chore_manager_thread_.reset();
    }
    if (model_complete_manager_thread_ != nullptr) {
        layer_model_completed_version_map_cond_.notify_all();
        model_complete_manager_thread_->join();
        model_complete_manager_thread_.reset();
    }
    if (cpu_shmem_return_manager_thread_ != nullptr) {
        cpu_shmem_use_map_cond_.notify_all();
        cpu_shmem_return_manager_thread_->join();
        cpu_shmem_return_manager_thread_.reset();
    }
    if (backward_delegate_thread_ != nullptr) {
        backward_delegate_cond_.notify_all();
        backward_delegate_thread_->join();
        backward_delegate_thread_.reset();
    }

    // Wait for outstanding CUDA work, then clear the tensor maps under their mutexes.
    cudaDeviceSynchronize();
    {
        std::lock_guard<std::mutex> lock(map_gpu_param_tensor_mutex_);
        map_gpu_param_tensor_.clear();
    }
    {
        std::lock_guard<std::mutex> lock(map_gpu_grad_tensor_mutex_);
        map_gpu_grad_tensor_.clear();
    }
    {
        std::lock_guard<std::mutex> lock(map_cpu_param_tensor_mutex_);
        map_cpu_param_tensor_.clear();
    }

#if ENABLE_STAT
    stat_export();
#endif

    // Release the remaining engine components.
    compressor_.reset();
    sparse_optimizer_.reset();
    thread_pool_.reset();
    comm_manager_.reset();
    lst_futures_.reset();
    shm_manager_.reset();
}
However, the above modifications led to a new error: CUDA error: driver shutting down
Barrier OK
75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 3/4 [00:06<00:02, 2.86s/it, loss=258]Synchronize Backend, num_futures: 20044
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00, 1.80s/it, loss=503]Synchronize Backend, num_futures: 22528
Terminating FasterDPEngine
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x79218efb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x79218ef6b69d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x792191e20e12 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x14bdc (0x792191de9bdc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x17fd8 (0x792191decfd8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1838c (0x792191ded38c in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x46eb2a (0x792148496b2a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x79218ef8eead in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() + 0xda (0x7920e126e3aa in /home/stellatrain/explore-dp/backend/build/libfasterdp_core.so)
frame #9: FasterDpEngine::~FasterDpEngine() + 0x199 (0x7920e1266b29 in /home/stellatrain/explore-dp/backend/build/libfasterdp_core.so)
frame #10: <unknown function> + 0x45495 (0x7921929ad495 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #11: on_exit + 0 (0x7921929ad610 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x279d5b (0x576c74d04d5b in /usr/bin/python)
frame #13: <unknown function> + 0x26750f (0x576c74cf250f in /usr/bin/python)
frame #14: PyErr_PrintEx + 0x1d (0x576c74cf243d in /usr/bin/python)
frame #15: PyRun_SimpleStringFlags + 0x72 (0x576c74ce0e02 in /usr/bin/python)
frame #16: Py_RunMain + 0x375 (0x576c74cdfcf5 in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x576c74cb5bcd in /usr/bin/python)
frame #18: <unknown function> + 0x29d90 (0x792192991d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x792192991e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x25 (0x576c74cb5ac5 in /usr/bin/python)
I attempted to debug further with `CUDA_LAUNCH_BLOCKING=1`, but was unable to identify the root cause.
I would welcome any advice or thoughts on resolving this issue. Thank you for your time and help!
I'm still not sure what the core reason is. We used to have similar termination problems in intermediate versions, but we fixed those issues during the development process.
For a workaround, have you tried setting the number of epochs (`--num-epochs`) to a number larger than 1? The issue might be caused by a cleanup timing problem, though the training itself may not have been affected.
Also, try delaying the termination of the Python process by inserting `time.sleep(1)` in the Python script (test_end_to_end.py).
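For example, something roughly like this at the end of the spawned worker; the worker name below is only a placeholder for whatever the script actually spawns:

# Rough sketch of the workaround: let each spawned process linger briefly after
# training so the native backend has a chance to finish its own teardown first.
import time
import torch.multiprocessing as mp

def train_worker(rank):      # placeholder for the worker that test_end_to_end.py spawns
    # ... the existing training loop runs here ...
    time.sleep(1)            # delay termination of the Python process

if __name__ == "__main__":
    mp.spawn(train_worker, nprocs=2, join=True)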
Thank you very much for your suggestion, it was really helpful!
I tried setting the number of epochs to a value greater than 1, and also inserting `time.sleep(1)` in test_end_to_end.py, but neither of these methods worked.
However, I completely agree with your statement that "the issue might be caused by a cleanup timing problem, though the training itself may not have been affected." Therefore, I decided to continue with the subsequent ImageNet experiments for now.
After finishing the training, I saved the log file (/tmp/time_log_xxx_xxx.json) and tried to analyze it. By reading your clear code, I was able to understand most of the log information. However, I have one small question: what does the `Event` named `Total` represent, and how is it used? The duration value for this event is significantly larger than the sum of all other events, and it doesn't seem to correspond to the total time of the entire `Task`.
Here’s a snippet of the parsed log:
{
"0@0@weight": {
"CRIT_PATH_compress": 0.048,
"CRIT_PATH_gather_0": 0.01,
"CRIT_PATH_gather_1": 0.116,
"CRIT_PATH_optimize_raw": 0.005,
"CRIT_PATH_save_residual": 0.007,
"Compress": 1.429,
"CpuGather": 0.139,
"CpuGatherBarrier": 0.013,
"CpuOptimize": 0.067,
"CpuOptimizeBarrier": 0.006,
"D2HCopy": 1.521,
"D2HCopyBarrier": 82.522,
"GradExchange": 0.001,
"H2DCopy": 0.042,
"H2DCopyPre": 0.028,
"Total": 5281.933
}
}
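For reference, this is roughly the quick check I ran to compare `Total` against the other events (the file name is just a placeholder, and the structure is assumed to match the snippet above):

# Quick sanity check: compare "Total" against the sum of the other events for
# each entry, assuming the parsed structure shown above, i.e.
# {task_key: {event_name: duration, ...}, ...}.
import json

with open("parsed_time_log.json") as f:   # placeholder file name
    log = json.load(f)

for task, events in log.items():
    total = events.get("Total", 0.0)
    others = sum(v for name, v in events.items() if name != "Total")
    print(f"{task}: Total={total:.3f}, sum of other events={others:.3f}")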
Lastly, I would like to express my sincere gratitude for your help once again.
The `Total` metric measures the time elapsed between its starting point and its ending point in the code.
The value may appear larger during the first iteration due to some initialization overheads, so measurements from the first iteration are not very meaningful.
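If you aggregate the log, it is probably safest to drop the first iteration entirely. A rough sketch, assuming the first field of each key is the iteration index (as it appears to be in your snippet); the file name is again just a placeholder:

# Average each event's duration across entries, skipping iteration 0, whose
# "Total" includes the one-time initialization overhead.
import json
from collections import defaultdict

with open("parsed_time_log.json") as f:   # placeholder file name
    log = json.load(f)

sums, counts = defaultdict(float), defaultdict(int)
for task, events in log.items():
    if task.split("@")[0] == "0":         # skip the first iteration
        continue
    for name, duration in events.items():
        sums[name] += duration
        counts[name] += 1

for name in sorted(sums):
    print(f"{name}: average {sums[name] / counts[name]:.3f}")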
Got it, thank you very much!
Hello,
First, I would like to express my gratitude for open-sourcing this amazing project. I encountered the following error messages while trying to reproduce the results according to the "Run test script" section in the README, which ultimately led to a program crash.
Initially, I suspected that insufficient memory might be the cause, and I tried the following solutions:
Unfortunately, these attempts did not resolve the issue.
Experimental Setup
The experiment is set up on two nodes, each with two GPUs. Specific details are as follows:
Could you please assist me with this issue?
Thank you very much!