kaist-ina / stellatrain

Official Github repository for the SIGCOMM '24 paper "Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs"

Segmentation Fault During Test Script Execution #1

Open kuilz opened 1 month ago

kuilz commented 1 month ago

Hello,

First, I would like to express my gratitude for open-sourcing this amazing project. While trying to reproduce the results by following the "Run test script" section of the README, the program crashed with the following error messages.

[gpu2-System-Product-Name:306  :0:306] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:305  :0:305] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
32 0x000000000022aac5 _start()  ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV

Initially, I suspected that insufficient memory might be the cause, and I tried the following solutions:

  1. Reduced the batch size.
  2. Ran Docker with increased memory and shared memory settings.

Unfortunately, these attempts did not resolve the issue.


Experimental Setup

The experiment is set up on two nodes, each with two GPUs. Specific details are as follows:


Could you please assist me with this issue?

Thank you very much!

wjuni commented 3 weeks ago

Have you tried building the container in your environment? The pre-built container might not be compatible with every environment.

kuilz commented 3 weeks ago

Thank you very much for your suggestion.

I have tried building the image in my own environment with the following steps:

  1. Navigate to the root directory of the project (where the Dockerfile is located).
  2. Run the following command:
    docker build -t my_st_img .

However, I still encountered the same error, which seems to occur while synchronizing the backend. The detailed error output is shown below:

Received from master : key=56@bias
Stashing data for 56@bias
Barrier OK
Barrier OK
 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 127/128 [00:50<00:00,  2.63it/s, loss=1.97e+3]Synchronize Backend, num_futures: 16099
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:50<00:00,  2.52it/s, loss=1.98e+3]Synchronize Backend, num_futures: 17050
Terminating FasterDPEngine
Terminating FasterDPEngine
Writing statistics at /tmp/time_log_20241106054943_1107.json, last event was 127@53@conv2.weight
Writing statistics at /tmp/time_log_20241106054943_1106.json, last event was 127@53@conv2.weight
[gpu2-System-Product-Name:1107 :0:1107] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:1106 :0:1106] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:   1106) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004ac11 c10::cuda::CUDAKernelLaunchRegistry::has_failed()  ???:0
 2 0x000000000004ba9d c10::cuda::c10_cuda_check_implementation()  ???:0
 3 0x0000000000014bdc c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events()  CUDACachingAllocator.cpp:0
 4 0x0000000000017fd8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free()  CUDACachingAllocator.cpp:0
 5 0x000000000001838c c10::cuda::CUDACachingAllocator::Native::local_raw_delete()  :0
 6 0x000000000046eb2a c10::StorageImpl::~StorageImpl()  :0
 7 0x0000000000044ead c10::TensorImpl::~TensorImpl()  TensorImpl.cpp:0
 8 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:291
 9 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:274
10 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:370
11 0x00000000000515aa at::TensorBase::~TensorBase()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBase.h:80
12 0x00000000000515aa at::Tensor::~Tensor()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:90
13 0x00000000000515aa std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>::~pair()  /usr/include/c++/11/bits/stl_pair.h:211
14 0x00000000000515aa __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/ext/new_allocator.h:168
15 0x00000000000515aa std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/bits/alloc_traits.h:535
16 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_node()  /usr/include/c++/11/bits/hashtable_policy.h:1894
17 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_nodes()  /usr/include/c++/11/bits/hashtable_policy.h:1916
18 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()  /usr/include/c++/11/bits/hashtable.h:2320
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()  /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map()  /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine()  /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv()  ???:0
23 0x0000000000045610 exit()  ???:0
24 0x0000000000279d5b Py_Exit()  ???:0
25 0x000000000026750f PyGC_Collect()  ???:0
26 0x000000000026743d PyErr_PrintEx()  ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags()  ???:0
28 0x0000000000254cf5 Py_RunMain()  ???:0
29 0x000000000022abcd Py_BytesMain()  ???:0
30 0x0000000000029d90 __libc_init_first()  ???:0
31 0x0000000000029e40 __libc_start_main()  ???:0
32 0x000000000022aac5 _start()  ???:0
=================================
==== backtrace (tid:   1107) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004ac11 c10::cuda::CUDAKernelLaunchRegistry::has_failed()  ???:0
 2 0x000000000004ba9d c10::cuda::c10_cuda_check_implementation()  ???:0
 3 0x0000000000014bdc c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events()  CUDACachingAllocator.cpp:0
 4 0x0000000000017fd8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free()  CUDACachingAllocator.cpp:0
 5 0x000000000001838c c10::cuda::CUDACachingAllocator::Native::local_raw_delete()  :0
 6 0x000000000046eb2a c10::StorageImpl::~StorageImpl()  :0
 7 0x0000000000044ead c10::TensorImpl::~TensorImpl()  TensorImpl.cpp:0
 8 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:291
 9 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:274
10 0x00000000000515aa c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr()  /usr/local/lib/python3.10/dist-packages/torch/include/c10/util/intrusive_ptr.h:370
11 0x00000000000515aa at::TensorBase::~TensorBase()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBase.h:80
12 0x00000000000515aa at::Tensor::~Tensor()  /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:90
13 0x00000000000515aa std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>::~pair()  /usr/include/c++/11/bits/stl_pair.h:211
14 0x00000000000515aa __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/ext/new_allocator.h:168
15 0x00000000000515aa std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >()  /usr/include/c++/11/bits/alloc_traits.h:535
16 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_node()  /usr/include/c++/11/bits/hashtable_policy.h:1894
17 0x00000000000515aa std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, true> > >::_M_deallocate_nodes()  /usr/include/c++/11/bits/hashtable_policy.h:1916
18 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()  /usr/include/c++/11/bits/hashtable.h:2320
19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()  /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map()  /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine()  /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv()  ???:0
23 0x0000000000045610 exit()  ???:0
24 0x0000000000279d5b Py_Exit()  ???:0
25 0x000000000026750f PyGC_Collect()  ???:0
26 0x000000000026743d PyErr_PrintEx()  ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags()  ???:0
28 0x0000000000254cf5 Py_RunMain()  ???:0
29 0x000000000022abcd Py_BytesMain()  ???:0
30 0x0000000000029d90 __libc_init_first()  ???:0
31 0x0000000000029e40 __libc_start_main()  ???:0
32 0x000000000022aac5 _start()  ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
stellatrain@gpu2-System-Product-Name:/home/stellatrain/explore-dp# 

Any insights or suggestions you could provide would be greatly appreciated.

kuilz commented 3 weeks ago

Here are some additional details about the bug.

  1. Regardless of the number of epochs, every epoch finished.

    • e.g., epoch=1

  2. After all epochs finished, the synchronization step failed.

    • According to the code, synchronization is executed immediately after all epochs finish.
kuilz commented 2 weeks ago

Looking at the complete error messages above, I noticed the following part of the backtrace:

19 0x00000000000515aa std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()  /usr/include/c++/11/bits/hashtable.h:1532
20 0x000000000004beaf std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> > >::~unordered_map()  /usr/include/c++/11/bits/unordered_map.h:102
21 0x000000000004beaf FasterDpEngine::~FasterDpEngine()  /home/stellatrain/explore-dp/backend/src/engine/core.cpp:61
22 0x0000000000045495 secure_getenv()  ???:0
23 0x0000000000045610 exit()  ???:0
24 0x0000000000279d5b Py_Exit()  ???:0
25 0x000000000026750f PyGC_Collect()  ???:0
26 0x000000000026743d PyErr_PrintEx()  ???:0
27 0x0000000000255e02 PyRun_SimpleStringFlags()  ???:0
28 0x0000000000254cf5 Py_RunMain()  ???:0
29 0x000000000022abcd Py_BytesMain()  ???:0
30 0x0000000000029d90 __libc_init_first()  ???:0
31 0x0000000000029e40 __libc_start_main()  ???:0
32 0x000000000022aac5 _start()  ???:0
=================================

This section of the backtrace suggests that the segmentation fault is triggered in the destructor of FasterDpEngine, specifically while cleaning up the Tensor objects held in an unordered_map.

Possible causes:

  1. Some Tensors may have been prematurely released elsewhere, resulting in a double-free issue.
  2. Race conditions during resource cleanup in a multithreaded environment.
  3. Incorrect memory release order for some CUDA Tensors when the program exits.

Attempted solutions:

  1. Explicitly call reset() on smart pointers after stopping threads to release them.
  2. Add cudaDeviceSynchronize() to ensure all CUDA operations are completed.
  3. Clean up each tensor-holding map in order with mutex protection.
  4. Explicitly clear other components that may hold references to Tensors: compressor, optimizer, thread pool.

Updated code:

FasterDpEngine::~FasterDpEngine() {
    std::cout << "Terminating FasterDPEngine" << std::endl;

    finished_ = true;

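    // Wake up and join every background thread so none of them touch tensors during teardown.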
    if (barrier_manager_thread_ != nullptr) {
        pthread_cond_broadcast(&shared_props_->barrier_ipc_cond_);
        barrier_manager_thread_->join();
        barrier_manager_thread_.reset();
    }

    if (chore_manager_thread_ != nullptr) {
        finished_cond_.notify_all();
        chore_manager_thread_->join();
        chore_manager_thread_.reset();
    }

    if (model_complete_manager_thread_ != nullptr) {
        layer_model_completed_version_map_cond_.notify_all();
        model_complete_manager_thread_->join();
        model_complete_manager_thread_.reset();
    }

    if (cpu_shmem_return_manager_thread_ != nullptr) {
        cpu_shmem_use_map_cond_.notify_all();
        cpu_shmem_return_manager_thread_->join();
        cpu_shmem_return_manager_thread_.reset();
    }

    if (backward_delegate_thread_ != nullptr) {
        backward_delegate_cond_.notify_all();
        backward_delegate_thread_->join();
        backward_delegate_thread_.reset();
    }

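    // Ensure all outstanding CUDA work has completed before freeing GPU tensors.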
    cudaDeviceSynchronize();

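    // Clear the tensor-holding maps under their mutexes while the CUDA context is still alive.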
    {
        std::lock_guard<std::mutex> lock(map_gpu_param_tensor_mutex_);
        map_gpu_param_tensor_.clear();
    }

    {
        std::lock_guard<std::mutex> lock(map_gpu_grad_tensor_mutex_);
        map_gpu_grad_tensor_.clear();
    }

    {
        std::lock_guard<std::mutex> lock(map_cpu_param_tensor_mutex_);
        map_cpu_param_tensor_.clear();
    }

#if ENABLE_STAT
    stat_export();
#endif

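    // Release remaining components that may still hold references to tensors.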
    compressor_.reset();
    sparse_optimizer_.reset();
    thread_pool_.reset();
    comm_manager_.reset();
    lst_futures_.reset();
    shm_manager_.reset();
}

However, the above modifications led to a new error: CUDA error: driver shutting down

Barrier OK
 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                         | 3/4 [00:06<00:02,  2.86s/it, loss=258]Synchronize Backend, num_futures: 20044
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.80s/it, loss=503]Synchronize Backend, num_futures: 22528
Terminating FasterDPEngine
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x79218efb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x79218ef6b69d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x792191e20e12 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x14bdc (0x792191de9bdc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x17fd8 (0x792191decfd8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1838c (0x792191ded38c in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x46eb2a (0x792148496b2a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x79218ef8eead in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() + 0xda (0x7920e126e3aa in /home/stellatrain/explore-dp/backend/build/libfasterdp_core.so)
frame #9: FasterDpEngine::~FasterDpEngine() + 0x199 (0x7920e1266b29 in /home/stellatrain/explore-dp/backend/build/libfasterdp_core.so)
frame #10: <unknown function> + 0x45495 (0x7921929ad495 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #11: on_exit + 0 (0x7921929ad610 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x279d5b (0x576c74d04d5b in /usr/bin/python)
frame #13: <unknown function> + 0x26750f (0x576c74cf250f in /usr/bin/python)
frame #14: PyErr_PrintEx + 0x1d (0x576c74cf243d in /usr/bin/python)
frame #15: PyRun_SimpleStringFlags + 0x72 (0x576c74ce0e02 in /usr/bin/python)
frame #16: Py_RunMain + 0x375 (0x576c74cdfcf5 in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x576c74cb5bcd in /usr/bin/python)
frame #18: <unknown function> + 0x29d90 (0x792192991d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x792192991e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x25 (0x576c74cb5ac5 in /usr/bin/python)

I attempted to debug further with CUDA_LAUNCH_BLOCKING=1 but was unable to identify the root cause.

I would welcome any advice or thoughts on resolving this issue. Thank you for your time and help!

wjuni commented 2 weeks ago

I'm still not sure what the core reason is. We used to have similar termination problems in intermediate versions, but we fixed those issues during development. As a workaround, have you tried setting the number of epochs (--num-epochs) to a value larger than 1? The issue might be caused by a cleanup timing problem, though the training itself may not have been affected. Also, try delaying the termination of the Python process by inserting time.sleep(1) in the Python script (test_end_to_end.py).
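
For reference, a minimal sketch of the time.sleep(1) workaround, assuming test_end_to_end.py drives the workers through mp.spawn as the traceback above shows (the actual script layout may differ):

import time
import torch.multiprocessing as mp

def worker(rank):
    # placeholder for the per-process training loop run by the test script
    print(f"rank {rank} finished training")

if __name__ == "__main__":
    # join=True blocks until every spawned worker process has exited
    mp.spawn(worker, nprocs=2, join=True)
    # Delay interpreter shutdown so backend cleanup (the FasterDpEngine
    # destructor) does not race with CUDA driver teardown.
    time.sleep(1)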

kuilz commented 2 weeks ago

Thank you very much for your suggestion, it was really helpful!

I tried setting the number of epochs to a value greater than 1, and also inserting time.sleep(1) in test_end_to_end.py, but neither of these methods worked.

However, I completely agree with your statement that "the issue might be caused by a cleanup timing problem, though the training itself may not have been affected." Therefore, I decided to continue with the subsequent ImageNet experiments for now.

After finishing the training, I saved the log file (/tmp/time_log_xxx_xxx.json) and tried to analyze it. By reading your clear code, I was able to understand most of the log information. However, I have one small question: what does the Event named Total represent, and how is it used? The duration value for this event is significantly larger than the sum of all other events, and it doesn't seem to correspond to the total time of the entire Task.

Here’s a snippet of the parsed log:

{
    "0@0@weight": {
        "CRIT_PATH_compress": 0.048,
        "CRIT_PATH_gather_0": 0.01,
        "CRIT_PATH_gather_1": 0.116,
        "CRIT_PATH_optimize_raw": 0.005,
        "CRIT_PATH_save_residual": 0.007,
        "Compress": 1.429,
        "CpuGather": 0.139,
        "CpuGatherBarrier": 0.013,
        "CpuOptimize": 0.067,
        "CpuOptimizeBarrier": 0.006,
        "D2HCopy": 1.521,
        "D2HCopyBarrier": 82.522,
        "GradExchange": 0.001,
        "H2DCopy": 0.042,
        "H2DCopyPre": 0.028,
        "Total": 5281.933
    }
}

Lastly, I would like to express my sincere gratitude for your help once again.

wjuni commented 2 weeks ago

The Total metric measures the time elapsed between the starting point in

https://github.com/kaist-ina/stellatrain/blob/6dccb2dad957598605fc9138c7160ae5809f5e9c/backend/src/engine/modules/d2h_copy.cpp#L24

and the ending point in

https://github.com/kaist-ina/stellatrain/blob/6dccb2dad957598605fc9138c7160ae5809f5e9c/backend/src/engine/modules/h2d_copy.cpp#L65

The value may appear larger during the first iteration due to some initialization overheads, so measurements from the first iteration are not very meaningful.
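
For anyone analyzing these logs, here is a rough sketch (not part of the repository) that averages per-event durations while skipping the first iteration, assuming the leading field of each key (e.g., 0@0@weight, 127@53@conv2.weight) is the iteration index:

import json
from collections import defaultdict

def average_events(path):
    """Average each event's duration over all iterations except the first."""
    with open(path) as f:
        log = json.load(f)
    sums, counts = defaultdict(float), defaultdict(int)
    for key, events in log.items():
        iteration = int(key.split("@")[0])
        if iteration == 0:
            continue  # first iteration includes one-time initialization overhead
        for name, duration in events.items():
            sums[name] += duration
            counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}

if __name__ == "__main__":
    print(average_events("/tmp/time_log_20241106054943_1106.json"))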

kuilz commented 2 weeks ago

Got it, thank you very much!