mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Marian error in translate-corpus #679

Open eu9ene opened 1 week ago

eu9ene commented 1 week ago

https://firefox-ci-tc.services.mozilla.com/tasks/AsVG4ziaTMKjYq6Z9fhgwg/runs/0/logs/public/logs/live.log
https://firefox-ci-tc.services.mozilla.com/tasks/eScwZPfjS_yHCm6Gf4ufng/runs/0/logs/public/logs/live.log

[task 2024-06-17T06:16:05.464Z] [2024-06-17 06:16:05] [config] workspace: 12000
[task 2024-06-17T06:16:05.464Z] [2024-06-17 06:16:05] [config] Loaded model has been created with Marian v1.12.14 2d067af 2024-02-16 11:44:13 -0500
[task 2024-06-17T06:16:05.466Z] [2024-06-17 06:16:05] [data] Loading SentencePiece vocabulary from file /home/ubuntu/tasks/task_171860488402365/fetches/vocab.spm
[task 2024-06-17T06:16:05.516Z] [2024-06-17 06:16:05] [data] Loading SentencePiece vocabulary from file /home/ubuntu/tasks/task_171860488402365/fetches/vocab.spm
[task 2024-06-17T06:16:05.563Z] [2024-06-17 06:16:05] Loading model from /home/ubuntu/tasks/task_171860488402365/fetches/model1/final.model.npz.best-chrf.npz
[task 2024-06-17T06:16:07.806Z] [2024-06-17 06:16:07] Loading model from /home/ubuntu/tasks/task_171860488402365/fetches/model2/final.model.npz.best-chrf.npz
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.737Z] Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.794Z] 
[task 2024-06-17T06:16:09.794Z] [CALL STACK]
[task 2024-06-17T06:16:09.794Z] [0x64599fabf1af]    marian::CurandRandomGenerator::  CurandRandomGenerator  (unsigned long,  marian::DeviceId) + 0x83f
[task 2024-06-17T06:16:09.794Z] [0x64599fabf849]    marian::  createRandomGenerator  (unsigned long,  marian::DeviceId) + 0x69
[task 2024-06-17T06:16:09.794Z] [0x64599fab8f40]    marian::  BackendByDeviceId  (marian::DeviceId,  unsigned long) + 0xa0
[task 2024-06-17T06:16:09.794Z] [0x64599f7b44f0]    marian::ExpressionGraph::  setDevice  (marian::DeviceId,  std::shared_ptr<marian::Device>) + 0x80
[task 2024-06-17T06:16:09.794Z] [0x64599f639f98]    marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}::  operator()  (marian::DeviceId,  unsigned long) const + 0x1d8
[task 2024-06-17T06:16:09.794Z] [0x64599f63b089]    marian::ThreadPool::enqueue<marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long>(marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long&&)::{lambda()#1}::  operator()  () const + 0x39
[task 2024-06-17T06:16:09.794Z] [0x64599f63bea0]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long>(marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long&&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x30
[task 2024-06-17T06:16:09.794Z] [0x64599f5ea48d]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[task 2024-06-17T06:16:09.794Z] [0x7383d3099ee8]                                                       + 0x99ee8
[task 2024-06-17T06:16:09.794Z] [0x64599f5eb720]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long>(marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long&&)::{lambda()#1},std::allocator<int>,void ()>::  _M_run  () + 0xf0
[task 2024-06-17T06:16:09.794Z] [0x64599f5ecd65]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1a5
[task 2024-06-17T06:16:09.794Z] [0x7383d34dc253]                                                       + 0xdc253
[task 2024-06-17T06:16:09.794Z] [0x7383d3094ac3]                                                       + 0x94ac3
[task 2024-06-17T06:16:09.794Z] [0x7383d3126850]                                                       + 0x126850
[task 2024-06-17T06:16:09.794Z] 
[task 2024-06-17T06:16:10.261Z] /home/ubuntu/tasks/task_171860488402365/checkouts/vcs/pipeline/translate/translate-nbest.sh: line 28: 37694 Aborted                 (core dumped) "${MARIAN}/marian-decoder" -c decoder.yml -m "${models[@]}" -v "${vocab}" "${vocab}" -i "${input}" -o "${input}.nbest" --log "${input}.log" --n-best -d ${GPUS} -w "${WORKSPACE}"
[fetches 2024-06-17T06:16:10.262Z] removing /home/ubuntu/tasks/task_171860488402365/fetches
[fetches 2024-06-17T06:16:12.583Z] finished
[taskcluster 2024-06-17T06:16:12.594Z]    Exit Code: 134

eu9ene commented 1 week ago

Also in translate-mono:

https://firefox-ci-tc.services.mozilla.com/tasks/RkffbHjwRbGnqyg9DFPvSw
https://firefox-ci-tc.services.mozilla.com/tasks/W8LNkfl2SZ-KGTC_1Kvnsw
https://firefox-ci-tc.services.mozilla.com/tasks/SiTasxPlRImN5a3gsonI7g
https://firefox-ci-tc.services.mozilla.com/tasks/ZBHiCknmTayXXM2pOqvVPQ

eu9ene commented 1 week ago

The tasks pass on rerun, so the failure is intermittent. It's probably something with randomization or infrastructure. We should add automatic restarts for these tasks.
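One possible shape for the automatic restart, as a rough sketch only: a small bash wrapper that reruns a command when it exits with a specific status. The log above shows the decoder aborting with exit code 134 (128 + SIGABRT), so that is the status retried here. The function name `retry_on_status` and the `RETRY_DELAY_SECONDS` variable are made up for illustration and are not part of the pipeline; a retry configured at the Taskcluster task level may well be the better fix.

```shell
#!/bin/bash
# Hypothetical helper (not part of the pipeline): rerun a command while it
# keeps exiting with one specific, retryable status.
# Usage: retry_on_status <max_attempts> <retryable_status> <command> [args...]
retry_on_status() {
  local max_attempts="$1" retryable="$2"
  shift 2
  local attempt=1 status

  while true; do
    status=0
    "$@" || status=$?

    # Success: stop immediately.
    if [ "$status" -eq 0 ]; then
      return 0
    fi

    # Any other failure, or no attempts left: propagate the exit status.
    if [ "$status" -ne "$retryable" ] || [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi

    echo "Attempt ${attempt}/${max_attempts} exited with ${status}; retrying" >&2
    attempt=$((attempt + 1))
    # RETRY_DELAY_SECONDS is an assumed knob; defaults to 10 seconds.
    sleep "${RETRY_DELAY_SECONDS:-10}"
  done
}

# Possible use in translate-nbest.sh (arguments elided here, see the log above):
#   retry_on_status 3 134 "${MARIAN}/marian-decoder" -c decoder.yml ...
```

Retrying only on 134 keeps genuine failures (bad config, missing files) visible instead of masking them behind repeated attempts.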