OpusTrainer can produce incorrect alignments, breaking student training with guided alignments

gregtatum commented 6 months ago

I've found at least one bug in the implementation:

https://github.com/hplt-project/OpusTrainer/issues/53

eu9ene commented 6 months ago

I think I'm hitting a similar problem, although my setup is quite different from yours. I used eflomal to align whitespace-tokenized (in fact, not tokenized at all) text, and OpusTrainer is then supposed to retokenize it using a SentencePiece model. I wonder whether this is something new or the same bug you encountered with our old alignment scheme and OpusTrainer with no modifiers.

https://firefox-ci-tc.services.mozilla.com/tasks/e4bBbBRbSZmTDtKsPJZMkQ/runs/0/logs/public/logs/live.log

[task 2024-03-16T14:21:13.879Z] [2024-03-16 14:21:13] [memory] Reserving 32 MB, device gpu0
[task 2024-03-16T14:21:13.883Z] [2024-03-16 14:21:13] Ep. 1 : Up. 1 : Sen. 760 : Cost 0.87012815 : Time 431.98s : 158.72 words/s : gNorm 3.8305 : L.r. 1.8750e-08
[task 2024-03-16T14:21:14.166Z] [2024-03-16 14:21:14] Ep. 1 : Up. 2 : Sen. 10,520 : Cost 4.55021906 : Time 0.28s : 448754.57 words/s : gNorm 4.8969 : L.r. 3.7500e-08
[task 2024-03-16T14:21:14.475Z] [2024-03-16 14:21:14] Ep. 1 : Up. 3 : Sen. 16,928 : Cost 3.01045871 : Time 0.31s : 435164.87 words/s : gNorm 4.6959 : L.r. 5.6250e-08
[task 2024-03-16T14:21:14.747Z] [2024-03-16 14:21:14] Ep. 1 : Up. 4 : Sen. 18,480 : Cost 1.29863644 : Time 0.27s : 331105.01 words/s : gNorm 4.3090 : L.r. 7.5000e-08
[task 2024-03-16T14:21:15.065Z] [2024-03-16 14:21:15] Ep. 1 : Up. 5 : Sen. 21,950 : Cost 2.17084908 : Time 0.32s : 469510.57 words/s : gNorm 4.5838 : L.r. 9.3750e-08
[task 2024-03-16T14:21:15.376Z] [2024-03-16 14:21:15] Ep. 1 : Up. 6 : Sen. 23,502 : Cost 1.19824779 : Time 0.31s : 354549.51 words/s : gNorm 4.3286 : L.r. 1.1250e-07
[task 2024-03-16T14:21:15.616Z] [2024-03-16 14:21:15] Ep. 1 : Up. 7 : Sen. 25,734 : Cost 1.84143591 : Time 0.24s : 306548.30 words/s : gNorm 4.2291 : L.r. 1.3125e-07
[task 2024-03-16T14:21:15.869Z] [2024-03-16 14:21:15] Ep. 1 : Up. 8 : Sen. 26,542 : Cost 0.93719184 : Time 0.25s : 271700.20 words/s : gNorm 4.0934 : L.r. 1.5000e-07
[task 2024-03-16T14:21:16.090Z] [2024-03-16 14:21:16] Ep. 1 : Up. 9 : Sen. 28,278 : Cost 1.63907695 : Time 0.22s : 298939.48 words/s : gNorm 4.0213 : L.r. 1.6875e-07
[task 2024-03-16T14:21:16.369Z] [2024-03-16 14:21:16] Ep. 1 : Up. 10 : Sen. 29,142 : Cost 0.90442389 : Time 0.28s : 288550.01 words/s : gNorm 4.0063 : L.r. 1.8750e-07
[task 2024-03-16T14:32:03.387Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:03.393Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:58.725Z] [2024-03-16 14:32:58] Ep. 1 : Up. 1000 : Sen. 3,819,068 : Cost 1.48244727 : Time 702.36s : 144022.15 words/s : gNorm 1.3209 : L.r. 1.8750e-05
[task 2024-03-16T14:43:24.199Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:43:24.205Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:07.294Z] [2024-03-16 14:47:07] Ep. 1 : Up. 2000 : Sen. 7,592,507 : Cost 1.09335911 : Time 848.57s : 121878.65 words/s : gNorm 1.4593 : L.r. 3.7500e-05
[task 2024-03-16T14:47:47.260Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:47.266Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Segmentation fault
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /builds/worker/fetches/marian-source/src/common/logging.cpp:130
[task 2024-03-16T14:49:55.673Z] 
[task 2024-03-16T14:49:55.673Z] [CALL STACK]
[task 2024-03-16T14:49:55.673Z] [0x5616e0584fc5]                                                       + 0x519fc5
[task 2024-03-16T14:49:55.673Z] [0x5616e058520f]                                                       + 0x51a20f
[task 2024-03-16T14:49:55.673Z] [0x7f8bca042520]                                                       + 0x42520
[task 2024-03-16T14:49:55.673Z] [0x5616e065e0f8]    marian::data::CorpusBase::  addAlignmentsToBatch  (std::shared_ptr<marian::data::CorpusBatch>,  std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x438
[task 2024-03-16T14:49:55.673Z] [0x5616e0673562]    marian::data::Corpus::  toBatch  (std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x1252
[task 2024-03-16T14:49:55.674Z] [0x5616e055ada4]    marian::data::BatchGenerator<marian::data::CorpusBase>::  fetchBatches  () + 0x1204
[task 2024-03-16T14:49:55.674Z] [0x5616e055bab3]    marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1}::  operator()  () const + 0x33
[task 2024-03-16T14:49:55.674Z] [0x5616e055ca21]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>::  _M_invoke  (std::_Any_data const&) + 0x51
[task 2024-03-16T14:49:55.674Z] [0x5616e04861fd]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[task 2024-03-16T14:49:55.674Z] [0x7f8bca099ee8]                                                       + 0x99ee8
[task 2024-03-16T14:49:55.674Z] [0x5616e048ad70]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::  _M_run  () + 0xf0
[task 2024-03-16T14:49:55.674Z] [0x5616e048bcd5]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1a5
[task 2024-03-16T14:49:55.674Z] [0x7f8bca4dc253]                                                       + 0xdc253
[task 2024-03-16T14:49:55.674Z] [0x7f8bca094ac3]                                                       + 0x94ac3
[task 2024-03-16T14:49:55.674Z] [0x7f8bca126850]                                                       + 0x126850
[task 2024-03-16T14:49:55.674Z] 
[task 2024-03-16T14:49:56.803Z] [2024-03-16 14:49:56] [Trainer] [INFO] trainer stopped reading input
[fetches 2024-03-16T14:50:00.102Z] removing /home/ubuntu/tasks/task_171059821904953/fetches
[fetches 2024-03-16T14:50:02.054Z] finished
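
The "Out-of-bound alignment pairs" warnings in the log above suggest that some alignment indices point past the end of a sentence. A minimal sketch of a pre-training check for this is below. It assumes each corpus line is a TSV row of source, target, and Pharaoh-style alignments (`0-0 1-2 ...`) indexed over whitespace tokens; the function names are hypothetical and this is not OpusTrainer's own validation code.

```python
def parse_alignment(field):
    """Parse a Pharaoh-style alignment string such as '0-0 1-2' into index pairs."""
    pairs = []
    for item in field.split():
        src_idx, trg_idx = item.split("-", 1)
        pairs.append((int(src_idx), int(trg_idx)))
    return pairs


def out_of_bound_pairs(src, trg, alignment):
    """Return the alignment pairs that fall outside the whitespace token ranges."""
    n_src = len(src.split())
    n_trg = len(trg.split())
    return [
        (s, t)
        for s, t in parse_alignment(alignment)
        if not (0 <= s < n_src and 0 <= t < n_trg)
    ]


def check_line(line):
    """Check one 'source<TAB>target<TAB>alignment' line (assumed layout)."""
    src, trg, alignment = line.rstrip("\n").split("\t")
    return out_of_bound_pairs(src, trg, alignment)


if __name__ == "__main__":
    # Second pair refers to source token 2, but the source has only 2 tokens.
    print(check_line("a b\tc d\t0-0 2-1"))
```

Running a check like this over the corpus before training would show whether the bad pairs are already present in the aligner output or only appear after retokenization.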

eu9ene commented 5 months ago

OK, it failed with only the Tags modifier enabled in my branch. We can't remove it in the inline noise branch because it's supposed to remap the alignments. https://firefox-ci-tc.services.mozilla.com/tasks/dTCAmrqKQgCg3e6hSEiKVA/runs/0/logs/public/logs/live.log

datasets:
  original: /home/ubuntu/tasks/task_171105412365522/fetches/corpus.lten.tsv # Original parallel corpus

stages:
  - train

train:
  - original 1.0
  - until original inf # General training until marian early stops

modifiers:
#- UpperCase: 0.07 # Apply randomly to 7% of sentences
#- TitleCase: 0.05
#- Typos: 0.05
## inserts new noise sentences
#- Noise: 0.0005
#  min_word_length: 2 # Minimum word length for each word in the noisy sentence
#  max_word_length: 5 # Maximum word length for each word in the noisy sentence
#  max_words: 6 # Maximum number of words in each noisy sentence
# generates inline noise (emojis etc.) matching position in source and target using alignments
# spm_vocab argument: retokenize alignments from spaces to Sentencepiece subwords and feed to Marian
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
  augment: 1
  spm_vocab: /home/ubuntu/tasks/task_171105412365522/fetches/vocab.spm

seed: 1111
# parallel sentences + token alignments
num_fields: 3
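
For illustration, the kind of remapping that the `spm_vocab` comment in the config describes (whitespace-token alignments converted to subword alignments) can be sketched roughly as follows. This is an assumption-laden sketch, not OpusTrainer's implementation: `toy_split` is a hypothetical stand-in for a real SentencePiece model, words are split one at a time, and each word-level pair is expanded to every combination of the two words' pieces, which is only one of several possible strategies.

```python
from itertools import product


def toy_split(word, size=3):
    """Hypothetical stand-in for a subword model: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def word_to_piece_spans(words, split_word):
    """Map each word index to the range of subword indices it occupies."""
    spans = []
    offset = 0
    for word in words:
        n_pieces = len(split_word(word))
        spans.append(range(offset, offset + n_pieces))
        offset += n_pieces
    return spans


def remap_alignment(src, trg, pairs, split_word=toy_split):
    """Expand word-level alignment pairs to subword-level pairs."""
    src_spans = word_to_piece_spans(src.split(), split_word)
    trg_spans = word_to_piece_spans(trg.split(), split_word)
    remapped = []
    for s, t in pairs:
        # Reject pairs that do not refer to an existing word on either side.
        if not (0 <= s < len(src_spans) and 0 <= t < len(trg_spans)):
            raise ValueError("Out-of-bound alignment pairs")
        remapped.extend(product(src_spans[s], trg_spans[t]))
    return remapped


if __name__ == "__main__":
    print(remap_alignment("hello world", "labas pasauli", [(0, 0), (1, 1)]))
```

If the word-level input is already out of bounds, or the subword counts used for remapping differ from what the trainer later sees, the remapped indices can point past the end of the subword sequence, which is the kind of failure this thread is about.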

mozilla / firefox-translations-training

OpusTrainer can produce incorrect alignments, breaking student training with guided alignments #469