Closed by gregtatum 5 months ago
I think I'm hitting a similar problem. My setup is quite different from yours, though. I used eflomal to align space-tokenized (in fact, not tokenized at all) text, and OpusTrainer is then supposed to retokenize it using the SentencePiece model. I wonder whether this is something new or the same bug you encountered with our old alignment scheme and OpusTrainer with no modifiers.
[task 2024-03-16T14:21:13.879Z] [2024-03-16 14:21:13] [memory] Reserving 32 MB, device gpu0
[task 2024-03-16T14:21:13.883Z] [2024-03-16 14:21:13] Ep. 1 : Up. 1 : Sen. 760 : Cost 0.87012815 : Time 431.98s : 158.72 words/s : gNorm 3.8305 : L.r. 1.8750e-08
[task 2024-03-16T14:21:14.166Z] [2024-03-16 14:21:14] Ep. 1 : Up. 2 : Sen. 10,520 : Cost 4.55021906 : Time 0.28s : 448754.57 words/s : gNorm 4.8969 : L.r. 3.7500e-08
[task 2024-03-16T14:21:14.475Z] [2024-03-16 14:21:14] Ep. 1 : Up. 3 : Sen. 16,928 : Cost 3.01045871 : Time 0.31s : 435164.87 words/s : gNorm 4.6959 : L.r. 5.6250e-08
[task 2024-03-16T14:21:14.747Z] [2024-03-16 14:21:14] Ep. 1 : Up. 4 : Sen. 18,480 : Cost 1.29863644 : Time 0.27s : 331105.01 words/s : gNorm 4.3090 : L.r. 7.5000e-08
[task 2024-03-16T14:21:15.065Z] [2024-03-16 14:21:15] Ep. 1 : Up. 5 : Sen. 21,950 : Cost 2.17084908 : Time 0.32s : 469510.57 words/s : gNorm 4.5838 : L.r. 9.3750e-08
[task 2024-03-16T14:21:15.376Z] [2024-03-16 14:21:15] Ep. 1 : Up. 6 : Sen. 23,502 : Cost 1.19824779 : Time 0.31s : 354549.51 words/s : gNorm 4.3286 : L.r. 1.1250e-07
[task 2024-03-16T14:21:15.616Z] [2024-03-16 14:21:15] Ep. 1 : Up. 7 : Sen. 25,734 : Cost 1.84143591 : Time 0.24s : 306548.30 words/s : gNorm 4.2291 : L.r. 1.3125e-07
[task 2024-03-16T14:21:15.869Z] [2024-03-16 14:21:15] Ep. 1 : Up. 8 : Sen. 26,542 : Cost 0.93719184 : Time 0.25s : 271700.20 words/s : gNorm 4.0934 : L.r. 1.5000e-07
[task 2024-03-16T14:21:16.090Z] [2024-03-16 14:21:16] Ep. 1 : Up. 9 : Sen. 28,278 : Cost 1.63907695 : Time 0.22s : 298939.48 words/s : gNorm 4.0213 : L.r. 1.6875e-07
[task 2024-03-16T14:21:16.369Z] [2024-03-16 14:21:16] Ep. 1 : Up. 10 : Sen. 29,142 : Cost 0.90442389 : Time 0.28s : 288550.01 words/s : gNorm 4.0063 : L.r. 1.8750e-07
[task 2024-03-16T14:32:03.387Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:03.393Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:58.725Z] [2024-03-16 14:32:58] Ep. 1 : Up. 1000 : Sen. 3,819,068 : Cost 1.48244727 : Time 702.36s : 144022.15 words/s : gNorm 1.3209 : L.r. 1.8750e-05
[task 2024-03-16T14:43:24.199Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:43:24.205Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:07.294Z] [2024-03-16 14:47:07] Ep. 1 : Up. 2000 : Sen. 7,592,507 : Cost 1.09335911 : Time 848.57s : 121878.65 words/s : gNorm 1.4593 : L.r. 3.7500e-05
[task 2024-03-16T14:47:47.260Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:47.266Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Segmentation fault
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /builds/worker/fetches/marian-source/src/common/logging.cpp:130
[task 2024-03-16T14:49:55.673Z]
[task 2024-03-16T14:49:55.673Z] [CALL STACK]
[task 2024-03-16T14:49:55.673Z] [0x5616e0584fc5] + 0x519fc5
[task 2024-03-16T14:49:55.673Z] [0x5616e058520f] + 0x51a20f
[task 2024-03-16T14:49:55.673Z] [0x7f8bca042520] + 0x42520
[task 2024-03-16T14:49:55.673Z] [0x5616e065e0f8] marian::data::CorpusBase:: addAlignmentsToBatch (std::shared_ptr<marian::data::CorpusBatch>, std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x438
[task 2024-03-16T14:49:55.673Z] [0x5616e0673562] marian::data::Corpus:: toBatch (std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x1252
[task 2024-03-16T14:49:55.674Z] [0x5616e055ada4] marian::data::BatchGenerator<marian::data::CorpusBase>:: fetchBatches () + 0x1204
[task 2024-03-16T14:49:55.674Z] [0x5616e055bab3] marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1}:: operator() () const + 0x33
[task 2024-03-16T14:49:55.674Z] [0x5616e055ca21] std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>:: _M_invoke (std::_Any_data const&) + 0x51
[task 2024-03-16T14:49:55.674Z] [0x5616e04861fd] std::__future_base::_State_baseV2:: _M_do_set (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 0x2d
[task 2024-03-16T14:49:55.674Z] [0x7f8bca099ee8] + 0x99ee8
[task 2024-03-16T14:49:55.674Z] [0x5616e048ad70] std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>:: _M_run () + 0xf0
[task 2024-03-16T14:49:55.674Z] [0x5616e048bcd5] std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>:: _M_run () + 0x1a5
[task 2024-03-16T14:49:55.674Z] [0x7f8bca4dc253] + 0xdc253
[task 2024-03-16T14:49:55.674Z] [0x7f8bca094ac3] + 0x94ac3
[task 2024-03-16T14:49:55.674Z] [0x7f8bca126850] + 0x126850
[task 2024-03-16T14:49:55.674Z]
[task 2024-03-16T14:49:56.803Z] [2024-03-16 14:49:56] [Trainer] [INFO] trainer stopped reading input
[fetches 2024-03-16T14:50:00.102Z] removing /home/ubuntu/tasks/task_171059821904953/fetches
[fetches 2024-03-16T14:50:02.054Z] finished
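To narrow down where the "Out-of-bound alignment pairs" warnings come from, it may help to check the input corpus before OpusTrainer touches it. The following is a minimal sketch, not OpusTrainer's own check: it assumes each line has three tab-separated fields (source, target, Pharaoh-style alignment such as `0-0 1-2`) and that both sentences are split on whitespace. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_alignment(field: str) -> List[Tuple[int, int]]:
    """Parse a Pharaoh-style alignment string such as '0-0 1-2' into index pairs."""
    pairs = []
    for item in field.split():
        src, trg = item.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def out_of_bound_pairs(line: str) -> List[Tuple[int, int]]:
    """Return alignment pairs that point past the end of either sentence.

    Assumes a tab-separated line of source, target, alignment, with both
    sentences tokenized on whitespace (an assumption of this sketch).
    """
    source, target, alignment = line.rstrip("\n").split("\t")
    n_src = len(source.split())
    n_trg = len(target.split())
    return [
        (s, t)
        for s, t in parse_alignment(alignment)
        if not (0 <= s < n_src and 0 <= t < n_trg)
    ]


# Made-up example lines: two source words, two target words.
good = "a b\tx y\t0-0 1-1"
bad = "a b\tx y\t0-0 2-1"  # source index 2 does not exist

print(out_of_bound_pairs(good))
print(out_of_bound_pairs(bad))
```

If a check like this reports no bad pairs on the raw corpus, the out-of-bound pairs are more likely introduced during retokenization than by the aligner.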
OK, it failed with only the Tags modifier enabled in my branch. We can't remove it in the inline noise branch because it's supposed to remap the alignments. https://firefox-ci-tc.services.mozilla.com/tasks/dTCAmrqKQgCg3e6hSEiKVA/runs/0/logs/public/logs/live.log
datasets:
  original: /home/ubuntu/tasks/task_171105412365522/fetches/corpus.lten.tsv # Original parallel corpus

stages:
  - train

train:
  - original 1.0
  - until original inf # General training until marian early stops

modifiers:
#- UpperCase: 0.07 # Apply randomly to 7% of sentences
#- TitleCase: 0.05
#- Typos: 0.05
## inserts new noise sentences
#- Noise: 0.0005
#  min_word_length: 2 # Minimum word length for each word in the noisy sentence
#  max_word_length: 5 # Maximum word length for each word in the noisy sentence
#  max_words: 6 # Maximum number of words in each noisy sentence
# generates inline noise (emojis etc.) matching position in source and target using alignments
# spm_vocab argument: retokenize alignments from spaces to Sentencepiece subwords and feed to Marian
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
  augment: 1
  spm_vocab: /home/ubuntu/tasks/task_171105412365522/fetches/vocab.spm

seed: 1111
# parallel sentences + token alignments
num_fields: 3
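For context on what "retokenize alignments from spaces to Sentencepiece subwords" involves, here is a rough sketch of one way to remap word-level alignment pairs to subword-level pairs. It is not OpusTrainer's implementation. It assumes each word is tokenized independently and that every subword of an aligned source word is linked to every subword of the aligned target word. `toy_tokenize` is a hypothetical stand-in for a real SentencePiece model, so the sketch runs without extra dependencies.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[int, int]
Tokenizer = Callable[[str], List[str]]


def word_to_piece_spans(words: Sequence[str], tokenize: Tokenizer) -> List[range]:
    """For each word, return the range of subword indices it occupies."""
    spans = []
    start = 0
    for word in words:
        n_pieces = len(tokenize(word))
        spans.append(range(start, start + n_pieces))
        start += n_pieces
    return spans


def remap_alignment(
    src_words: Sequence[str],
    trg_words: Sequence[str],
    pairs: Sequence[Pair],
    tokenize: Tokenizer,
) -> List[Pair]:
    """Expand word-level pairs into subword-level pairs (all-to-all per word pair)."""
    src_spans = word_to_piece_spans(src_words, tokenize)
    trg_spans = word_to_piece_spans(trg_words, tokenize)
    remapped = []
    for s, t in pairs:
        # A pair pointing at a word that does not exist cannot be remapped.
        if not (0 <= s < len(src_spans) and 0 <= t < len(trg_spans)):
            raise ValueError("Out-of-bound alignment pairs")
        remapped.extend((i, j) for i in src_spans[s] for j in trg_spans[t])
    return remapped


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical tokenizer: split a word into chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


src = "hello cat".split()      # pieces: hel lo | cat
trg = "labas katinas".split()  # pieces: lab as | kat ina s
print(remap_alignment(src, trg, [(0, 0), (1, 1)], toy_tokenize))
```

If the word-level indices were produced with a different tokenization than the one used when splitting into words here, the bounds check fails, which is the kind of mismatch the warnings in the log point to.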
I've found at least one bug in the implementation:
https://github.com/hplt-project/OpusTrainer/issues/53