marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Don't load entire corpus into memory on start up (enhancement request) #148

Closed bhaddow closed 6 years ago

bhaddow commented 6 years ago

GPU servers often don't have much RAM, so loading the full corpus into memory for shuffling limits the size of the corpus that can be used for training, particularly if there are other jobs on the machine, or if you want to start several Marian jobs at a time.

I realise that this got discussed in another thread, but I think it was off-topic there, and that thread was closed once the main issue was fixed.

emjotde commented 6 years ago

Agreed. The only problem is that this requires a fair amount of "boring" programming. I suppose we would need to prefix parallel lines with random numbers and then do a disk-based sort so the shuffle doesn't happen in memory. Any takers?

emjotde commented 6 years ago

BTW, this is confined to src/data/corpus.{h,cpp}, so it's not the most difficult task in the world and a good starting point for future contributors [hint, hint].

bhaddow commented 6 years ago

So you could load all the beginning-of-line offsets into memory and shuffle them. Or would the random access be a problem?

emjotde commented 6 years ago

Maybe memory-map the text files and use a vector of StringPiece as the corpus; that could work. I don't know how much of a problem the random access would be. Over time, the operating system should take care of that, and at least part of the corpus should remain in memory.
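
A minimal sketch of that idea (illustration only, not Marian code; std::string_view stands in for StringPiece, and the file name is hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <random>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Map a text file and return one string_view per line. The text itself stays
// on disk / in the page cache; only the (pointer, length) pairs live in RAM.
// Error handling omitted for brevity.
std::vector<std::string_view> mapLines(const char* path, void** base, size_t* len) {
  int fd = open(path, O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  *len = static_cast<size_t>(st.st_size);
  *base = mmap(nullptr, *len, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);

  const char* data = static_cast<const char*>(*base);
  std::vector<std::string_view> lines;
  size_t start = 0;
  for (size_t i = 0; i < *len; ++i)
    if (data[i] == '\n') {
      lines.emplace_back(data + start, i - start);
      start = i + 1;
    }
  return lines;
}

int main() {
  void* base = nullptr;
  size_t len = 0;
  auto lines = mapLines("corpus.src", &base, &len);  // hypothetical file name

  // Shuffle indices rather than the text; pages are faulted in lazily when a
  // line is actually read, which is exactly the random-access pattern under
  // discussion here.
  std::vector<size_t> order(lines.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::shuffle(order.begin(), order.end(), std::mt19937_64(1234));

  // ... iterate over order and hand lines[order[i]] to the batch builder ...

  munmap(base, len);
}
```

Shuffling the index vector instead of the lines keeps the extra memory proportional to the number of lines (one index plus one view each) rather than to the corpus size.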

emjotde commented 6 years ago

@kpu if you do random access on memory-mapped files, does it buffer data around the access?

kpu commented 6 years ago

If lazily mmapped, performance is going to be terrible for random access. If MAP_POPULATE then it's going to be in memory anyway.

Mapping happens a page at a time: 4 kB minimum. But by the end you will have read everything into RAM and created pressure to swap things out.

I feel like there's an external memory way to do this: open n temporary files, route each line to a random file, open file 0 and shuffle in RAM then use, open file 1 and shuffle, etc. Just have to get the distributions right.
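
A rough sketch of that two-pass scheme (illustration only; shard names and the shard count are hypothetical):

```cpp
#include <algorithm>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Pass 1: route each line to one of n temporary shards uniformly at random.
// Pass 2: read one shard at a time, shuffle it in RAM, and append it to the
// output. Peak memory is roughly corpus_size / n instead of the full corpus,
// and the result is a uniform random permutation of the input lines.
void externalShuffle(const std::string& input, const std::string& output, size_t n) {
  std::mt19937_64 rng(1234);
  std::uniform_int_distribution<size_t> pick(0, n - 1);

  std::vector<std::ofstream> shards;
  for (size_t i = 0; i < n; ++i)
    shards.emplace_back("shard." + std::to_string(i));  // temporary files

  std::ifstream in(input);
  std::string line;
  while (std::getline(in, line))
    shards[pick(rng)] << line << '\n';
  for (auto& s : shards) s.close();

  std::ofstream out(output);
  for (size_t i = 0; i < n; ++i) {
    std::ifstream shard("shard." + std::to_string(i));
    std::vector<std::string> buffer;
    while (std::getline(shard, line)) buffer.push_back(line);
    std::shuffle(buffer.begin(), buffer.end(), rng);  // in-RAM shuffle of one shard only
    for (const auto& l : buffer) out << l << '\n';
  }
}

int main() {
  externalShuffle("corpus.src", "corpus.shuf", 16);  // hypothetical file names
}
```

For a parallel corpus, both sides would have to be routed and shuffled with the same random decisions so that sentence pairs stay aligned across files.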

hieuhoang commented 6 years ago

This is a good opportunity to build a framework that can go beyond text data: multimodal translation, captioning, speech translation, etc.

emjotde commented 6 years ago

Despite that, I would be happy with a dirty trick that solves the OP's request, as it is a reasonable one.

emjotde commented 6 years ago

@kpu that does not sound too bad. Will take a look at it.

bhaddow commented 6 years ago

I have an implementation which works - it just splits the training file randomly into pieces on disk and then shuffles each piece in memory before concatenating. It currently lacks a short circuit (for the no-split case) and a means of configuring split size.

I also haven't done anything with the ids_ variable in Corpus. It appears to track the original position of each sentence, but I can't see it being used anywhere except for debugging. Is it necessary? It's more awkward to track it over the split files.

emjotde commented 6 years ago

Thanks for that.

The indices are used for training with guided alignment, and we are going to need them for a couple more things very soon for our grammar-correction work. At this point I would rather keep them.

emjotde commented 6 years ago

As for the indices, I think it may be a lot easier to keep them in RAM instead of trying to put them in the files too. Even for large corpora this will be manageable.

Just have a std::vector<std::vector<size_t>> where you record to which file each index went. Later, when shuffling a file in RAM, just recreate the same order in the corresponding std::vector<size_t>.
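
A small sketch of that bookkeeping (illustration only, hypothetical names): shuffle the lines of one split file and its recorded ids with the same permutation so every sentence keeps its original corpus position.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// ids[f][k] would hold the original corpus position of the k-th sentence
// routed to split file f. When file f is shuffled in RAM, apply the same
// permutation to both the lines and ids[f].
void shuffleWithIds(std::vector<std::string>& lines,
                    std::vector<size_t>& ids,
                    std::mt19937_64& rng) {
  std::vector<size_t> perm(lines.size());
  std::iota(perm.begin(), perm.end(), 0);
  std::shuffle(perm.begin(), perm.end(), rng);

  std::vector<std::string> newLines(lines.size());
  std::vector<size_t> newIds(ids.size());
  for (size_t k = 0; k < perm.size(); ++k) {
    newLines[k] = lines[perm[k]];
    newIds[k] = ids[perm[k]];
  }
  lines.swap(newLines);
  ids.swap(newIds);
}
```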

ghost commented 6 years ago

@bhaddow How large is the corpus you are dealing with? I would like to test how SQLite behaves for larger corpora. This would also have the advantage that the training corpus could be modified during training.

bhaddow commented 6 years ago

I've tried up to 390M sentences. Actually this arises from trying to mix two corpora in a specific ratio, and since Marian has no domain interpolation (AFAIK), I resorted to over-sampling. So I could have solved the problem by adding domain interpolation ... but even naturally occurring corpora can be over 100M sentences.

emjotde commented 6 years ago

I guess it makes sense to look at both things at once. Does the sampling from different corpora mix the data, i.e. does one batch contain sentences from the different files in roughly the right ratio?

ugermann commented 6 years ago

I wrote some code a few months ago for sampling corpora (without replacement). It memory-maps all relevant files, thus offering a smooth trade-off between available memory and speed. As an added bonus, all sentences in a batch have the same sentence length, filling up with slightly longer/shorter sentences once it runs out. This could easily be changed to an API that returns a random sentence (pair) within a given length range. Would there be interest for me to integrate that into Marian?

emjotde commented 6 years ago

With your suffix array? That sounds interesting. I was thinking of SQLite because it would allow associating additional data with sentence pairs, for instance alignments or references to paragraphs, etc.

kpu commented 6 years ago

It sounds like you're proposing shuffling by random access to disk. Where the interface to the disk happens to be mmap.

ugermann commented 6 years ago

@emjotde No suffix arrays involved. Much simpler than that, as @kpu summarizes pointedly (although I'm not just proposing it; the code exists). The advantage is that at no point do you actually have to load the entire corpus into memory for shuffling. It's not necessarily super-fast (because of random access to disk via mmap), but if speed is an issue, we could do that in the background to produce the next batch while Marian is processing the current one.
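
That background production is essentially double buffering; a minimal sketch (stand-in types and timings, nothing Marian-specific):

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

using Batch = std::vector<std::string>;  // stand-in for Marian's batch type

// Hypothetical stand-ins for reading/shuffling a batch and for a training step.
Batch readNextBatch() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // simulated disk I/O
  return Batch{"source line ||| target line"};
}
void processBatch(const Batch&) {
  std::this_thread::sleep_for(std::chrono::milliseconds(200));  // simulated GPU work
}

int main() {
  // While the current batch is being processed, the next one is already being
  // read on a background thread, so the slower compute hides the disk access.
  auto pending = std::async(std::launch::async, readNextBatch);
  for (int step = 0; step < 10; ++step) {
    Batch current = pending.get();                            // wait for the prefetch
    pending = std::async(std::launch::async, readNextBatch);  // start the next read
    processBatch(current);
  }
  pending.get();  // drain the last prefetch
}
```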

emjotde commented 6 years ago

With v1.3.0 of marian-dev you can now do --sqlite --tempdir directory/with/space. I tried it with 100M sentences. It works quite well: hardly any memory usage from the corpus or shuffling, and speed is decent.
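
The general idea behind an SQLite-backed corpus can be illustrated with the SQLite C API (illustration only; the table name and schema are hypothetical, and Marian's actual implementation may differ): the sentences stay on disk and SQLite produces a shuffled read order using its own temporary space, so the application only ever holds the current row.

```cpp
#include <sqlite3.h>

int main() {
  sqlite3* db = nullptr;
  sqlite3_open("corpus.sqlite3", &db);  // hypothetical database file

  sqlite3_stmt* stmt = nullptr;
  // ORDER BY random() makes SQLite sort on disk/temp space; the application
  // streams rows one at a time instead of holding the corpus in RAM.
  sqlite3_prepare_v2(db,
      "SELECT src, trg FROM sentences ORDER BY random();", -1, &stmt, nullptr);

  while (sqlite3_step(stmt) == SQLITE_ROW) {
    const unsigned char* src = sqlite3_column_text(stmt, 0);
    const unsigned char* trg = sqlite3_column_text(stmt, 1);
    // ... hand (src, trg) to the batch generator ...
    (void)src; (void)trg;
  }
  sqlite3_finalize(stmt);
  sqlite3_close(db);
}
```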

emjotde commented 6 years ago

Closing for now, feel free to re-open if not satisfied.

geovedi commented 5 years ago

@emjotde I'm testing the --sqlite option and noticed that plain text is stored. Would storing token IDs instead of plain text improve speed?

I'm training on a 520M corpus here and can confirm the memory usage is low.

emjotde commented 5 years ago

It's a trade-off, I suppose. Token IDs would take ages in the construction phase. Currently, you should only notice the token-parsing time during the very first pre-read, as no computation can happen without the data. Later there is no slow-down, as we pre-read in parallel with the computation, and pre-reading usually finishes much faster, so there is no idle time. If there is no idle time, there is no potential for speed-up.

eu9ene commented 2 years ago

Hi, any updates on this? It really limits training for us: a 50 GB corpus on a machine with 128 GB of RAM gets killed with an out-of-memory error.

Removing --shuffle-in-ram doesn't solve the issue. The workaround for me is to use shuffle: batches but it messes up training for some models.

Should we reopen this issue? cc @kpu @XapaJIaMnu

Logging output in case it helps:

[2022-06-11 00:12:44] [marian] Marian v1.11.7 e27da623 2022-06-06 13:32:58 +0100
[2022-06-11 00:12:44] [marian] Running on mlc4 as process 29 with command line:
[2022-06-11 00:12:44] [marian] /data/rw/evgeny/bergamot-training1/3rd_party/marian-dev/build/marian --model /data/rw/evgeny/models/fr-en/canyons/backward/model.npz -c configs/model/backward.yml configs/training/backward.train.yml --train-sets /data/rw/evgeny/data/fr-en/canyons/biclean/corpus.en.gz /data/rw/evgeny/data/fr-en/canyons/biclean/corpus.fr.gz -T /data/rw/evgeny/models/fr-en/canyons/backward/tmp --vocabs /data/rw/evgeny/models/fr-en/canyons/vocab/vocab.spm /data/rw/evgeny/models/fr-en/canyons/vocab/vocab.spm -w 8000 --devices 0 1 2 3 4 5 6 7 --sharding local --sync-sgd --valid-metrics chrf ce-mean-words bleu-detok --valid-sets /data/rw/evgeny/data/fr-en/canyons/original/devset.en.gz /data/rw/evgeny/data/fr-en/canyons/original/devset.fr.gz --valid-translation-output /data/rw/evgeny/models/fr-en/canyons/backward/devset.out --quiet-translation --overwrite --keep-best --log /data/rw/evgeny/models/fr-en/canyons/backward/train.log --valid-log /data/rw/evgeny/models/fr-en/canyons/backward/valid.log --after 2e
[2022-06-11 00:12:44] [config] after: 2e
[2022-06-11 00:12:44] [config] after-batches: 0
[2022-06-11 00:12:44] [config] after-epochs: 0
[2022-06-11 00:12:44] [config] all-caps-every: 0
[2022-06-11 00:12:44] [config] allow-unk: false
[2022-06-11 00:12:44] [config] authors: false
[2022-06-11 00:12:44] [config] beam-size: 12
[2022-06-11 00:12:44] [config] bert-class-symbol: "[CLS]"
[2022-06-11 00:12:44] [config] bert-mask-symbol: "[MASK]"
[2022-06-11 00:12:44] [config] bert-masking-fraction: 0.15
[2022-06-11 00:12:44] [config] bert-sep-symbol: "[SEP]"
[2022-06-11 00:12:44] [config] bert-train-type-embeddings: true
[2022-06-11 00:12:44] [config] bert-type-vocab-size: 2
[2022-06-11 00:12:44] [config] build-info: ""
[2022-06-11 00:12:44] [config] check-gradient-nan: false
[2022-06-11 00:12:44] [config] check-nan: false
[2022-06-11 00:12:44] [config] cite: false
[2022-06-11 00:12:44] [config] clip-norm: 1
[2022-06-11 00:12:44] [config] cost-scaling:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] cost-type: ce-mean-words
[2022-06-11 00:12:44] [config] cpu-threads: 0
[2022-06-11 00:12:44] [config] data-threads: 8
[2022-06-11 00:12:44] [config] data-weighting: ""
[2022-06-11 00:12:44] [config] data-weighting-type: sentence
[2022-06-11 00:12:44] [config] dec-cell: gru
[2022-06-11 00:12:44] [config] dec-cell-base-depth: 2
[2022-06-11 00:12:44] [config] dec-cell-high-depth: 1
[2022-06-11 00:12:44] [config] dec-depth: 1
[2022-06-11 00:12:44] [config] devices:
[2022-06-11 00:12:44] [config]   - 0
[2022-06-11 00:12:44] [config]   - 1
[2022-06-11 00:12:44] [config]   - 2
[2022-06-11 00:12:44] [config]   - 3
[2022-06-11 00:12:44] [config]   - 4
[2022-06-11 00:12:44] [config]   - 5
[2022-06-11 00:12:44] [config]   - 6
[2022-06-11 00:12:44] [config]   - 7
[2022-06-11 00:12:44] [config] dim-emb: 512
[2022-06-11 00:12:44] [config] dim-rnn: 1024
[2022-06-11 00:12:44] [config] dim-vocabs:
[2022-06-11 00:12:44] [config]   - 32000
[2022-06-11 00:12:44] [config]   - 32000
[2022-06-11 00:12:44] [config] disp-first: 0
[2022-06-11 00:12:44] [config] disp-freq: 1000
[2022-06-11 00:12:44] [config] disp-label-counts: true
[2022-06-11 00:12:44] [config] dropout-rnn: 0
[2022-06-11 00:12:44] [config] dropout-src: 0
[2022-06-11 00:12:44] [config] dropout-trg: 0
[2022-06-11 00:12:44] [config] dump-config: ""
[2022-06-11 00:12:44] [config] dynamic-gradient-scaling:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] early-stopping: 5
[2022-06-11 00:12:44] [config] early-stopping-on: first
[2022-06-11 00:12:44] [config] embedding-fix-src: false
[2022-06-11 00:12:44] [config] embedding-fix-trg: false
[2022-06-11 00:12:44] [config] embedding-normalization: false
[2022-06-11 00:12:44] [config] embedding-vectors:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] enc-cell: gru
[2022-06-11 00:12:44] [config] enc-cell-depth: 1
[2022-06-11 00:12:44] [config] enc-depth: 1
[2022-06-11 00:12:44] [config] enc-type: bidirectional
[2022-06-11 00:12:44] [config] english-title-case-every: 0
[2022-06-11 00:12:44] [config] exponential-smoothing: True
[2022-06-11 00:12:44] [config] factor-weight: 1
[2022-06-11 00:12:44] [config] factors-combine: sum
[2022-06-11 00:12:44] [config] factors-dim-emb: 0
[2022-06-11 00:12:44] [config] gradient-checkpointing: false
[2022-06-11 00:12:44] [config] gradient-norm-average-window: 100
[2022-06-11 00:12:44] [config] guided-alignment: none
[2022-06-11 00:12:44] [config] guided-alignment-cost: ce
[2022-06-11 00:12:44] [config] guided-alignment-weight: 0.1
[2022-06-11 00:12:44] [config] ignore-model-config: false
[2022-06-11 00:12:44] [config] input-types:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] interpolate-env-vars: false
[2022-06-11 00:12:44] [config] keep-best: true
[2022-06-11 00:12:44] [config] label-smoothing: 0
[2022-06-11 00:12:44] [config] layer-normalization: True
[2022-06-11 00:12:44] [config] learn-rate: 0.0001
[2022-06-11 00:12:44] [config] lemma-dependency: ""
[2022-06-11 00:12:44] [config] lemma-dim-emb: 0
[2022-06-11 00:12:44] [config] log: /data/rw/evgeny/models/fr-en/canyons/backward/train.log
[2022-06-11 00:12:44] [config] log-level: info
[2022-06-11 00:12:44] [config] log-time-zone: ""
[2022-06-11 00:12:44] [config] logical-epoch:
[2022-06-11 00:12:44] [config]   - 1e
[2022-06-11 00:12:44] [config]   - 0
[2022-06-11 00:12:44] [config] lr-decay: 0
[2022-06-11 00:12:44] [config] lr-decay-freq: 50000
[2022-06-11 00:12:44] [config] lr-decay-inv-sqrt:
[2022-06-11 00:12:44] [config]   - 0
[2022-06-11 00:12:44] [config] lr-decay-repeat-warmup: false
[2022-06-11 00:12:44] [config] lr-decay-reset-optimizer: false
[2022-06-11 00:12:44] [config] lr-decay-start:
[2022-06-11 00:12:44] [config]   - 10
[2022-06-11 00:12:44] [config]   - 1
[2022-06-11 00:12:44] [config] lr-decay-strategy: epoch+stalled
[2022-06-11 00:12:44] [config] lr-report: false
[2022-06-11 00:12:44] [config] lr-warmup: 0
[2022-06-11 00:12:44] [config] lr-warmup-at-reload: false
[2022-06-11 00:12:44] [config] lr-warmup-cycle: false
[2022-06-11 00:12:44] [config] lr-warmup-start-rate: 0
[2022-06-11 00:12:44] [config] max-length: 100
[2022-06-11 00:12:44] [config] max-length-crop: false
[2022-06-11 00:12:44] [config] max-length-factor: 3
[2022-06-11 00:12:44] [config] maxi-batch: 1000
[2022-06-11 00:12:44] [config] maxi-batch-sort: trg
[2022-06-11 00:12:44] [config] mini-batch: 64
[2022-06-11 00:12:44] [config] mini-batch-fit: True
[2022-06-11 00:12:44] [config] mini-batch-fit-step: 10
[2022-06-11 00:12:44] [config] mini-batch-round-up: true
[2022-06-11 00:12:44] [config] mini-batch-track-lr: false
[2022-06-11 00:12:44] [config] mini-batch-warmup: 0
[2022-06-11 00:12:44] [config] mini-batch-words: 0
[2022-06-11 00:12:44] [config] mini-batch-words-ref: 0
[2022-06-11 00:12:44] [config] model: /data/rw/evgeny/models/fr-en/canyons/backward/model.npz
[2022-06-11 00:12:44] [config] multi-loss-type: sum
[2022-06-11 00:12:44] [config] n-best: false
[2022-06-11 00:12:44] [config] no-nccl: false
[2022-06-11 00:12:44] [config] no-reload: false
[2022-06-11 00:12:44] [config] no-restore-corpus: false
[2022-06-11 00:12:44] [config] normalize: 1
[2022-06-11 00:12:44] [config] normalize-gradient: false
[2022-06-11 00:12:44] [config] num-devices: 0
[2022-06-11 00:12:44] [config] optimizer: adam
[2022-06-11 00:12:44] [config] optimizer-delay: 1
[2022-06-11 00:12:44] [config] optimizer-params:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] output-omit-bias: false
[2022-06-11 00:12:44] [config] overwrite: true
[2022-06-11 00:12:44] [config] precision:
[2022-06-11 00:12:44] [config]   - float32
[2022-06-11 00:12:44] [config]   - float32
[2022-06-11 00:12:44] [config] pretrained-model: ""
[2022-06-11 00:12:44] [config] quantize-biases: false
[2022-06-11 00:12:44] [config] quantize-bits: 0
[2022-06-11 00:12:44] [config] quantize-log-based: false
[2022-06-11 00:12:44] [config] quantize-optimization-steps: 0
[2022-06-11 00:12:44] [config] quiet: false
[2022-06-11 00:12:44] [config] quiet-translation: true
[2022-06-11 00:12:44] [config] relative-paths: false
[2022-06-11 00:12:44] [config] right-left: false
[2022-06-11 00:12:44] [config] save-freq: 10000
[2022-06-11 00:12:44] [config] seed: 0
[2022-06-11 00:12:44] [config] sentencepiece-alphas:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] sentencepiece-max-lines: 2000000
[2022-06-11 00:12:44] [config] sentencepiece-options: ""
[2022-06-11 00:12:44] [config] sharding: local
[2022-06-11 00:12:44] [config] shuffle: data
[2022-06-11 00:12:44] [config] shuffle-in-ram: false
[2022-06-11 00:12:44] [config] sigterm: save-and-exit
[2022-06-11 00:12:44] [config] skip: false
[2022-06-11 00:12:44] [config] sqlite: ""
[2022-06-11 00:12:44] [config] sqlite-drop: false
[2022-06-11 00:12:44] [config] sync-freq: 200u
[2022-06-11 00:12:44] [config] sync-sgd: true
[2022-06-11 00:12:44] [config] tempdir: /data/rw/evgeny/models/fr-en/canyons/backward/tmp
[2022-06-11 00:12:44] [config] tied-embeddings: false
[2022-06-11 00:12:44] [config] tied-embeddings-all: True
[2022-06-11 00:12:44] [config] tied-embeddings-src: false
[2022-06-11 00:12:44] [config] train-embedder-rank:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] train-sets:
[2022-06-11 00:12:44] [config]   - /data/rw/evgeny/data/fr-en/canyons/biclean/corpus.en.gz
[2022-06-11 00:12:44] [config]   - /data/rw/evgeny/data/fr-en/canyons/biclean/corpus.fr.gz
[2022-06-11 00:12:44] [config] transformer-aan-activation: swish
[2022-06-11 00:12:44] [config] transformer-aan-depth: 2
[2022-06-11 00:12:44] [config] transformer-aan-nogate: false
[2022-06-11 00:12:44] [config] transformer-decoder-autoreg: self-attention
[2022-06-11 00:12:44] [config] transformer-decoder-dim-ffn: 0
[2022-06-11 00:12:44] [config] transformer-decoder-ffn-depth: 0
[2022-06-11 00:12:44] [config] transformer-depth-scaling: false
[2022-06-11 00:12:44] [config] transformer-dim-aan: 2048
[2022-06-11 00:12:44] [config] transformer-dim-ffn: 2048
[2022-06-11 00:12:44] [config] transformer-dropout: 0
[2022-06-11 00:12:44] [config] transformer-dropout-attention: 0
[2022-06-11 00:12:44] [config] transformer-dropout-ffn: 0
[2022-06-11 00:12:44] [config] transformer-ffn-activation: swish
[2022-06-11 00:12:44] [config] transformer-ffn-depth: 2
[2022-06-11 00:12:44] [config] transformer-guided-alignment-layer: last
[2022-06-11 00:12:44] [config] transformer-heads: 8
[2022-06-11 00:12:44] [config] transformer-no-projection: false
[2022-06-11 00:12:44] [config] transformer-pool: false
[2022-06-11 00:12:44] [config] transformer-postprocess: dan
[2022-06-11 00:12:44] [config] transformer-postprocess-emb: d
[2022-06-11 00:12:44] [config] transformer-postprocess-top: ""
[2022-06-11 00:12:44] [config] transformer-preprocess: ""
[2022-06-11 00:12:44] [config] transformer-tied-layers:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] transformer-train-position-embeddings: false
[2022-06-11 00:12:44] [config] tsv: false
[2022-06-11 00:12:44] [config] tsv-fields: 0
[2022-06-11 00:12:44] [config] type: s2s
[2022-06-11 00:12:44] [config] ulr: false
[2022-06-11 00:12:44] [config] ulr-dim-emb: 0
[2022-06-11 00:12:44] [config] ulr-dropout: 0
[2022-06-11 00:12:44] [config] ulr-keys-vectors: ""
[2022-06-11 00:12:44] [config] ulr-query-vectors: ""
[2022-06-11 00:12:44] [config] ulr-softmax-temperature: 1
[2022-06-11 00:12:44] [config] ulr-trainable-transformation: false
[2022-06-11 00:12:44] [config] unlikelihood-loss: false
[2022-06-11 00:12:44] [config] valid-freq: 10000
[2022-06-11 00:12:44] [config] valid-log: /data/rw/evgeny/models/fr-en/canyons/backward/valid.log
[2022-06-11 00:12:44] [config] valid-max-length: 1000
[2022-06-11 00:12:44] [config] valid-metrics:
[2022-06-11 00:12:44] [config]   - chrf
[2022-06-11 00:12:44] [config]   - ce-mean-words
[2022-06-11 00:12:44] [config]   - bleu-detok
[2022-06-11 00:12:44] [config] valid-mini-batch: 64
[2022-06-11 00:12:44] [config] valid-reset-stalled: false
[2022-06-11 00:12:44] [config] valid-script-args:
[2022-06-11 00:12:44] [config]   []
[2022-06-11 00:12:44] [config] valid-script-path: ""
[2022-06-11 00:12:44] [config] valid-sets:
[2022-06-11 00:12:44] [config]   - /data/rw/evgeny/data/fr-en/canyons/original/devset.en.gz
[2022-06-11 00:12:44] [config]   - /data/rw/evgeny/data/fr-en/canyons/original/devset.fr.gz
[2022-06-11 00:12:44] [config] valid-translation-output: /data/rw/evgeny/models/fr-en/canyons/backward/devset.out
[2022-06-11 00:12:44] [config] vocabs:
[2022-06-11 00:12:44] [config]   - /data/rw/evgeny/models/fr-en/canyons/vocab/vocab.spm
[2022-06-11 00:12:44] [config]   - /data/rw/evgeny/models/fr-en/canyons/vocab/vocab.spm
[2022-06-11 00:12:44] [config] word-penalty: 0
[2022-06-11 00:12:44] [config] word-scores: false
[2022-06-11 00:12:44] [config] workspace: 8000
[2022-06-11 00:12:44] [config] Model is being created with Marian v1.11.7 e27da623 2022-06-06 13:32:58 +0100
[2022-06-11 00:12:44] Using synchronous SGD
[2022-06-11 00:12:44] [comm] Compiled without MPI support. Running as a single process on mlc4
[2022-06-11 00:12:44] Synced seed 1654899164
[2022-06-11 00:12:44] [data] Loading SentencePiece vocabulary from file /data/rw/evgeny/models/fr-en/canyons/vocab/vocab.spm
[2022-06-11 00:12:44] [data] Setting vocabulary size for input 0 to 32,000
[2022-06-11 00:12:44] [data] Loading SentencePiece vocabulary from file /data/rw/evgeny/models/fr-en/canyons/vocab/vocab.spm
[2022-06-11 00:12:44] [data] Setting vocabulary size for input 1 to 32,000
[2022-06-11 00:12:44] [batching] Collecting statistics for batch fitting with step size 10
[2022-06-11 00:12:45] [memory] Extending reserved space to 8064 MB (device gpu0)
[2022-06-11 00:12:45] [memory] Extending reserved space to 8064 MB (device gpu1)
[2022-06-11 00:12:46] [memory] Extending reserved space to 8064 MB (device gpu2)
[2022-06-11 00:12:46] [memory] Extending reserved space to 8064 MB (device gpu3)
[2022-06-11 00:12:46] [memory] Extending reserved space to 8064 MB (device gpu4)
[2022-06-11 00:12:47] [memory] Extending reserved space to 8064 MB (device gpu5)
[2022-06-11 00:12:47] [memory] Extending reserved space to 8064 MB (device gpu6)
[2022-06-11 00:12:47] [memory] Extending reserved space to 8064 MB (device gpu7)
[2022-06-11 00:12:47] [comm] Using NCCL 2.8.3 for GPU communication
[2022-06-11 00:12:47] [comm] Using global sharding
[2022-06-11 00:12:49] [comm] NCCLCommunicators constructed successfully
[2022-06-11 00:12:49] [training] Using 8 GPUs
[2022-06-11 00:12:49] [logits] Applying loss function for 1 factor(s)
[2022-06-11 00:12:49] [memory] Reserving 191 MB, device gpu0
[2022-06-11 00:12:50] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2022-06-11 00:12:50] [memory] Reserving 191 MB, device gpu0
[2022-06-11 00:13:20] [batching] Done. Typical MB size is 118,536 target words
[2022-06-11 00:13:20] [memory] Extending reserved space to 8064 MB (device gpu0)
[2022-06-11 00:13:20] [memory] Extending reserved space to 8064 MB (device gpu1)
[2022-06-11 00:13:21] [memory] Extending reserved space to 8064 MB (device gpu2)
[2022-06-11 00:13:21] [memory] Extending reserved space to 8064 MB (device gpu3)
[2022-06-11 00:13:21] [memory] Extending reserved space to 8064 MB (device gpu4)
[2022-06-11 00:13:21] [memory] Extending reserved space to 8064 MB (device gpu5)
[2022-06-11 00:13:21] [memory] Extending reserved space to 8064 MB (device gpu6)
[2022-06-11 00:13:21] [memory] Extending reserved space to 8064 MB (device gpu7)
[2022-06-11 00:13:21] [comm] Using NCCL 2.8.3 for GPU communication
[2022-06-11 00:13:21] [comm] Using global sharding
[2022-06-11 00:13:22] [comm] NCCLCommunicators constructed successfully
[2022-06-11 00:13:22] [training] Using 8 GPUs
[2022-06-11 00:13:22] Training started
[2022-06-11 00:13:22] [data] Shuffling data
tcmalloc: large alloc 1073741824 bytes == 0x564b5f964000 @
tcmalloc: large alloc 2147483648 bytes == 0x564cb01ca000 @
tcmalloc: large alloc 2147483648 bytes == 0x564d30a2e000 @
tcmalloc: large alloc 4294967296 bytes == 0x564f89098000 @
tcmalloc: large alloc 4294967296 bytes == 0x565089098000 @
tcmalloc: large alloc 8589934592 bytes == 0x565512866000 @
tcmalloc: large alloc 8589934592 bytes == 0x56571312e000 @
tcmalloc: large alloc 17179869184 bytes == 0x565e6cd60000 @
tcmalloc: large alloc 17179869184 bytes == 0x56626de2e000 @
src/central_freelist.cc:333] tcmalloc: allocation failed 16384
[2022-06-11 00:24:25] Error: Unhandled exception of type 'St9bad_alloc': std::bad_alloc
[2022-06-11 00:24:25] Error: Aborted from void unhandledException() in /data/rw/evgeny/bergamot-training1/3rd_party/marian-dev/src/common/logging.cpp:113

[CALL STACK]
[0x5649ffe85ea8]                                                       + 0x469ea8
[0x7fe1a33e7ae6]                                                       + 0x92ae6
[0x7fe1a33e7b21]                                                       + 0x92b21
[0x7fe1a33e7d54]                                                       + 0x92d54
[0x7fe1a36ea04b]                                                       + 0xc04b
[0x7fe1a36faa8c]    tc_newarray                                        + 0x20c
[0x5649ffd903b3]                                                       + 0x3743b3
[0x5649ffdba56a]    std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>>::  push_back  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&) + 0x2a
[0x5649fff791fb]    marian::data::Corpus::  shuffleData  (std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0x18db
[0x5649ffe4b45d]    marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x14ad
[0x5649ffd91ad6]    mainTrainer  (int,  char**)                        + 0x136
[0x5649ffd46855]    main                                               + 0x35
[0x7fe1a29cfbf7]    __libc_start_main                                  + 0xe7
[0x5649ffd8ff3a]    _start                                             + 0x2a

pipeline/train/train.sh: line 50:    29 Aborted                 (core dumped) "${MARIAN}/marian" --model "${model_dir}/model.npz" -c "configs/model/${model_type}.yml" "configs/training/${model_type}.${training_type}.yml" --train-sets "${train_set_prefix}".{"${src}","${trg}"}.gz -T "${model_dir}/tmp" --vocabs "${vocab}" "${vocab}" -w "${WORKSPACE}" --devices ${GPUS} --sharding local --sync-sgd --valid-metrics chrf ce-mean-words bleu-detok --valid-sets "${valid_set_prefix}".{"${src}","${trg}"}.gz --valid-translation-output "${model_dir}/devset.out" --quiet-translation --overwrite --keep-best --log "${model_dir}/train.log" --valid-log "${model_dir}/valid.log" "${extra_params[@]}"
snukky commented 2 years ago

Marian accepts training data provided as a continuous stream on stdin (e.g., cat corpus.tsv | marian -t stdin --tsv); maybe that can be used as another workaround?

The workaround for me is to use shuffle: batches but it messes up training for some models

Can you elaborate?

eu9ene commented 2 years ago

The workaround for me is to use shuffle: batches but it messes up training for some models

Can you elaborate?

I trained multiple models for the same language pair in opposite directions (two teachers and one backward model for each direction). I used the same parallel corpus without back-translations.

The first set was trained on a machine with twice as much RAM, where I can use the default shuffling mode and --shuffle-in-ram. The training curves look like this:

[Screenshot: training curves for the first set of models]

The second set (in the opposite direction) was trained on machines where I don't have enough memory and have to use shuffle: batches without --shuffle-in-ram. I see weird curves like these; the BLEU score is always 0 for one teacher, so basically it doesn't train.

[Screenshot: training curves for the second set of models]

This is the first time I've tried to train teachers with shuffle: batches. I tried it before for student models and didn't observe such issues. In both cases the corpus is pre-shuffled by an external script before training. Related issue: https://github.com/mozilla/firefox-translations-training/issues/21