marian-nmt / marian-examples

Examples, tutorials and use cases for Marian, including our WMT-2017/18 baselines.

Vocab can't be loaded from SentencePiece model #12

Closed alvations closed 5 years ago

alvations commented 5 years ago

After training as in https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece , marian-decoder throws an error when decoding:

~/marian/build/marian-decoder -c model.npz.decoder.yml 
[2019-03-04 09:22:36] [config] alignment: 0
[2019-03-04 09:22:36] [config] allow-unk: false
[2019-03-04 09:22:36] [config] beam-size: 6
[2019-03-04 09:22:36] [config] best-deep: false
[2019-03-04 09:22:36] [config] clip-gemm: 0
[2019-03-04 09:22:36] [config] cpu-threads: 0
[2019-03-04 09:22:36] [config] dec-cell: gru
[2019-03-04 09:22:36] [config] dec-cell-base-depth: 2
[2019-03-04 09:22:36] [config] dec-cell-high-depth: 1
[2019-03-04 09:22:36] [config] dec-depth: 6
[2019-03-04 09:22:36] [config] devices:
[2019-03-04 09:22:36] [config]   - 0
[2019-03-04 09:22:36] [config] dim-emb: 1024
[2019-03-04 09:22:36] [config] dim-rnn: 1024
[2019-03-04 09:22:36] [config] dim-vocabs:
[2019-03-04 09:22:36] [config]   - 32000
[2019-03-04 09:22:36] [config]   - 32000
[2019-03-04 09:22:36] [config] enc-cell: gru
[2019-03-04 09:22:36] [config] enc-cell-depth: 1
[2019-03-04 09:22:36] [config] enc-depth: 6
[2019-03-04 09:22:36] [config] enc-type: bidirectional
[2019-03-04 09:22:36] [config] ignore-model-config: false
[2019-03-04 09:22:36] [config] input:
[2019-03-04 09:22:36] [config]   - stdin
[2019-03-04 09:22:36] [config] interpolate-env-vars: false
[2019-03-04 09:22:36] [config] layer-normalization: false
[2019-03-04 09:22:36] [config] log-level: info
[2019-03-04 09:22:36] [config] max-length: 1000
[2019-03-04 09:22:36] [config] max-length-crop: false
[2019-03-04 09:22:36] [config] max-length-factor: 3
[2019-03-04 09:22:36] [config] maxi-batch: 100
[2019-03-04 09:22:36] [config] maxi-batch-sort: src
[2019-03-04 09:22:36] [config] mini-batch: 16
[2019-03-04 09:22:36] [config] mini-batch-words: 0
[2019-03-04 09:22:36] [config] models:
[2019-03-04 09:22:36] [config]   - /disk2/models/ja-en/model.npz
[2019-03-04 09:22:36] [config] n-best: false
[2019-03-04 09:22:36] [config] normalize: 0.6
[2019-03-04 09:22:36] [config] optimize: false
[2019-03-04 09:22:36] [config] port: 8080
[2019-03-04 09:22:36] [config] quiet: false
[2019-03-04 09:22:36] [config] quiet-translation: false
[2019-03-04 09:22:36] [config] relative-paths: false
[2019-03-04 09:22:36] [config] right-left: false
[2019-03-04 09:22:36] [config] seed: 0
[2019-03-04 09:22:36] [config] skip: false
[2019-03-04 09:22:36] [config] skip-cost: false
[2019-03-04 09:22:36] [config] tied-embeddings: false
[2019-03-04 09:22:36] [config] tied-embeddings-all: true
[2019-03-04 09:22:36] [config] tied-embeddings-src: false
[2019-03-04 09:22:36] [config] transformer-aan-activation: swish
[2019-03-04 09:22:36] [config] transformer-aan-depth: 2
[2019-03-04 09:22:36] [config] transformer-aan-nogate: false
[2019-03-04 09:22:36] [config] transformer-decoder-autoreg: self-attention
[2019-03-04 09:22:36] [config] transformer-dim-aan: 2048
[2019-03-04 09:22:36] [config] transformer-dim-ffn: 4096
[2019-03-04 09:22:36] [config] transformer-ffn-activation: swish
[2019-03-04 09:22:36] [config] transformer-ffn-depth: 2
[2019-03-04 09:22:36] [config] transformer-guided-alignment-layer: last
[2019-03-04 09:22:36] [config] transformer-heads: 8
[2019-03-04 09:22:36] [config] transformer-no-projection: false
[2019-03-04 09:22:36] [config] transformer-postprocess: da
[2019-03-04 09:22:36] [config] transformer-postprocess-emb: d
[2019-03-04 09:22:36] [config] transformer-preprocess: n
[2019-03-04 09:22:36] [config] transformer-tied-layers:
[2019-03-04 09:22:36] [config]   []
[2019-03-04 09:22:36] [config] type: transformer
[2019-03-04 09:22:36] [config] version: v1.7.6 9cc5b176 2018-12-14 15:11:34 -0800
[2019-03-04 09:22:36] [config] vocabs:
[2019-03-04 09:22:36] [config]   - /disk2/models/ja-en/vocab.src.spm
[2019-03-04 09:22:36] [config]   - /disk2/models/ja-en/vocab.trg.spm
[2019-03-04 09:22:36] [config] word-penalty: 0
[2019-03-04 09:22:36] [config] workspace: 512
[2019-03-04 09:22:36] [config] Model created with Marian v1.7.6 9cc5b176 2018-12-14 15:11:34 -0800
[2019-03-04 09:22:36] [data] Loading vocabulary from text file /disk2/models/ja-en/vocab.src.spm
[2019-03-04 09:22:36] Vocabulary file /disk2/models/ja-en/vocab.src.spm must not contain empty lines
Aborted from int marian::Vocab::load(const string&, int) in /home/ltan/marian/src/marian/src/data/vocab.cpp: 117

My config file looks like this:

$ cat model.npz.decoder.yml 
models:
  - /disk2/models/ja-en/model.npz
vocabs:
  - /disk2/models/ja-en/vocab.src.spm
  - /disk2/models/ja-en/vocab.trg.spm
beam-size: 6
normalize: 0.6
word-penalty: 0
mini-batch: 16
maxi-batch: 100
maxi-batch-sort: src
relative-paths: false

Is there a special argument that needs to be passed to the decoder when SentencePiece is used as the tokenizer?
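The log above hints at what went wrong: the decoder reports "Loading vocabulary from text file /disk2/models/ja-en/vocab.src.spm", i.e. it is reading the SentencePiece model with its plain-text vocabulary loader. A .spm file is a binary protobuf, so parsing it as a newline-delimited word list fails with the "must not contain empty lines" abort. A quick sanity check for whether a vocabulary file is binary (likely a SentencePiece model) or a Marian text vocabulary might look like this; `is_binary_spm` is a hypothetical helper, not part of Marian:

```python
def is_binary_spm(path, probe=1024):
    """Heuristic: SentencePiece .spm models are binary protobufs, while
    Marian text vocabularies are newline-delimited UTF-8 text."""
    with open(path, "rb") as f:
        chunk = f.read(probe)
    # Binary protobuf data typically contains NUL bytes or byte
    # sequences that are not valid UTF-8; a text vocabulary has neither.
    if b"\x00" in chunk:
        return True
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

If this returns True for a file that Marian logs as "Loading vocabulary from text file", the binary is treating the SentencePiece model as a text vocab, which points at the build rather than the config.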

alvations commented 5 years ago

It's strange: I recompiled the binary and now it works, even though the recompiled binary reports the same version, v1.7.6 9cc5b176. At least it works now =)
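For readers hitting the same error: a plausible explanation, not confirmed in this thread, is that the first binary was built without SentencePiece support, in which case Marian falls back to its plain-text vocabulary loader for the .spm files. Rebuilding with SentencePiece enabled looks roughly like this (the `-DUSE_SENTENCEPIECE=on` flag is from the Marian build documentation; paths are examples):

```shell
# Rebuild Marian with native SentencePiece support compiled in.
cd ~/marian
mkdir -p build && cd build
cmake .. -DUSE_SENTENCEPIECE=on
make -j"$(nproc)"

# After rebuilding, the decoder should no longer log
# "Loading vocabulary from text file" for *.spm vocabularies.
```

The reported version string only reflects the source revision, not the build options, which would explain why two binaries with the same version behave differently.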