rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Error: invalid line 2 in BPE codes file when running apply_bpe.py #114

jo0704 closed this issue 2 years ago

jo0704 commented 2 years ago

Hi,

When running apply_bpe.py to segment my texts with the generated vocabulary, I get the following error:

Error: invalid line 2 in BPE codes file: bpeout/vocab
The line should exist of exactly two subword units, separated by whitespace

The exact command lines I used:

echo "#version: 0.2" > bpeout/vocab.seg # add version info
echo bpeout/vocab >> bpeout/vocab.seg
python3 subword-nmt/subword_nmt/apply_bpe.py -c bpeout/vocab.seg <X-EN/de_en/train.de >bpeout/train_out.de

I attached the vocab and the training file I'm trying to segment: bpe_vocab.zip

A similar issue was reported in https://github.com/rsennrich/subword-nmt/issues/46, but the solution given there doesn't resolve the error in my case.

rsennrich commented 2 years ago

Hi,

Your vocab seems to have been produced by a different script, and is invalid as a BPE codes file. Specifically, there are lines containing just one symbol (__en__), and it also does not distinguish between word-internal and word-final merge operations.
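
For anyone hitting the same message, here is a minimal sketch of the kind of check the error describes: every line after an optional "#version:" header must contain exactly two subword units. This is not the code from apply_bpe.py. The function name find_invalid_merge_lines and the sample data are made up for illustration, and the check splits on any whitespace, which is a simplification. The sample mirrors what the commands in the report would likely produce, since `echo bpeout/vocab` appends the path string itself rather than the contents of that file.

```python
import io


def find_invalid_merge_lines(lines):
    """Return (line_number, text) for each line that is not a two-symbol merge.

    Hypothetical helper for illustration only; it is not part of subword-nmt.
    """
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip("\r\n ")
        # An optional first line such as "#version: 0.2" is a header, not a merge.
        if number == 1 and line.startswith("#version:"):
            continue
        # Assumption: a valid merge line is exactly two whitespace-separated symbols.
        if len(line.split()) != 2:
            bad.append((number, line))
    return bad


# Made-up sample: a header, a stray path string, then two well-formed merges.
sample = io.StringIO("#version: 0.2\nbpeout/vocab\nt h\nth e</w>\n")
print(find_invalid_merge_lines(sample))  # [(2, 'bpeout/vocab')]
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `find_invalid_merge_lines(open("bpeout/vocab.seg", encoding="utf-8"))`. A check like this only catches the line-format problem. It will not tell you whether the file marks word-final symbols (the `</w>` suffix in version 0.2 codes files), which is the second problem mentioned above.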