rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

About Programmatically usage #76

Closed loretoparisi closed 5 years ago

loretoparisi commented 5 years ago

I'm trying to use the package programmatically. This is what I'm doing:

    import codecs

    from subword_nmt.apply_bpe import BPE, read_vocabulary

    # bpe_codes, bpe_vocab, vocabulary_threshold and line are defined elsewhere
    # read the codes and vocabulary files as UTF-8
    bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
    bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
    vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)

    bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
    codes = bpe.process_line(line)

Is that correct? Also, I'm not sure about vocabulary_threshold, since I don't see a default value. Is there one?

Thank you.

rsennrich commented 5 years ago

This looks mostly fine. Two remarks:

loretoparisi commented 5 years ago

Thank you, I'm going to split lines to handle both cases then. My model comes with both a vocabulary and codes, but now I wonder how to choose the right threshold. Assuming I already have a vocabulary, should I compute some statistics to find the lowest-frequency words?

loretoparisi commented 5 years ago

@rsennrich I get this error:

Error: invalid line 1 in BPE codes file: e n 52708119
The line should exist of exactly two subword units, separated by whitespace

My codes and vocabulary files are from the FAIR LASER model, and they look like this:

root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab 
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes 
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691

The vocabulary is loaded correctly through the read_vocabulary API, but I get that error immediately, I presume at this line:

    encoder = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)

rsennrich commented 5 years ago

As to your first question, have a look at your vocabulary file - whether you set the threshold to 5 or 500 won't make a big difference for you, since most rare tokens are single (non-Latin) characters that won't be affected by this.
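To see how much a given threshold would actually change, one option is to count how many vocabulary entries fall below it. The following is only a rough sketch: it assumes the vocabulary file has one "token frequency" pair per line (as in the 93langs.fvocab excerpt above) and that entries with a frequency below the threshold are the ones that get dropped. The function name count_below_threshold and the last two sample entries are made up for illustration.

```python
import io


def count_below_threshold(vocab_lines, thresholds):
    """Count vocabulary entries whose frequency is below each threshold.

    Assumes each line looks like "<token> <frequency>".
    """
    freqs = []
    for line in vocab_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip empty or malformed lines
        freqs.append(int(parts[1]))
    return {t: sum(1 for f in freqs if f < t) for t in thresholds}


# First three entries are from the excerpt above; "xq" and "zz" are invented
# rare tokens, added only so that the thresholds have something to filter.
sample = io.StringIO(". 87264459\n, 78156033\nde 19001435\nxq 3\nzz 120\n")
counts = count_below_threshold(sample, [5, 500])
print(counts)  # {5: 1, 500: 2}
```

With a real vocabulary, the open file handle can be passed in place of the sample; if the counts for two thresholds are close, the choice between them is unlikely to matter much.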

FAIR LASER uses a different BPE implementation ( https://github.com/glample/fastBPE ), which seems to store the BPE file in a different format. It might work if you simply remove the third item in each entry (the frequency), but I can't guarantee there's no other inconsistency, e.g. in how UTF-8 whitespace is handled.
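Removing the third item can be done without touching the subword-nmt source. Below is a minimal sketch, assuming the fastBPE codes file has exactly the "left right frequency" layout shown in the 93langs.fcodes excerpt; the helper name strip_frequency_column is hypothetical, and other differences between the two formats (for example whitespace handling) are not addressed.

```python
def strip_frequency_column(codes_lines):
    """Yield "left right" merge pairs, dropping any trailing frequency column."""
    for line in codes_lines:
        stripped = line.strip("\r\n ")
        if not stripped:
            continue  # ignore blank lines
        if stripped.startswith("#version:"):
            yield stripped  # keep an existing version header untouched
            continue
        # keep only the two subword units, discard the frequency
        yield " ".join(stripped.split(" ")[:2])


# Lines taken from the 93langs.fcodes excerpt above
fastbpe_style = ["e n 52708119\n", "e r 51024442\n", "e n</w> 47209692\n"]
converted = list(strip_frequency_column(fastbpe_style))
print(converted)  # ['e n', 'e r', 'e n</w>']
```

The converted lines can then be written to a new codes file, one pair per line, leaving the original file as it is.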

loretoparisi commented 5 years ago

@rsennrich thank you, looking at the results it seems the problem is the third column only, so we did

self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')[:2]) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]

Regarding compatibility with fastBPE, I thought there was an official approach to follow, so to speak. Even though I load the same codes and dictionary, I get different results:

Using fastBPE

hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad

Using subword-nmt

ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad

What could be the issue here?

rsennrich commented 5 years ago

try adding this as the first line to the BPE file:

#version: 0.2

the reason for this is explained in the README. It looks like fastBPE implements the new variant (v 0.2) as well.

loretoparisi commented 5 years ago

@rsennrich ok, so basically subword-nmt needs the comment to detect the version. The only issue I see is that with a pre-trained file I may not be able to modify it. Thanks, closing.
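If the pre-trained file cannot be modified, one possible workaround is to add the version line in memory and hand the resulting stream to the library instead of the file. This is a sketch under assumptions: the helper name with_version_header is hypothetical, and it relies on the consumer accepting any seekable text stream in place of an open file.

```python
import io


def with_version_header(codes_text, version="0.2"):
    """Return a seekable text stream whose first line is a '#version:' header.

    The file on disk is not touched; only the in-memory copy is changed.
    """
    if codes_text.startswith("#version:"):
        return io.StringIO(codes_text)  # header already present
    return io.StringIO("#version: {}\n".format(version) + codes_text)


# Merge pairs in subword-nmt format (two subword units per line)
original = "e n\ne r\ne n</w>\n"
stream = with_version_header(original)
first_line = stream.readline()
stream.seek(0)  # rewind so the consumer reads from the start
print(repr(first_line))  # '#version: 0.2\n'
```

In practice the text would come from reading the pre-trained codes file as UTF-8, and the returned stream would be passed as the first argument to BPE where bpe_codes_fin is used above; that last step is not exercised in this sketch.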

RenShuhuai-Andy commented 4 years ago

try adding this as the first line to the BPE file:

#version: 0.2

the reason for this is explained in the README. It looks like fastBPE implements the new variant (v 0.2) as well.

Hi, this doesn't work for me. Before adding #version: 0.2, the error is "Error: invalid line 1 in BPE codes file: e n</w> 1423551864"; after adding it, the error is "Error: invalid line 2 in BPE codes file: e n</w> 1423551864 ...". The BPE file I used was downloaded from fairseq (transformer.wmt19.en-de), and I set export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8. Any advice? @rsennrich

Oh, I have solved this problem: I had set the bpe parameter incorrectly, sorry.