About Programmatically usage

loretoparisi commented 5 years ago

I'm trying to use the package programmatically. This is what I'm doing:

    import codecs
    from subword_nmt.apply_bpe import BPE, read_vocabulary

    # read/write files as UTF-8
    bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
    bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
    vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)

    bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
    codes = bpe.process_line(line)

Is that correct? Also, I'm not sure about vocabulary_threshold, since I don't see a default value. Is there one?

Thank you.

rsennrich commented 5 years ago

This looks mostly fine. Two remarks:

vocabulary is optional. Its function is described in the README. If you use it, you can also provide a vocabulary threshold, which filters low-frequency items out of the vocabulary, but this is optional as well.
You will typically want to apply BPE to more than one line. If so, make sure that only the last line (the call to process_line) is executed repeatedly.
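
A minimal sketch of the second remark: the encoder is built once, and only process_line runs per input line. The helper names make_encoder and segment_lines are hypothetical and not part of subword-nmt; BPE, read_vocabulary and process_line are the calls from the snippet above. Passing None as the threshold is assumed here to mean "no filtering".

```python
import codecs


def make_encoder(codes_path, vocab_path=None, threshold=None):
    """Build the BPE encoder once (hypothetical helper, not part of subword-nmt)."""
    # Imported lazily so the rest of this sketch runs without the package installed.
    from subword_nmt.apply_bpe import BPE, read_vocabulary

    vocabulary = None
    if vocab_path is not None:
        with codecs.open(vocab_path, encoding='utf-8') as vocab_file:
            # Assumption: threshold=None keeps every vocabulary entry.
            vocabulary = read_vocabulary(vocab_file, threshold)
    with codecs.open(codes_path, encoding='utf-8') as codes_file:
        return BPE(codes_file, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)


def segment_lines(encoder, lines):
    """Yield one segmented line per input line, reusing the same encoder."""
    for line in lines:
        yield encoder.process_line(line)
```

With this split, something like segment_lines(make_encoder(bpe_codes, bpe_vocab), open_file) would read the codes and vocabulary a single time, however many lines follow.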

loretoparisi commented 5 years ago

Thank you, I'm going to split the lines to handle both cases then. My model comes with both a vocabulary and codes, so my question now is: how do I choose the right threshold? Assuming I already have a vocabulary, should I compute some statistics to find the lowest-frequency words?

loretoparisi commented 5 years ago

@rsennrich I get this error:

Error: invalid line 1 in BPE codes file: e n 52708119
The line should exist of exactly two subword units, separated by whitespace

My codes and vocabulary files are from the FAIR LASER model and look like this:

root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab 
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes 
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691

The vocabulary is loaded correctly through the read_vocabulary API, but I get that error immediately, I presume at this line:

    encoder = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)

rsennrich commented 5 years ago

As to your first question, have a look at your vocabulary file - whether you set the threshold to 5 or 500 won't make a big difference for you, since most rare tokens are single (non-Latin) characters that won't be affected by this.
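
One rough way to check this on your own vocabulary file is to count how many entries fall below a few candidate thresholds. This is a standalone sketch, not a subword-nmt function: count_below_threshold is a hypothetical name, the "token frequency" layout is assumed from the head output above, and the last sample entry is made up for illustration.

```python
import io


def count_below_threshold(vocab_lines, thresholds):
    """Return {threshold: number of entries whose frequency is below it}."""
    freqs = []
    for line in vocab_lines:
        line = line.strip('\r\n ')
        if not line:
            continue
        # Each entry is assumed to be "<token> <frequency>".
        token, freq = line.rsplit(' ', 1)
        freqs.append(int(freq))
    return {t: sum(1 for f in freqs if f < t) for t in thresholds}


# The first three entries come from the listing above; "xyz 3" is invented.
sample = io.StringIO(". 87264459\n, 78156033\nde 19001435\nxyz 3\n")
print(count_below_threshold(sample, [5, 500]))
```

If the counts for two thresholds are close, and the affected tokens are mostly single characters, the choice between them should matter little.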

FAIR LASER uses a different BPE implementation ( https://github.com/glample/fastBPE ), which seems to store the BPE file in a different format. It might work if you simply remove the third item in each entry (the frequency), but I can't guarantee there's no other inconsistency, e.g. in how UTF-8 whitespace is handled.
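
Removing the third item could be done along these lines, without touching subword-nmt itself. This is only a sketch: strip_frequencies is a hypothetical helper, and it assumes every entry is space-separated as in the fcodes listing above. It does not address any other format difference.

```python
def strip_frequencies(code_lines):
    """Keep only the two subword units of each merge entry, dropping the trailing count."""
    for line in code_lines:
        parts = line.strip('\r\n ').split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        yield ' '.join(parts[:2])


# Entries copied from the 93langs.fcodes listing above.
fastbpe_lines = ["e n 52708119", "e r 51024442", "e n</w> 47209692"]
print(list(strip_frequencies(fastbpe_lines)))
```

The cleaned lines can then be written to a new file, leaving the original codes file unchanged.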

loretoparisi commented 5 years ago

@rsennrich thank you. Looking at the results, the third column seems to be the only problem, so we did

    self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')[:2]) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]

Regarding compatibility with fastBPE, I thought there was an official approach to follow, so to speak. Even though I load the same codes and dictionary, I get different results:

Using fastBPE

hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad

Using subword-nmt

ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad

What could be the issue here?

rsennrich commented 5 years ago

Try adding this as the first line to the BPE file:

#version: 0.2

The reason for this is explained in the README. It looks like fastBPE implements the new variant (v0.2) as well.
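
As a rough illustration of why the header matters: as far as the README describes it, version 0.1 appends the end-of-word marker </w> as a separate symbol, while version 0.2 attaches it to the last character of the word. The function below is a simplified stand-in for that starting step, not the library's actual code, and initial_symbols is a hypothetical name.

```python
def initial_symbols(word, version):
    """Split a word into its starting symbols, before any merges are applied."""
    if version == (0, 1):
        # End-of-word marker is its own symbol.
        return tuple(word) + ('</w>',)
    if version == (0, 2):
        # End-of-word marker is attached to the final character.
        return tuple(word[:-1]) + (word[-1] + '</w>',)
    raise ValueError('unsupported version: {}'.format(version))


print(initial_symbols('den', (0, 1)))
print(initial_symbols('den', (0, 2)))
```

Under 0.2 a merge entry such as "e n</w>" can match the starting symbols directly. Under 0.1 the n and </w> start out as separate symbols, so that entry would never apply, which would be consistent with the heavier segmentation seen above.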

loretoparisi commented 5 years ago

@rsennrich ok, so basically subword-nmt needs the comment to detect the version. The only issue I see is that a pre-trained file cannot always be modified. Thanks, closing.
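
If the pre-trained file has to stay untouched, one possible workaround is to add the header in memory instead of on disk. This is a sketch under an assumption: that the BPE constructor only needs a file-like object supporting seek, readline and read, which io.StringIO provides. with_version_header is a hypothetical helper.

```python
import io


def with_version_header(code_lines, version='0.2'):
    """Return an in-memory file: a version header followed by the original entries."""
    buffer = io.StringIO()
    lines = [line.rstrip('\r\n') for line in code_lines]
    # Do not add a second header if the file already declares a version.
    if not (lines and lines[0].startswith('#version:')):
        buffer.write('#version: {}\n'.format(version))
    for line in lines:
        buffer.write(line + '\n')
    buffer.seek(0)
    return buffer
```

The returned buffer could then be passed in place of bpe_codes_fin, while the file on disk is only ever read.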

RenShuhuai-Andy commented 4 years ago

try adding this as the first line to the BPE file:
#version: 0.2
the reason for this is explained in the README. It looks like fastBPE implements the new variant (v 0.2) as well.

Hi, it doesn't work for me. Before adding #version: 0.2 the error log is "Error: invalid line 1 in BPE codes file: e n</w> 1423551864"; after adding it, the error is "Error: invalid line 2 in BPE codes file: e n</w> 1423551864 ...". The BPE file I used was downloaded from fairseq (transformer.wmt19.en-de), and I set export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8. Any advice? @rsennrich

RenShuhuai-Andy commented 4 years ago

Oh, I have solved this problem: I had set the bpe parameter incorrectly, sorry.

rsennrich / subword-nmt

About Programmatically usage #76