rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

How to use the old version of BPE with the separated </w> token? #85

Closed LearnedVector closed 4 years ago

LearnedVector commented 4 years ago

Hello, I have a use case that needs the old-style BPE algorithm, which treats </w> as a separate token. How do I use that version?

rsennrich commented 4 years ago

apply_bpe.py will continue to work with old-style BPE files. To create an old-style BPE file, you can either check out an old version or undo the relevant changes yourself. Specifically, change

https://github.com/rsennrich/subword-nmt/blob/4cac90b5c2eda30e9069c094789b5c3cabc2e79f/subword_nmt/learn_bpe.py#L212

to

vocab = dict([(tuple(x)+('</w>',) ,y) for (x,y) in vocab.items()])

and remove this line:

https://github.com/rsennrich/subword-nmt/blob/4cac90b5c2eda30e9069c094789b5c3cabc2e79f/subword_nmt/learn_bpe.py#L209
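To make the difference concrete, here is a minimal sketch that builds a toy vocabulary in both styles. The old-style function uses the replacement line quoted above. The new-style function is an assumption about what the current line does (attaching </w> to the last character of each word), so check it against the linked source. The function names and the toy corpus are hypothetical and not part of subword-nmt.

```python
from collections import Counter


def old_style_vocab(vocab):
    # Old style: '</w>' is appended as its own symbol after the last character.
    # This is the replacement line quoted in the comment above.
    return dict([(tuple(x) + ('</w>',), y) for (x, y) in vocab.items()])


def new_style_vocab(vocab):
    # Assumed newer style: '</w>' is fused onto the last character of the word.
    # This is an illustration only; verify against the linked learn_bpe.py line.
    return dict([(tuple(x[:-1]) + (x[-1] + '</w>',), y) for (x, y) in vocab.items()])


# Hypothetical toy corpus: word -> frequency
toy_vocab = Counter("low low lower".split())

old = old_style_vocab(toy_vocab)
new = new_style_vocab(toy_vocab)

print(old)  # keys such as ('l', 'o', 'w', '</w>')
print(new)  # keys such as ('l', 'o', 'w</w>')
```

In the old style, a merge involving </w> has to be learned like any other pair, so the end-of-word marker can remain a separate symbol in the output. In the assumed newer style it always travels with the final character.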