rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Recover back code file #105

Closed Hannibal046 closed 2 years ago

Hannibal046 commented 3 years ago

Hello, I am wondering whether it is possible to recover the BPE codes file when I only have the BPE-encoded file. From that file I can get the vocabulary with split(' ') and remove the @@ markers with re.sub and the pattern r'(@@ )|(@@ ?$)', but I don't know how to encode new raw text. Thanks so much!

rsennrich commented 2 years ago

Apologies for the late response.

If you have the original training text, it might be possible to reverse-engineer the list and order of merge operations, but it would be easier to just re-learn BPE (the implementation is deterministic).

If you only have the subword vocabulary, you can write a simple dynamic program that will split each word into valid subwords (here's one example: https://www.geeksforgeeks.org/word-break-problem-dp-32/). Out of the possible segmentations, choose the shortest one, i.e. the one with the fewest subwords.
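
A minimal Python sketch of that dynamic program, assuming the BPE-encoded file uses the default "@@" continuation marker, so that every non-final piece of a word appears in the vocabulary with a trailing "@@" and the final piece appears without it. The function names (vocab_from_bpe_lines, segment_word, segment_line) are made up for illustration and are not part of subword-nmt. Ties between equally short segmentations are broken arbitrarily, and the result is not guaranteed to match what the original merge operations would have produced.

```python
from typing import Iterable, List, Optional, Set

SEP = "@@"  # assumed continuation marker (the subword-nmt default)


def vocab_from_bpe_lines(lines: Iterable[str]) -> Set[str]:
    """Collect every whitespace-separated token of the BPE-encoded text."""
    vocab: Set[str] = set()
    for line in lines:
        vocab.update(line.split())
    return vocab


def segment_word(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `word` into the fewest subwords found in `vocab`, or return None."""
    n = len(word)
    # best[i]: fewest non-final pieces (each stored with SEP) covering word[:i]
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n):
        for start in range(end):
            prefix = best[start]
            if prefix is None:
                continue
            piece = word[start:end] + SEP
            current = best[end]
            if piece in vocab and (current is None or len(prefix) + 1 < len(current)):
                best[end] = prefix + [piece]
    # The last piece of a word carries no marker.
    result: Optional[List[str]] = None
    for start in range(n):
        prefix = best[start]
        if prefix is None:
            continue
        piece = word[start:]
        if piece in vocab and (result is None or len(prefix) + 1 < len(result)):
            result = prefix + [piece]
    return result


def segment_line(line: str, vocab: Set[str]) -> str:
    """Segment each word of a raw line; words with no valid split are kept whole."""
    out: List[str] = []
    for word in line.split():
        pieces = segment_word(word, vocab)
        out.extend(pieces if pieces is not None else [word])
    return " ".join(out)


if __name__ == "__main__":
    bpe_lines = ["low@@ er new@@ est", "lo@@ w wid@@ est"]
    vocab = vocab_from_bpe_lines(bpe_lines)
    print(segment_line("lower widest", vocab))
```

Keeping words that cannot be segmented unchanged is only one possible fallback; mapping them to an unknown-word token or splitting them into characters may suit your model better.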