Closed Hannibal046 closed 2 years ago
Apologies for the late response.
If you have the original training text, it might be possible to reverse-engineer the list and order of merge operations, but it would be easier to just re-learn BPE (the implementation is deterministic).
If you only have the subword vocabulary, you can write a simple dynamic program that splits each word into valid subwords (here's one example: https://www.geeksforgeeks.org/word-break-problem-dp-32/ ). Out of the possible segmentations, choose the one with the fewest subwords.
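A minimal sketch of that dynamic program, in case it helps. It assumes the vocabulary is collected from the BPE-encoded file by whitespace splitting and that non-final subwords carry the `@@` suffix; the names `collect_vocab` and `segment` are made up for illustration and are not part of subword-nmt. Ties between equally short segmentations are broken arbitrarily, so the output is not guaranteed to match what the original merge operations would have produced.

```python
from typing import Iterable, List, Optional, Set


def collect_vocab(bpe_lines: Iterable[str]) -> Set[str]:
    """Collect every subword token, keeping its '@@' marker if it has one."""
    vocab: Set[str] = set()
    for line in bpe_lines:
        vocab.update(line.split())
    return vocab


def segment(word: str, vocab: Set[str], sep: str = "@@") -> Optional[List[str]]:
    """Split `word` into the fewest subwords found in `vocab`.

    Returns None when the word cannot be covered by the vocabulary.
    """
    n = len(word)
    # best[i] is the shortest known segmentation of word[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            prefix = best[j]
            if prefix is None:
                continue
            # Only the last piece of a word appears without the separator.
            token = word[j:i] + (sep if i < n else "")
            if token not in vocab:
                continue
            current = best[i]
            if current is None or len(prefix) + 1 < len(current):
                best[i] = prefix + [token]
    return best[n]


if __name__ == "__main__":
    # Toy BPE-encoded lines, invented for this example.
    vocab = collect_vocab(["low@@ er new@@ est", "l@@ o@@ w wid@@ est"])
    for w in ["lower", "widest", "low", "xyz"]:
        pieces = segment(w, vocab)
        print(w, "->", " ".join(pieces) if pieces is not None else "<no segmentation>")
```

Words that return `None` contain character sequences never seen in the encoded file; how to handle them (for example, falling back to an unknown-word token) is a separate decision.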
Hello, I am wondering whether there is any way to recover the BPE codes file when I only have the BPE-encoded file. With such a file, I can get the vocab with `split(' ')` and remove the `@@` markers with `re.sub` using the pattern `r'(@@ )|(@@ ?$)'`, but I don't know how to encode new raw text. Thanks so much!