rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

How to find all valid BPEs for a word? #119

Closed. hlthu closed this issue 1 year ago.

hlthu commented 1 year ago

For a word like "hello", how can I generate all valid BPEs for it, including:

hello
he@@ llo
he@@ l@@ l@@ o
h@@ e@@ l@@ l@@ o
...

kopi22 commented 1 year ago

I think what you are really looking for is a way to generate all possible partitions of a word, that is, every way of splitting it into contiguous, non-empty pieces. It is a common algorithmic problem with many possible solutions (e.g. see here)
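
One possible way to enumerate those partitions is sketched below. It is not taken from subword-nmt: the function name `all_segmentations` and the `separator` parameter are made up for illustration, and the `@@` marker on non-final pieces simply mirrors the notation used in the question. A word of length n has n-1 gaps between characters, and each gap is either cut or not, so there are 2^(n-1) partitions in total.

```python
from itertools import combinations


def all_segmentations(word, separator="@@"):
    """Yield every split of `word` into contiguous, non-empty pieces.

    Hypothetical helper for illustration; not part of subword-nmt.
    """
    n = len(word)
    # Pick k cut positions among the n-1 gaps between characters.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            pieces = [word[i:j] for i, j in zip(bounds, bounds[1:])]
            # Mark every non-final piece with the separator, as in "he@@ llo".
            yield [p + separator for p in pieces[:-1]] + [pieces[-1]]


if __name__ == "__main__":
    for segmentation in all_segmentations("hello"):
        print(" ".join(segmentation))
```

The number of results doubles with every extra character, so this is only practical for individual words, not for long strings.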

rsennrich commented 1 year ago

You could use the solution suggested by @kopi22, and then filter out any results that contain substrings which are not valid subwords in your BPE vocabulary.
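
A minimal sketch of that filtering idea follows, with some assumptions stated up front. `valid_segmentations` is a hypothetical helper, not a subword-nmt function, and `vocab` is assumed to be a plain Python set of subwords written in the `@@` output notation (non-final pieces carry the `@@` suffix, the final piece does not). Instead of generating every partition first and filtering afterwards, this variant checks each piece against the vocabulary while recursing, so branches that start with an invalid subword are dropped early. The set of results should be the same as with generate-then-filter.

```python
def valid_segmentations(word, vocab, separator="@@"):
    """Yield the segmentations of `word` whose pieces are all in `vocab`.

    Hypothetical helper for illustration; `vocab` is assumed to be a set
    of subwords such as {"he@@", "llo"}.
    """
    if not word:
        return
    # The whole remaining string can serve as the final, unmarked piece.
    if word in vocab:
        yield [word]
    # Also try each proper prefix as a non-final piece.
    for i in range(1, len(word)):
        head = word[:i] + separator
        if head in vocab:
            for rest in valid_segmentations(word[i:], vocab, separator):
                yield [head] + rest


if __name__ == "__main__":
    # Toy vocabulary, made up for this example.
    toy_vocab = {"hello", "he@@", "llo", "h@@", "e@@", "l@@", "o"}
    for segmentation in valid_segmentations("hello", toy_vocab):
        print(" ".join(segmentation))
```

Note that membership in the vocabulary is the only criterion checked here; whether a given segmentation could actually be produced by applying the learned merge operations is a separate question that this sketch does not address.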

hlthu commented 1 year ago

Got it. Thx a lot.
