rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

How to avoid special char like '\t' being split by bpe #112

Closed rangehow closed 2 years ago

rangehow commented 2 years ago

Hi, I'm having trouble keeping '\t' from being split into '@@ \t@@ '.

I tried --glossaries after reading readme.md, like this:

python subword-nmt/apply_bpe.py --glossaries \t,\n -c wmt17_en_de/code <wmt17_en_de/tmp/constraint.en-de >wmt17_en_de/bpe.nocommer

but it doesn't seem to work. I hope you can help : )

rsennrich commented 2 years ago

Don't join multiple glossary entries with commas; pass them space-separated, each enclosed in quotation marks. It also seems that you need to escape the backslashes:

--glossaries "\\\t" "\\\n" works for me in bash.
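As a rough illustration of why the quoting matters, the sketch below uses Python's `shlex.split`, which approximates bash word splitting and unquoting, to compare the two command lines in this thread. It assumes that each glossary entry is later compiled as a regular expression. The names `original_attempt`, `suggested_fix`, `fixed`, and `tab_pattern` are made up for this example and are not part of subword-nmt.

```python
import re
import shlex

# Command-line fragments exactly as typed at the prompt. Raw strings keep
# Python from interpreting the backslashes before shlex sees them.
original_attempt = r'--glossaries \t,\n'
suggested_fix = r'--glossaries "\\\t" "\\\n"'

# shlex.split (POSIX mode) approximates how bash splits and unquotes words.
print(shlex.split(original_attempt)[1:])  # a single entry, backslashes removed
fixed = shlex.split(suggested_fix)[1:]
print(fixed)                              # two entries, each with two backslashes

# Assumption: each glossary entry ends up being used as a regular expression.
tab_pattern = re.compile(fixed[0])
print(bool(tab_pattern.fullmatch(r'\t')))  # the two characters backslash + t
print(bool(tab_pattern.fullmatch('\t')))   # an actual tab character
```

Under these assumptions, the unquoted form reaches the program as the single entry `t,n`, while the quoted form arrives as two patterns that each match a literal backslash followed by a letter, not a real tab or newline character.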

rangehow commented 2 years ago

I really appreciate it! That works for me too. Maybe this specific usage of --glossaries could be added to readme.md and the help text so that it helps more people. Thanks again for your timely reply!

rsennrich commented 2 years ago

I've now added an example to the Readme that shows how multiple glossary entries can be passed to subword-nmt.