rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License
2.19k stars, 465 forks

Intra-word boundary marker #83

Closed by Darenar 4 years ago

Darenar commented 4 years ago

Hello,

I want to train a vocabulary on a custom text corpus and later add it to the pre-trained BERT vocabulary. The problem is that the pre-trained vocabulary places its intra-word boundary marker ## at the beginning of each continuing subword, for example:

At the same time, when I train a new vocabulary, I get a word tokenised as:

Unfortunately, I could not find a way to do this with subword-nmt. Maybe I am missing something?

rsennrich commented 4 years ago

Hi Darenar,

just use a string replacement after applying BPE, e.g.

sed "s/@@ / ##/g"

There are some technical reasons why I'm reluctant to support prepended segmentation markers out of the box (related to https://github.com/rsennrich/subword-nmt/issues/19).
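
For cases where the segmented text is already being handled in a script, the same replacement can be sketched in Python. This is only an illustration of the sed one-liner above: the function name `to_prefix_markers` and the sample sentence are hypothetical and not part of subword-nmt, and the sketch assumes the default `@@` separator followed by a single space.

```python
def to_prefix_markers(line: str, bpe_sep: str = "@@", prefix: str = "##") -> str:
    """Turn a trailing BPE separator into a leading marker on the next subword.

    Hypothetical helper mirroring `sed "s/@@ / ##/g"`; it assumes each
    non-final subword ends with `bpe_sep` followed by exactly one space.
    """
    return line.replace(bpe_sep + " ", " " + prefix)


if __name__ == "__main__":
    # Sample segmented text (made up for illustration).
    segmented = "un@@ relat@@ ed words"
    print(to_prefix_markers(segmented))
```

Words that were not split contain no separator and pass through unchanged. A separator at the very end of a line (with no following space) would not be rewritten, just as with the sed command.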