Closed: Darenar closed this issue 4 years ago
Hi Darenar,
Just use a string replacement after applying BPE, e.g.:
sed "s/@@ / ##/g"
There are some technical reasons why I'm reluctant to support prepended segmentation markers out of the box (related to https://github.com/rsennrich/subword-nmt/issues/19).
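For anyone doing this step inside a Python pipeline rather than in the shell, here is a minimal sketch of the same replacement. It assumes the text was segmented with the default `@@` separator and that a continuation piece always follows `@@ ` on the same line; the function name `to_prefix_markers` is made up for illustration and is not part of subword-nmt.

```python
def to_prefix_markers(bpe_line: str, suffix: str = "@@", prefix: str = "##") -> str:
    """Move the marker from the end of a subword to the start of the next one.

    Mirrors `sed "s/@@ / ##/g"`: every "<suffix><space>" becomes "<space><prefix>".
    A trailing suffix with no following space is left untouched, as with sed.
    """
    return bpe_line.replace(suffix + " ", " " + prefix)


if __name__ == "__main__":
    import sys

    # Read BPE-segmented text from stdin and write the converted text to stdout.
    for line in sys.stdin:
        sys.stdout.write(to_prefix_markers(line))
```

Like the sed command, this only rewrites the segmented text. Any vocabulary file you build for BERT would need to be extracted from the converted output.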
Hello,
I want to train a vocabulary on custom text corpora and later add it to the pre-trained BERT vocabulary. The problem is that the pre-trained vocabulary places its intra-word boundary marker ## at the beginning of each continuing subword, for example:
At the same time, when I train a new vocabulary, I get a word tokenised as:
Unfortunately, I could not find a way to do this with subword-nmt. Maybe I am missing something?