Intra-word boundary marker

Hello,

I want to train a vocabulary on a custom text corpus and later add it to the pre-trained BERT vocabulary. The problem is that the pre-trained vocabulary puts its intra-word boundary marker ## at the beginning of each continuing subword, for example:

bird -> [bi, ##rd]

However, when I train a new vocabulary, the same word is tokenised with the marker at the end of the preceding subword:

bird -> [bi##, rd]

Unfortunately, I could not find a way to get the BERT-style (prefix) marker with subword-nmt. Maybe I am missing something?
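
To make the difference concrete, here is a minimal sketch of a possible workaround: post-processing the segmented tokens so that a trailing marker moves to the start of the following subword. This is not a subword-nmt feature; the function name `suffix_to_prefix` and its `marker` parameter are made up for illustration, and the sketch assumes the tokens were produced with ## as the separator and that no real token ends in ## by coincidence.

```python
def suffix_to_prefix(tokens, marker="##"):
    """Convert suffix-marked subwords (bi##, rd) to prefix-marked ones (bi, ##rd).

    Hypothetical helper, not part of subword-nmt or BERT.
    """
    converted = []
    continues = False  # True when the previous token ended with the marker
    for token in tokens:
        has_suffix = token.endswith(marker)
        # Strip the trailing marker, if present, to get the bare subword.
        core = token[:-len(marker)] if has_suffix else token
        # Prefix the marker only when this token continues the previous one.
        converted.append(marker + core if continues else core)
        continues = has_suffix
    return converted


print(suffix_to_prefix(["bi##", "rd"]))  # ['bi', '##rd']
```

Since the two conventions mark different positions (non-final versus non-initial subwords), a conversion like this would have to run on the segmented text, with the new vocabulary entries collected from its output, rather than on the vocabulary file entry by entry.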

rsennrich / subword-nmt

Intra-word boundary marker #83