rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Extending --glossaries to handle regex #56

Closed: Proyag closed this 6 years ago

Proyag commented 6 years ago

Extends the --glossaries argument of apply_bpe.py so that each glossary entry can be a regular expression.

For example:

    python apply_bpe.py -c codes_file -i input_text --glossaries "string1" "string2" "<tag>\w*</tag>" "\d+"

This ensures that string1, string2, words enclosed in <tag> </tag>, and numbers are not split by BPE and are isolated from neighboring subwords.
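To illustrate what "isolated from other subwords" could mean in practice, here is a minimal Python sketch of regex-based glossary isolation. It is not the code from this PR: the function name `split_on_glossaries` and the `(segment, is_protected)` return shape are hypothetical, and the sketch assumes that glossary patterns contain no capturing groups and never match the empty string.

```python
import re


def split_on_glossaries(word, glossaries):
    """Split `word` so that every glossary match becomes its own segment.

    Hypothetical helper, not part of subword-nmt. Returns a list of
    (segment, is_protected) pairs; protected segments would be passed
    through unchanged, the others would be segmented by BPE as usual.
    Assumes the patterns have no capturing groups and no empty matches.
    """
    # Wrap each pattern in a non-capturing group and join them into one
    # alternation. The single outer capturing group makes re.split keep
    # the matched text in its result.
    combined = "(" + "|".join("(?:{})".format(g) for g in glossaries) + ")"
    pieces = re.split(combined, word)
    # With exactly one capturing group, re.split alternates between
    # non-matching text (even indices) and matches (odd indices).
    # Empty strings appear when a match touches a word boundary; drop them.
    return [(piece, index % 2 == 1) for index, piece in enumerate(pieces) if piece]


if __name__ == "__main__":
    glossaries = ["string1", "string2", r"<tag>\w*</tag>", r"\d+"]
    for token in ["abc123def", "<tag>word</tag>", "hello"]:
        print(token, split_on_glossaries(token, glossaries))
```

With the glossaries from the example command, a token such as abc123def would be divided into abc, 123 and def, and only the middle segment would be marked as protected from further splitting.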

Helps with #49.

rsennrich commented 6 years ago

thanks, this looks good!