justhalf / bpe_analysis

Analysis of BPE on four languages: English, Indonesian, Chinese, Japanese

BPE Analysis on Four Languages

11 Apr 2019

Zhisong Zhang, Naoki Otani, Aldrian Obaja Muis

Project for the 11-821 Linguistics Seminar course at CMU, Spring 2019.

In this project, we analyze the segmentation behavior of byte pair encoding (BPE) from a linguistic perspective.

Scripts

To calculate the type counts for each language, run: bash calc_type_count.bash
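
For readers unfamiliar with the term, a "type" is a distinct word form, as opposed to a "token", which is a single occurrence. The following is only a minimal sketch of the idea, not the contents of calc_type_count.bash. It assumes whitespace-separated tokens (so Chinese and Japanese text would need to be word-segmented beforehand), and the function name count_types and the sample sentences are hypothetical.

```python
from collections import Counter


def count_types(lines):
    """Count occurrences of each whitespace-separated word form.

    The number of keys in the result is the type count; the sum of
    the values is the token count.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Hypothetical toy corpus, used only to illustrate the distinction.
sample = ["the cat sat on the mat", "the dog sat"]
counts = count_types(sample)
num_types = len(counts)
num_tokens = sum(counts.values())
print(num_types, num_tokens)
```

To apply the same function to a real corpus, pass it an open file handle, which yields one line at a time.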

To run BPE experiments on various vocabulary sizes, run: bash run_bpe.bash. This assumes you have installed the sentencepiece Python wrapper (pip install sentencepiece).
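
As a rough illustration of what training BPE models at several vocabulary sizes can look like with the sentencepiece Python wrapper, here is a hedged sketch. It is not a transcription of run_bpe.bash: the vocabulary sizes, the output directory, the naming scheme, and the function names are all assumptions made for the example. The keyword-argument form of SentencePieceTrainer.train requires a reasonably recent sentencepiece release.

```python
from pathlib import Path

# Hypothetical vocabulary sizes; the values actually used are set in run_bpe.bash.
VOCAB_SIZES = [1000, 4000, 16000]


def model_prefix(lang, vocab_size, out_dir="models"):
    """Build a model prefix such as 'models/en_bpe_4000' (naming is an assumption)."""
    return f"{out_dir}/{lang}_bpe_{vocab_size}"


def train_bpe_models(corpus, lang, vocab_sizes=VOCAB_SIZES, out_dir="models"):
    """Train one BPE model per vocabulary size on a one-sentence-per-line corpus."""
    # Imported here so the naming helper works without sentencepiece installed.
    import sentencepiece as spm

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    prefixes = []
    for size in vocab_sizes:
        prefix = model_prefix(lang, size, out_dir)
        # Writes <prefix>.model and <prefix>.vocab to disk.
        spm.SentencePieceTrainer.train(
            input=corpus,
            model_prefix=prefix,
            vocab_size=size,
            model_type="bpe",
        )
        prefixes.append(prefix)
    return prefixes


def segment(prefix, sentence):
    """Segment a sentence into subword pieces with a trained model."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=prefix + ".model")
    return sp.encode(sentence, out_type=str)
```

Segmenting the same sentence with models of different vocabulary sizes, for example by calling segment on each prefix returned by train_bpe_models, gives a direct way to compare how coarsely or finely each model splits words. Note that sentencepiece raises an error if the requested vocabulary size is too large for the corpus.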