vistec-AI / mt-opus

English-Thai Machine Translation with OPUS data

Build a SentencePiece BPE dictionary for Thai and English on the OpenSubtitles v2018 dataset. #1

Closed lalital closed 4 years ago

lalital commented 4 years ago

Due date:

Build a BPE dictionary for Thai and English on the OpenSubtitles v2018 dataset without text normalization. It will be trained with the SentencePiece trainer (unigram model).

For example:

  1. A word containing an uppercase character, such as "Charin", will not be normalized to its lowercase form (i.e. "charin").

  2. Digits will not be replaced with "0". For example, "2450" should not be converted to "0000" as BPEmb does. (see an example of Thai BPE by BPEmb)
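
A minimal sketch of how such a model might be trained with the SentencePiece Python API, assuming the `sentencepiece` package is installed. The file names `train.en` and `train.th`, the model prefix, and the helper names are hypothetical. Setting `normalization_rule_name="identity"` is one way to turn off the library's built-in character normalization so that case and digits are passed through unchanged; it is not necessarily the exact configuration used for this issue.

```python
def trainer_options(input_files, model_prefix, vocab_size=25000):
    """Collect keyword arguments for SentencePieceTrainer.train().

    Hypothetical helper: the option names are real SentencePiece
    trainer flags, but the values are assumptions based on the issue text.
    """
    return {
        # Comma-separated list of raw text files, one sentence per line.
        "input": ",".join(input_files),
        # Writes <model_prefix>.model and <model_prefix>.vocab.
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        # Unigram language model, as stated in the issue.
        "model_type": "unigram",
        # "identity" applies no character normalization
        # (the library default is "nmt_nfkc").
        "normalization_rule_name": "identity",
    }


def train_joint_vocab(input_files, model_prefix, vocab_size=25000):
    """Train one joint Thai-English model on the given files."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        **trainer_options(input_files, model_prefix, vocab_size)
    )


# Example call (file names are placeholders):
# train_joint_vocab(["train.en", "train.th"], "spm_joint_25k")
```

Passing both language files in a single `input` list is what makes the vocabulary joint: Thai and English pieces compete for the same 25k slots.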


  1. SentencePiece joint-BPE vocab (vocab size=25k)
.   -2.91401
▁   -3.61407
,   -3.63489
'   -3.67749
?   -3.94575
▁I  -4.01772
s   -4.09198
▁you    -4.27456
▁the    -4.45016
▁to -4.53639
▁a  -4.78158
t   -4.85039
!   -4.90617
▁it -5.17097
▁that   -5.28924
▁of -5.32769
... -5.36401
▁You    -5.43749
▁and    -5.4791
▁me -5.48047
▁in -5.51462
▁is -5.52096
m   -5.55835
re  -5.66259
▁for    -5.82898
▁this   -5.83481
▁have   -5.88394
▁know   -5.88886
▁we -5.91115
▁your   -5.91683
▁was    -5.92516
▁on -5.94044
▁be -5.94804
▁my -5.95526
▁What   -6.01306
▁not    -6.04481
ที่ -6.06585
▁do -6.06909
▁It -6.10684
▁can    -6.12583
▁are    -6.14591
▁with   -6.15366
▁don    -6.1591
▁he -6.16247
▁what   -6.16729
นะ  -6.2478
▁just   -6.27134
▁ฉัน    -6.28194
▁We -6.28892
▁คุณ    -6.29413
-   -6.30012
ll  -6.35364
ได้ -6.35809
▁here   -6.36225
▁And    -6.3672
▁No -6.37179
d   -6.39813
▁like   -6.40243
▁out    -6.41462
แล้ว    -6.44016
▁about  -6.44729
▁him    -6.47555
▁all    -6.48275
▁get    -6.49238
▁และ    -6.5073
▁her    -6.52256
▁He -6.52931
▁right  -6.54361
คุณ -6.55339
▁ไม่    -6.56311
▁The    -6.57471
ของ -6.57641
▁up -6.58294
ไป  -6.6062
ve  -6.61348
▁go -6.61592
มา  -6.6256
ฉัน -6.6536
▁"  -6.65902
▁That   -6.66391
▁so -6.66659
เลย -6.6769
▁there  -6.67703
ing -6.6805
▁at -6.6835
▁ผม -6.68542
▁one    -6.70543
▁want   -6.71774
▁Oh -6.71943
▁no -6.73413
▁think  -6.73962
▁Yeah   -6.74808
▁but    -6.76069
▁got    -6.79181
ed  -6.79689
▁she    -6.79781
▁But    -6.81544
▁ใช่    -6.81546
▁they   -6.83402
▁เธอ    -6.84031
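
The listing above follows the layout of a SentencePiece `.vocab` file: one piece per line, followed by its score. A small sketch of reading such lines, assuming the columns are tab-separated and the scores are natural-log probabilities under the unigram model. The helper name `parse_vocab` is hypothetical, and the sample rows are copied from the listing.

```python
import math


def parse_vocab(lines):
    """Map each piece to its score from 'piece<TAB>score' lines."""
    vocab = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the piece itself is kept intact.
        piece, score = line.rsplit("\t", 1)
        vocab[piece] = float(score)
    return vocab


# A few rows copied from the listing above.
sample = [
    "▁you\t-4.27456",
    "▁You\t-5.43749",
    "ที่\t-6.06585",
    "▁ฉัน\t-6.28194",
]

vocab = parse_vocab(sample)

# Difference of log scores = log of the probability ratio.
ratio = math.exp(vocab["▁you"] - vocab["▁You"])
print(f"'▁you' is about {ratio:.1f}x as probable as '▁You'")
```

Because no lowercasing is applied, `▁you` and `▁You` are kept as separate pieces with separate scores.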


The token ▁You represents "You" at the start of a word, which is typically the first word of a sentence. The token ▁you, on the other hand, represents "you" preceded by a space.
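
To illustrate the ▁ marker, here is a rough sketch that mimics how SentencePiece escapes whitespace before segmentation. It is not the library's implementation: the function names are hypothetical, and it assumes the default behavior of prepending a dummy space to the input, which is why the first word of a sentence also receives the marker.

```python
WS = "\u2581"  # "▁", the symbol SentencePiece uses for whitespace


def escape_whitespace(text, add_dummy_prefix=True):
    """Replace spaces with the ▁ marker, optionally prefixing one."""
    if add_dummy_prefix:
        # Treat the start of the text like the start of a word.
        text = " " + text
    return text.replace(" ", WS)


def is_word_initial(piece):
    """True if the piece begins a new whitespace-delimited word."""
    return piece.startswith(WS)


escaped = escape_whitespace("You know you")
print(escaped)

for piece in ["▁You", "▁you", "ing", "ll"]:
    print(piece, is_word_initial(piece))
```

Under this view, ▁You and ▁you both start a word; they stay distinct only because the text is not lowercased, and pieces such as `ing` or `ll` without the marker continue a word.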