google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Tokenization for phonetic languages #1009

Closed divyeshrajpura4114 closed 3 months ago

divyeshrajpura4114 commented 4 months ago

Hi,

Is there any way to define a set of sub-words that should never be split, but are still considered as units during token generation? This is especially important for phonetically rich languages like Hindi.

Ex: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura)

In the above example, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never be split and should each be treated as a single unit when generating BPE tokens.
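One practical route (a sketch by way of illustration, not a built-in SentencePiece feature) is to segment Devanagari text into akshara-level units yourself before training, so that units like दि and व्ये arrive pre-joined. A minimal regex-based segmenter, assuming simplified Unicode ranges (a production segmenter should follow the full UAX #29 grapheme-cluster rules):

```python
import re

# The Unicode ranges below are simplifying assumptions for this sketch.
CONS = r"[\u0915-\u0939\u0958-\u095F]"   # Devanagari consonants
MATRA = r"[\u093E-\u094C]"               # dependent vowel signs
VIRAMA = r"\u094D"                       # halant, joins consonant clusters
NASAL = r"[\u0901\u0902]"                # candrabindu / anusvara
VOWEL = r"[\u0904-\u0914]"               # independent vowels

# One akshara: a consonant cluster (consonants joined by virama) with an
# optional vowel sign, or an independent vowel, optionally nasalized.
# Anything else (spaces, punctuation) falls through one character at a time.
AKSHARA = re.compile(
    f"(?:{CONS}(?:{VIRAMA}{CONS})*{MATRA}?|{VOWEL}){NASAL}?|.",
    re.DOTALL,
)

def aksharas(text):
    """Split Devanagari text into akshara-level units."""
    return AKSHARA.findall(text)

print(aksharas("दिव्येश"))  # ['दि', 'व्ये', 'श']
```

The resulting units could then be joined with a separator the model never sees elsewhere, or passed to the trainer as a pre-tokenized corpus.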

Thanks & Regards, Divyesh Rajpura

taku910 commented 4 months ago

In general, it is not possible to define a constraint that prevents a token from being split. For instance, we cannot merge all numeric characters (e.g., 0-9): with such a merge rule, we would see an infinite number of tokens after training. Can this phonetic merge rule generate infinite combinations of substrings?
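For a finite, closed list of units, SentencePiece's `--user_defined_symbols` training option does keep each listed string as a single token. That does not contradict the point above about unbounded merge sets, but for a fixed inventory of phonetic units it may suffice. A sketch of the training invocation (the input file, model prefix, and vocab size here are placeholders):

```shell
spm_train --input=hindi.txt --model_prefix=hindi_bpe --model_type=bpe \
  --vocab_size=8000 \
  --user_defined_symbols=मैं,दि,व्ये,पु,रा,हूं
```

Note that user-defined symbols are reserved in the vocabulary, so a long list of them reduces the number of slots available for learned merges.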

taku910 commented 3 months ago

Will close this issue on 5/31.

divyeshrajpura4114 commented 3 months ago

Sure. I have figured out another workaround, and it seems to be working fine as of now. Thanks!