Closed divyeshrajpura4114 closed 3 months ago
In general, it is not possible to define a constraint that prevents a token from being split. For instance, we cannot force all numeric characters (e.g., 0-9) to merge, because such a merge rule would produce an unbounded number of tokens after training. Can this phonetic merge rule generate infinite combinations of substrings?
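A small sketch of the concern about digits, assuming a hypothetical rule that keeps every run of digits whole. The helper `merged_digit_tokens` is made up for illustration and is not part of any tokenizer library; it only counts how many distinct tokens such a rule would produce.

```python
import re


def merged_digit_tokens(texts):
    """Return the distinct tokens produced if every run of digits stays unsplit."""
    return {m.group() for text in texts for m in re.finditer(r"[0-9]+", text)}


# Hypothetical corpora: the numbers 0..99 and 0..9999 written as strings.
small_corpus = [str(n) for n in range(100)]
large_corpus = [str(n) for n in range(10000)]

print(len(merged_digit_tokens(small_corpus)))  # 100
print(len(merged_digit_tokens(large_corpus)))  # 10000
```

The number of distinct tokens grows with the number of distinct numbers in the data, so there is no fixed vocabulary size that can cover them all.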
Will close this issue on 5/31.
Sure. I have figured out another workaround, and it seems to be working fine as of now. Thanks!
Hi,
Is there any way to define a set of sub-words that are never split but are still considered for token generation? This is especially needed for phonetically rich languages like Hindi.
Ex: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura). In the above example, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never be split and should each be treated as a single unit when generating BPE tokens.
Thanks & Regards, Divyesh Rajpura
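For illustration, the units listed in the example above can be approximated with a simple rule: a character joins the previous unit if it is a combining mark (vowel sign, anusvara, virama) or if it directly follows a virama. The sketch below uses only Python's standard `unicodedata` module. The function name `split_aksharas` is hypothetical, the rule is an assumption rather than a feature of any tokenizer, and it does not handle every case (for example, zero-width joiners).

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama, which joins consonants into a conjunct


def split_aksharas(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    units = []
    for ch in word:
        # Vowel signs, anusvara, nukta and virama are combining marks (Mn or Mc).
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A consonant right after a virama belongs to the same conjunct.
        after_virama = bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or after_virama):
            units[-1] += ch
        else:
            units.append(ch)
    return units


sentence = "मैं दिव्येश राजपुरा हूं"
for word in sentence.split():
    print(split_aksharas(word))
# ['मैं']
# ['दि', 'व्ये', 'श']
# ['रा', 'ज', 'पु', 'रा']
# ['हूं']
```

Units found this way could then be treated as the smallest symbols during preprocessing, so that a later merge step never cuts through them. Whether a given tokenizer can take such pre-segmented input is not something this sketch assumes.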