stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

How to use our own tokenizer? #49

Closed xiaoleihuang closed 7 years ago

xiaoleihuang commented 7 years ago

How can we use our own tokenizer? I did not find any parameter that supports plugging in a custom tokenizer. Twitter data, for example, may need a different tokenizer. Should I pass in an already-tokenized corpus? Thank you.

ghost commented 7 years ago

If your tokenizer is just a more specific version of ours, you could pre-parse your data into space-separated tokens. Does that work in your case?

Otherwise, I believe the token parsing starts here: https://github.com/stanfordnlp/GloVe/blob/master/src/vocab_count.c#L122
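To illustrate the pre-parsing suggestion above: a minimal sketch of writing a corpus out as space-separated tokens, one document per line, which the GloVe tools then read as-is. The `my_tokenizer` regex here is a hypothetical stand-in for your own (e.g. Twitter-aware) tokenizer; any function returning a list of token strings works.

```python
import re

def my_tokenizer(line):
    # Hypothetical example tokenizer: lowercase, keep #hashtags and
    # @mentions as single tokens, split on everything else.
    return re.findall(r"[#@]?\w+", line.lower())

def preparse(in_path, out_path):
    # One document per line, tokens joined by single spaces --
    # the whitespace-separated format the vocab/cooccurrence tools expect.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(my_tokenizer(line)) + "\n")
```

The resulting file can then be fed to the usual pipeline in place of a raw corpus (paths and flags here follow the repo's demo script, but check your local setup), e.g. `./vocab_count -min-count 5 < corpus_tokenized.txt > vocab.txt`.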
