Closed xiaoleihuang closed 7 years ago
If your tokenizer is just a more specific version of ours, you could pre-parse your data into space-separated tokens. Does that work in your case?
Otherwise, I believe the token parsing starts here: https://github.com/stanfordnlp/GloVe/blob/master/src/vocab_count.c#L122
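To illustrate the pre-parsing approach: a minimal sketch, assuming your corpus is one document per line and using a crude regex tokenizer as a hypothetical stand-in for your own Twitter-aware tokenizer. GloVe's `vocab_count` and `cooccur` split the input on whitespace, so writing one space-joined line per document is all that is needed.

```python
import re

# Hypothetical stand-in for your own tokenizer: a crude regex that keeps
# hashtags and @-mentions attached to the word (swap in your real tokenizer).
TOKEN_RE = re.compile(r"[#@]?\w+|[^\w\s]")

def tokenize(text):
    """Lowercase and split a raw line into tokens."""
    return TOKEN_RE.findall(text.lower())

def preparse(in_path, out_path):
    """Rewrite a corpus as space-separated tokens, one document per line,
    so GloVe's tools can consume it without any source changes."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(tokenize(line)) + "\n")
```

The pre-parsed file can then be fed directly to the unmodified GloVe pipeline in place of the raw corpus.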
How can we use our own tokenizer? I did not find any parameters that support a custom tokenizer. A corpus like Twitter might need a different tokenizer. Should I pass in a pre-tokenized corpus? Thank you.