stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

How to use our own tokenizer? #49

Closed xiaoleihuang closed 7 years ago

xiaoleihuang commented 7 years ago

How can we use our own tokenizer? I did not find any parameter that supports plugging in a custom tokenizer. Twitter data, for example, may need a different tokenizer. Should I pass in an already-tokenized corpus? Thank you.

ghost commented 7 years ago

If your tokenizer is just a more specific version of ours, you could pre-parse your data into space-separated tokens. Does that work in your case?

Otherwise, I believe the token parsing starts here: https://github.com/stanfordnlp/GloVe/blob/master/src/vocab_count.c#L122
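To illustrate the pre-parsing suggestion above: a minimal sketch of writing a corpus out as space-separated tokens, one document per line, which the GloVe tools then read as-is. The `my_tokenizer` regex here is a hypothetical stand-in for your own (e.g. Twitter-aware) tokenizer; any function returning a list of token strings works.

```python
import re

def my_tokenizer(line):
    # Hypothetical example tokenizer: lowercase, keep #hashtags and
    # @mentions as single tokens, split on everything else.
    return re.findall(r"[#@]?\w+", line.lower())

def preparse(in_path, out_path):
    # One document per line, tokens joined by single spaces --
    # the whitespace-separated format the vocab/cooccurrence tools expect.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(my_tokenizer(line)) + "\n")
```

The resulting file can then be fed to the usual pipeline in place of a raw corpus (paths and flags here follow the repo's demo script, but check your local setup), e.g. `./vocab_count -min-count 5 < corpus_tokenized.txt > vocab.txt`.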
