stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Document Preprocessing Step in src/README.md #144

Open Ayush-iitkgp opened 5 years ago

Ayush-iitkgp commented 5 years ago

Hello everyone, I am trying to train my own word embeddings on my corpus (60 million lines), but I could not find any documentation on how the space-separated input file should be generated. The only hint is a not very informative line that says to use Stanford's preprocessing library. This lack of documentation is causing two problems:

  1. The overlap between my corpus's vocabulary and GloVe's is only around 20%. I suspect this is because my own text-preprocessing method produces tokens that are very different from the ones in GloVe.
  2. Since the preprocessing step is undocumented, I am not sure whether words on different lines are treated as part of the same context, even though they clearly do not belong together. This causes a lot of confusion.
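
For anyone who wants to reproduce the overlap figure from point 1, here is a minimal sketch. It assumes the reference vocabulary comes from a pretrained vector file in the usual text layout (one word per line, followed by space-separated numbers), and that overlap is measured relative to the corpus vocabulary. The function names are hypothetical and not part of the GloVe tools.

```python
def vocab_from_vectors(lines):
    """Collect the first space-separated field of each line as a word.

    Assumes the text vector layout: "<word> <v1> <v2> ...".
    """
    vocab = set()
    for line in lines:
        word = line.rstrip("\n").split(" ", 1)[0]
        if word:  # skip blank lines
            vocab.add(word)
    return vocab


def corpus_vocab(lines):
    """Collect every whitespace-separated token in the corpus lines."""
    vocab = set()
    for line in lines:
        vocab.update(line.split())
    return vocab


def overlap_ratio(corpus_words, reference_words):
    """Fraction of corpus word types that also occur in the reference set."""
    if not corpus_words:
        return 0.0
    return len(corpus_words & reference_words) / len(corpus_words)
```

Both helper functions accept any iterable of lines, so an open file handle can be passed directly. Running the comparison with and without lowercasing the corpus is a quick way to check whether casing alone explains a low overlap.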

I hope someone can help write documentation for the preprocessing step.
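
To make the request concrete, here is a rough sketch of the kind of preprocessing the documentation could describe. It is not Stanford's tokenizer: the regex, the lowercasing default, and the function names are all assumptions for illustration. It also assumes that the tools treat a newline as a document boundary, so that co-occurrence windows do not extend across lines; this should be verified against src/README.md.

```python
import re

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text, lowercase=True):
    """Split text into word and punctuation tokens."""
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)


def to_corpus_line(document, lowercase=True):
    """Flatten one document into a single line of space-separated tokens.

    Newlines and tabs inside the document are dropped, so each document
    occupies exactly one line of the output file.
    """
    return " ".join(tokenize(document, lowercase))


def write_corpus(documents, out_path):
    """Write one document per line, skipping documents with no tokens."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in documents:
            line = to_corpus_line(doc)
            if line:
                out.write(line + "\n")
```

With this layout, the choice of what counts as a "document" (a sentence, a paragraph, or a whole article) decides which words can share a context, which is exactly the question in point 2.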