Hello everyone,
I was trying to generate my own word embeddings from my corpus (60 million lines), but there is no documentation on how the space-separated input file should be generated (there is only one not very informative line that says to use Stanford's preprocessing library).
This lack of documentation is causing 2 problems:
1. The overlap between my text corpus's vocabulary and GloVe's vocabulary is only around 20%. I suspect this is because my own text-preprocessing method produces tokens that are very different from the ones in GloVe.
2. Since the preprocessing step is undocumented, I am not sure whether words on different lines are treated as being in the same context. In my corpus they clearly are not, which causes a lot of confusion.
I hope someone can help write the documentation for the preprocessing step.
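
For anyone who wants to check the vocabulary overlap mentioned above on their own data, here is a minimal sketch of how such a figure can be measured. It assumes the embedding file is plain text with one word per line followed by its space-separated vector values; the function names and the two tokenizers are hypothetical and only for illustration, not the preprocessing GloVe actually uses.

```python
from typing import Callable, Iterable, List, Set


def corpus_vocab(lines: Iterable[str],
                 tokenize: Callable[[str], List[str]] = str.split) -> Set[str]:
    """Collect the set of distinct tokens produced by `tokenize` over all lines."""
    vocab: Set[str] = set()
    for line in lines:
        vocab.update(tokenize(line))
    return vocab


def embedding_vocab(lines: Iterable[str]) -> Set[str]:
    """Collect the first space-separated field of each line.

    Assumption: each line looks like "<word> <v1> <v2> ...".
    """
    vocab: Set[str] = set()
    for line in lines:
        word = line.rstrip("\n").split(" ", 1)[0]
        if word:
            vocab.add(word)
    return vocab


def overlap_ratio(corpus: Set[str], embedding: Set[str]) -> float:
    """Fraction of the corpus vocabulary that also appears in the embedding vocabulary."""
    if not corpus:
        return 0.0
    return len(corpus & embedding) / len(corpus)


def lowercase_split(line: str) -> List[str]:
    """A hypothetical alternative tokenizer: lowercase, then split on whitespace."""
    return line.lower().split()


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the corpus and the vectors file.
    corpus_lines = ["The cat sat", "the dog sat"]
    vector_lines = ["the 0.1 0.2", "cat 0.3 0.4", "sat 0.5 0.6"]

    emb = embedding_vocab(vector_lines)
    print(overlap_ratio(corpus_vocab(corpus_lines), emb))
    print(overlap_ratio(corpus_vocab(corpus_lines, lowercase_split), emb))
```

Running the same comparison with two different tokenizers shows how strongly the overlap depends on the preprocessing choice, which is why a documented preprocessing step would help. For real files, the in-memory lists can be replaced with open file handles, since both functions only iterate over lines.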