stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.86k stars 1.51k forks

\0 inside "big data" causes problem in vocab #34

Closed chain closed 7 years ago

chain commented 8 years ago

When concatenating data from different sources, some garbage can end up in the corpus. If that garbage contains \0 bytes, they are included in the vocab. As a result, vocab.txt ends up with one row that has only a single column of data, which causes the rest of the process to fail.
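One way to confirm the symptom is to scan the vocab file for rows that do not look like the others. The following is only a minimal sketch: it assumes each healthy row of vocab.txt is a word and an integer count separated by a single space, and the helper name `find_malformed_vocab_rows` is hypothetical, not part of GloVe.

```python
def find_malformed_vocab_rows(lines):
    """Return (line_number, raw_line) pairs that are not of the form '<word> <count>'.

    Assumption: a healthy row has exactly two space-separated fields,
    a non-empty word followed by an integer count.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        raw = line.rstrip("\n")
        fields = raw.split(" ")
        if len(fields) != 2 or not fields[0] or not fields[1].isdigit():
            bad.append((number, raw))
    return bad


# Made-up sample rows; the third one has only a single column.
sample = ["the 1061396\n", "of 593677\n", "42\n", "and 416629\n"]
print(find_malformed_vocab_rows(sample))
```

Against a real file, the same function could be called on an open file object, for example `find_malformed_vocab_rows(open("vocab.txt", encoding="utf-8", errors="replace"))`.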

Maybe add a check for \0 bytes and discard them.
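Until such a check exists, the corpus could be cleaned before it is handed to the vocab step. This is a rough sketch of that idea as a separate preprocessing pass, not the fix proposed for the C code; the function name `strip_nul_bytes` and the chunk size are assumptions made for illustration.

```python
import io


def strip_nul_bytes(src, dst, chunk_size=1 << 20):
    """Copy the binary stream src to dst, dropping every NUL byte.

    Returns the number of NUL bytes that were dropped.
    """
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        cleaned = chunk.replace(b"\x00", b"")
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed


# In-memory demonstration with made-up data.
source = io.BytesIO(b"some\x00 text with\x00\x00 garbage\n")
target = io.BytesIO()
dropped = strip_nul_bytes(source, target)
print(dropped, target.getvalue())
```

For a real corpus, both streams would be files opened in binary mode (`"rb"` and `"wb"`). Note that dropping the byte joins whatever was on either side of it; replacing it with a space instead would keep the neighbouring tokens separate, which may or may not be what you want.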

ghost commented 8 years ago

Okay, sounds like a good find. If you're feeling generous, it would be great to get the solution as a pull request!

chain commented 8 years ago

Ok. I will do it as soon as possible.

ghost commented 8 years ago

Can you describe this problem in a little more detail so that we can look into it?

ghost commented 7 years ago

Please reopen this if you can describe the problem.