Open ml-tn opened 9 years ago
Interesting problem. Could you please provide an Arabic text corpus file and a model file? That would help with debugging.
The generated model file is not meant to be read directly. It is a representation of all dimensions and the trained vocabulary. To "use" the model, just use the provided methods. For some examples, you can look into the Word2VecExamples.java file within the source tree.
Hi @ml-tn !
I just pushed a test case in a branch (see above) which tests the training of an Arabic text. I used an example text from Wikipedia (it should be a snippet about Ramadi, but I don't know exactly). I trained the model via the following code:
Word2VecModel.trainer()
    .setMinVocabFrequency(6)
    .useNumThreads(1)
    .setWindowSize(8)
    .type(NeuralNetworkType.CBOW)
    .useHierarchicalSoftmax()
    .setLayerSize(25)
    .setDownSamplingRate(1e-3)
    .setNumIterations(1);
Then I checked the trained vocab for the word الرمادي and it was found in the vocab. Also, the model returns some similar words to that. So it should work out without any problems with Arabic text. Is your text corpus encoded in UTF-8? Have you loaded it with UTF-8 compatible methods?
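For reference, here is a minimal sketch of what "loading with UTF-8 compatible methods" could look like in plain Java. It assumes one sentence per line and whitespace tokenization; the class name Utf8CorpusReader and the method readSentences are made up for this illustration and are not part of Word2VecJava. The key point is that the charset is passed explicitly instead of relying on the platform default.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Word2VecJava.
public class Utf8CorpusReader {

    /** Reads a corpus (one sentence per line) and decodes the bytes explicitly as UTF-8. */
    public static List<List<String>> readSentences(Path corpus) throws IOException {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : Files.readAllLines(corpus, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                // Assumption: tokens are separated by whitespace.
                sentences.add(Arrays.asList(trimmed.split("\\s+")));
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file independent of the compiler's encoding.
        String ramadi = "\u0627\u0644\u0631\u0645\u0627\u062F\u064A";
        Path corpus = Files.createTempFile("corpus", ".txt");
        Files.write(corpus, (ramadi + " test\n").getBytes(StandardCharsets.UTF_8));

        List<List<String>> sentences = readSentences(corpus);
        // True when the Arabic token survived the write/read round trip intact.
        System.out.println(sentences.get(0).contains(ramadi));
    }
}
```

The resulting list of token lists can then be handed to the trainer. If the words still show up as "?" on the console, the terminal or the JVM default encoding may be the culprit rather than the corpus itself.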
Hello, I get this problem when I save a trained model using Word2VecModel.toBinFile: after loading the bin file again (Word2VecModel.fromBinFile), I get garbled characters.
In my case the language is Russian.
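To illustrate how this kind of garbling can arise in general, here is a small sketch. It only assumes that the bytes are written with one charset and read back with another; it does not claim that this is what toBinFile/fromBinFile actually do. The class CharsetRoundTrip and its method roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Word2VecJava.
public class CharsetRoundTrip {

    /** Encodes the word as UTF-8 bytes, then decodes those bytes with the given charset. */
    public static String roundTrip(String word, Charset readCharset) {
        byte[] stored = word.getBytes(StandardCharsets.UTF_8);
        return new String(stored, readCharset);
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with Unicode escapes.
        String word = "\u043F\u0440\u0438\u0432\u0435\u0442";

        // Same charset on both sides: the word comes back unchanged.
        System.out.println(word.equals(roundTrip(word, StandardCharsets.UTF_8)));

        // Mismatched charset: every 2-byte Cyrillic letter turns into two wrong characters.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1).length());
    }
}
```

If the vocabulary looks fine right after training but breaks after a save/load cycle, a mismatch like this somewhere in the write or read path is one plausible cause to check.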
I had the exact same problem as Hronom, until I merged all the different branches into my tree. My best guess is that the tree from wangyum solved the problem.
Eik
@eikdk yes, you are right. I have just merged the branch from this pull request https://github.com/medallia/Word2VecJava/pull/34 and it solves the problem.
@Hronom Nice to hear that the problem is solved.
Hello @guerda, I have tried many times to create an Arabic model using word2vec, but it doesn't work. Can you help me out? Thanks a lot.
Hi,
I am trying to use your implementation of Word2Vec to generate features for my text. My corpus is in Arabic. When I run Word2VecExamples on the file containing the sentences, none of the words are recognized; they are all displayed as sequences of "?". Even in the generated model, I get the same issue:
First, how can I fix this problem? Second, how do I interpret the content of the generated model file?
Thanks for your help :)