Unrecognized Arabic Text in a Corpus and Understanding the Content of the model file

medallia / Word2VecJava

Word2Vec Java Port

MIT License


Unrecognized Arabic Text in a Corpus and Understanding the Content of the model file #17

Open ml-tn opened 9 years ago

ml-tn commented 9 years ago

Hi,

I am trying to use your implementation of Word2Vec to generate features for my text. My corpus is in Arabic. When I run Word2VecExamples on the file containing the sentences, none of the words are recognized; they are all displayed as sequences of "?". The generated model has the same issue:

  {"1":{"lst":["str",666,"","??","??","???","????","?",",","??","??","..","??","??"

First, how can I fix this problem? Second, how should I interpret the content of the generated model file?

Thanks for your help :)
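For readers hitting the same symptom: one common cause of "?" runs like the ones in the model excerpt above is text being encoded with a charset that cannot represent Arabic, in which case Java silently substitutes "?" for every unmappable character. Whether that is what happened in this particular corpus is an assumption; the sketch below only demonstrates the mechanism with the standard library. The class name QuestionMarkDemo and the helper roundTrip are hypothetical, and the word is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    /** Encodes text to bytes with the given charset, then decodes those bytes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The Arabic word for Ramadi (7 characters), written as Unicode escapes.
        String word = "\u0627\u0644\u0631\u0645\u0627\u062F\u064A";

        // US-ASCII cannot represent Arabic, so getBytes replaces each character with '?'.
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII)); // prints "???????"

        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // prints "true"
    }
}
```

If the characters are already "?" inside the model file, the damage happened before or during training, so the corpus loading step is the first place to look.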

guerda commented 9 years ago

Interesting problem. Could you please provide an Arabic text corpus file and a model file? This would help the debugging.

The generated model file is not meant to be read directly. It is a serialized representation of the trained vocabulary and of the vector values for every dimension. To use the model, call the provided methods. For some examples, look at the Word2VecExamples.java file in the source tree.

guerda commented 9 years ago

Hi @ml-tn !

I just pushed a test case in a branch (see above) which tests the training of an Arabic text. I used an example text from Wikipedia (it should be a snippet about Ramadi, but I don't know exactly). I trained the model via the following code:

  Word2VecModel.trainer()
      .setMinVocabFrequency(6)
      .useNumThreads(1)
      .setWindowSize(8)
      .type(NeuralNetworkType.CBOW)
      .useHierarchicalSoftmax()
      .setLayerSize(25)
      .setDownSamplingRate(1e-3)
      .setNumIterations(1);

Then I checked the trained vocab for the word الرمادي and it was found in the vocab. Also, the model returns some similar words to that. So it should work out without any problems with Arabic text. Is your text corpus encoded in UTF-8? Have you loaded it with UTF-8 compatible methods?
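As a rough illustration of what "UTF-8 compatible methods" can mean in practice, here is a minimal sketch that reads a corpus with an explicit UTF-8 charset instead of relying on the platform default. It uses only the Java standard library; the class name Utf8CorpusReader, the one-sentence-per-line layout, and the whitespace tokenization are assumptions for the example, not something taken from this project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Utf8CorpusReader {
    /** Reads one sentence per line, decoding explicitly as UTF-8, and splits each line on whitespace. */
    static List<List<String>> readSentences(Path corpus) throws IOException {
        List<List<String>> sentences = new ArrayList<>();
        // Passing the charset explicitly avoids depending on the JVM's default encoding.
        try (BufferedReader reader = Files.newBufferedReader(corpus, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) { // skip blank lines
                    sentences.add(Arrays.asList(trimmed.split("\\s+")));
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny two-word Arabic sample (as Unicode escapes) to a temp file, then read it back.
        Path corpus = Files.createTempFile("corpus", ".txt");
        String sample = "\u0645\u062F\u064A\u0646\u0629 \u0627\u0644\u0631\u0645\u0627\u062F\u064A\n";
        Files.write(corpus, sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readSentences(corpus).get(0).size()); // prints "2"
    }
}
```

A reader opened this way throws an exception on bytes that are not valid UTF-8 rather than silently substituting characters, which makes a wrongly encoded corpus easier to spot.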

Hronom commented 8 years ago

Hello, I get this problem when I save a trained model with Word2VecModel.toBinFile: after loading the bin file with Word2VecModel.fromBinFile, I get bad characters.

In my case the language is Russian.
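A note for anyone debugging something similar: "bad chars" after a save/load round trip are typically what you see when the bytes of a word are written with one charset and decoded with another. I have not verified how toBinFile and fromBinFile encode the vocabulary, so the sketch below is a generic illustration and not this library's file format. The class VocabRoundTrip, its length-prefixed layout, and both helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class VocabRoundTrip {
    /** Writes a word as a 4-byte length followed by its bytes in the given charset (hypothetical layout). */
    static byte[] writeWord(String word, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            byte[] bytes = word.getBytes(charset);
            out.writeInt(bytes.length);
            out.write(bytes);
        }
        return buffer.toByteArray();
    }

    /** Reads a length-prefixed word back, decoding its bytes with the given charset. */
    static String readWord(byte[] data, Charset charset) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, charset);
        }
    }

    public static void main(String[] args) throws IOException {
        // A six-letter Russian word, written as Unicode escapes.
        String word = "\u043F\u0440\u0438\u0432\u0435\u0442";
        byte[] saved = writeWord(word, StandardCharsets.UTF_8);

        // Same charset on both sides: the word survives.
        System.out.println(readWord(saved, StandardCharsets.UTF_8).equals(word)); // prints "true"

        // Mismatched charset on load: every UTF-8 byte becomes its own character.
        System.out.println(readWord(saved, StandardCharsets.ISO_8859_1).length()); // prints "12"
    }
}
```

The point of the sketch is only that the writer and the reader have to agree on one charset, and that UTF-8 is a safe choice for both Cyrillic and Arabic.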

eikdk commented 8 years ago

I had the exact same problem as Hronom until I merged all the different branches into my tree. My best guess is that the tree from wangyum solved the problem.

Eik

Hronom commented 8 years ago

@eikdk Yes, you are right. I have now merged the branch from this pull request, https://github.com/medallia/Word2VecJava/pull/34, and that solves the problem.

eikdk commented 8 years ago

@Hronom Nice to hear that the problem is solved.

tareqabufayad commented 8 years ago

Hello @guerda, I have tried many times to create an Arabic model using word2vec, but it doesn't work. Can you help me out? Thanks a lot.