suhaibani / JointReps

Learning word representation jointly using a corpus and a knowledge base (KB)
MIT License

Problem with reproducing results #1

Closed. svjan5 closed this issue 5 years ago.

svjan5 commented 5 years ago

Hi, I am facing some issues while using JointReps on a new corpus. I used the GloVe code to generate the co-occurrence matrix and then used the provided code directly. The obtained embeddings perform quite poorly on intrinsic tasks such as word similarity and analogy. Also, the code uses a single thread, which makes it very slow compared to word2vec and GloVe. Is there any mistake we are making in using the provided code?

Thanks in advance

suhaibani commented 5 years ago

Hi.

What did you use as a lexicon? Have you used one of the relations provided in this repository? And how many iterations did you run? In this experiment, it takes under 50 minutes for 20 iterations to learn 300-dimensional word representations for 434,826 words from the ukWaC corpus on a Xeon 2.9GHz 32-core 512GB RAM machine. You could try the ukWaC edges provided here. You may also try initializing the vectors with pre-trained embeddings.

svjan5 commented 5 years ago

Hi, I used synonyms as the lexicon while training the word embeddings. Training was done for 20 epochs (the default configuration). Even when using the ukWaC co-occurrence counts provided above, I am getting poor results on intrinsic tasks (compared to GloVe and word2vec).

I am using Eigen version 3.3.5 and Boost version 1.68. The code doesn't utilize multiple cores while training and takes considerably longer than 50 minutes to train on the ukWaC edges. This is the command I am using to start training:

./reps --dim=50 --epohs=20 --model=../work/model_wack --alpha=0.01 --lmda=10000 --edges=../ukWAC_edges --pairs=../work/synonyms

Please let me know if there is any mistake in the procedure.

suhaibani commented 5 years ago

Hi,

When you say "compared to GloVe", are you using the same experimental settings (i.e. the same corpus, the same hyperparameters, etc.)? For example, the results reported in the paper are with 300 dimensions, while I see you're using only 50. Also, have you checked whether it converged to a solution within 20 epochs (this is printed out)? I would suggest starting to debug the issue by setting lambda to zero (i.e. training using only the corpus, without the constraints from the KB, so Equation 3 is ignored). That should give results similar to GloVe, since Equation 2 is basically the GloVe objective.
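For readers following along, here is a rough sketch of the objective being discussed, reconstructed from this thread rather than copied from the paper (in particular, the exact form of the KB term below is an assumption). J_C is the GloVe-style corpus objective (Equation 2), J_S is the knowledge-base regularizer (Equation 3), and lambda is the weight set by the --lmda flag; with lambda = 0 only the GloVe-style term remains.

```latex
% Sketch of the joint objective discussed above (hedged reconstruction).
% J_C is the corpus (GloVe-style) term, Eq. 2; J_S is the KB term, Eq. 3.
J(\Theta) = J_C(\Theta) + \lambda \, J_S(\Theta)

% The corpus term is the standard GloVe weighted least-squares objective:
J_C(\Theta) = \sum_{i,j} f(X_{ij})
  \left( \mathbf{w}_i^{\top} \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}

% Assumed form of the KB term: a penalty pulling word pairs that are
% related in the lexicon S (e.g. synonyms) closer together.
J_S(\Theta) = \sum_{(i,j) \in S} \left\| \mathbf{w}_i - \mathbf{w}_j \right\|^{2}
```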

svjan5 commented 5 years ago

Hi, Initializing embeddings with GloVe gave me the reported performance. Thanks a lot for your help.

suhaibani commented 5 years ago

Initialising with pre-trained embeddings should only help it converge faster; random initialization should produce similar results. You're very welcome.

jackyuanjie1990 commented 5 years ago

Hi,

I have a problem when I try to use your code on a new corpus. I used the GloVe code to generate a cooccurrence.bin file, and then I wanted to use your code directly. However, I found that your code can't use cooccurrence.bin as input (I'm not familiar with C++, so I may have made some mistakes). If possible, could you share a tool to generate a co-occurrence matrix that matches the input format expected by your code?

Thanks, Jack

sanjanasri commented 5 years ago

Hi,

The GloVe cooccurrence.bin file contains vocabulary indices rather than the actual word strings, right? How do you get word-word string pairs from it? It would be great if you could send me a reply.
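For anyone hitting the same format question, below is a minimal, hypothetical conversion sketch (not part of this repository). It assumes GloVe's cooccur tool writes binary records of the form { int word1; int word2; double val; } with 1-based indices into vocab.txt, and that JointReps accepts plain-text "word1 word2 count" lines as its co-occurrence/edges input; please verify both assumptions against your GloVe build and the provided ukWAC_edges file before relying on it.

```cpp
// conv_cooccur.cpp -- hypothetical helper, not part of JointReps.
// Reads GloVe's vocab.txt and binary co-occurrence file and writes
// plain-text "word1 word2 count" lines.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Assumed layout of one GloVe co-occurrence record (verify against
// your GloVe build; indices are assumed to be 1-based).
struct CRec { int word1; int word2; double val; };

int main(int argc, char **argv) {
    if (argc != 4) {
        std::cerr << "usage: " << argv[0]
                  << " vocab.txt cooccurrence.bin edges.txt\n";
        return 1;
    }

    // Load the vocabulary: one "word count" pair per line, in id order.
    std::vector<std::string> vocab;
    std::ifstream vf(argv[1]);
    std::string line;
    while (std::getline(vf, line)) {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word) vocab.push_back(word);
    }

    // Stream the binary records and emit word-word-count text lines.
    std::ifstream cf(argv[2], std::ios::binary);
    std::ofstream out(argv[3]);
    CRec rec;
    while (cf.read(reinterpret_cast<char *>(&rec), sizeof(rec))) {
        // Skip anything outside the vocabulary range.
        if (rec.word1 < 1 || rec.word2 < 1 ||
            rec.word1 > static_cast<int>(vocab.size()) ||
            rec.word2 > static_cast<int>(vocab.size()))
            continue;
        out << vocab[rec.word1 - 1] << ' '
            << vocab[rec.word2 - 1] << ' ' << rec.val << '\n';
    }
    return 0;
}
```

Compile with, for example, g++ -O2 -o conv_cooccur conv_cooccur.cpp and run it as ./conv_cooccur vocab.txt cooccurrence.bin edges.txt, then check the first few output lines against the format of the provided ukWAC_edges file.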