nfmcclure / tensorflow_cookbook

Code for Tensorflow Machine Learning Cookbook
https://www.packtpub.com/big-data-and-business-intelligence/tensorflow-machine-learning-cookbook-second-edition
MIT License

Since the first training is done, document embedding hasn't been trained #141

Closed: joonable closed this issue 6 years ago

joonable commented 6 years ago

Hello. I'm trying to use the doc2vec algorithm in 07_Natural_Language_Processing/07_Sentiment_Analysis_With_Doc2Vec/07_sentiment_with_doc2vec.py.

As I understand it, the first training phase trains the word and document embeddings, and the second one trains a text classifier for sentiment analysis. Because I needed distributed representations of words and documents rather than a classifier, I only ran the first phase.

After the training, I evaluated the vectors in the word and document embeddings using tf.train.Saver, and found that the document embeddings didn't change while the word embeddings did. The document embeddings just stayed at their initial values.
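
In outline, the check I ran looks like this (a rough sketch; `sess`, `doc_embeddings`, `train_step`, and `feed_dict` are the ones built by the recipe):

```python
import tensorflow as tf

# Snapshot the document embeddings before and after the first (embedding)
# training phase; everything referenced here comes from the recipe's graph.
sess.run(tf.global_variables_initializer())
doc_before = doc_embeddings.eval(sess)

for _ in range(10000):  # first training phase only, no classifier
    sess.run(train_step, feed_dict=feed_dict)

doc_after = doc_embeddings.eval(sess)
print('max abs change:', abs(doc_after - doc_before).max())
```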

Did I misunderstand the code or the doc2vec algorithm, or is there some kind of bug in the code? Thank you in advance for your answer.

nfmcclure commented 6 years ago

Hi @joonable, thanks for asking.

I just checked this briefly. I think I'll need more information, as I cannot replicate the problem. For example, if I run the code through the variable initialization, create a feed dictionary, and then run the following commands:

```
In[39]: sess.run(doc_embed, feed_dict=feed_dict)
Out[39]: 
array([[[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       [[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       [[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       ...,

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]],

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]],

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]]], dtype=float32)
In[40]: sess.run(train_step, feed_dict=feed_dict)
In[41]: sess.run(doc_embed, feed_dict=feed_dict)
Out[41]: 
array([[[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       [[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       [[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       ...,

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]],

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]],

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]]], dtype=float32)
```

This shows me that the doc_embed values are changing due to the training. Are you seeing something different? If so, make sure you have the most up-to-date code, and let me know your Python and TensorFlow versions.

I'll continue to troubleshoot with you if you see something different. I think the next step would be to fix a random seed for TensorFlow and NumPy and see what we can do, assuming we have the same versions of everything. For reference, I'm running Python 3.6 and TensorFlow v1.10.1.
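
For example, fixing the seeds would look something like this (the graph-level seed has to be set before the graph is built):

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)          # Python's built-in RNG (used by some batch generators)
np.random.seed(SEED)       # NumPy RNG (batch sampling, shuffles)
tf.set_random_seed(SEED)   # TensorFlow graph-level seed; set before building ops
```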

Thanks.

joonable commented 6 years ago

I checked it the way you showed, and there is indeed a difference after training. My apologies; I've added the code below.

```
In[32]: doc_origin = doc_embeddings.eval(sess)
In[33]: for i in range(5000): sess.run(train_step, feed_dict=feed_dict)
In[34]: doc_eval = doc_embeddings.eval(sess)
In[35]: doc_origin - doc_eval
Out[35]: 
array([[ 7.57873058e-05,  5.39273024e-05, -3.54051590e-05, ...,
         3.53846699e-05,  4.13656235e-05, -6.90221786e-05],
       [-1.03056431e-04,  3.75509262e-06,  6.04987144e-05, ...,
        -3.19242477e-04, -1.00910664e-04, -1.72302127e-04],
       [-1.60932541e-06,  3.51667404e-06, -8.94069672e-07, ...,
        -2.14576721e-06, -4.35113907e-06,  1.75833702e-06],
       ...,
       [-1.67489052e-04,  1.15454197e-04,  2.23517418e-05, ...,
        -2.02655792e-06, -8.34465027e-06,  5.33461571e-05],
       [-4.18424606e-05, -1.31400302e-05, -2.86102295e-05, ...,
         1.12056732e-05, -6.37024641e-06,  4.05311584e-06],
       [-1.25169754e-05, -2.87890434e-05,  1.23977661e-05, ...,
        -1.21593475e-05, -6.26444817e-05,  5.59091568e-05]], dtype=float32)
```

It's not really a troubleshooting issue, though; I have a problem to solve. I'm using doc2vec to cluster unlabelled documents, but as you can see, the difference is so small that the embeddings essentially stay at their random.uniform initial values. I trained them for what should be enough iterations, to the point where the per-step loss stops decreasing.
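
To put those numbers in perspective, here is a quick check of the update size relative to the initial values (using the arrays from the session above):

```python
import numpy as np

delta = doc_origin - doc_eval
print('mean |delta|   :', np.abs(delta).mean())       # on the order of 1e-5 here
print('mean |initial| :', np.abs(doc_origin).mean())  # ~0.5 for uniform(-1, 1) init
print('relative change:', np.abs(delta).mean() / np.abs(doc_origin).mean())
```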

Even after more than 200K iterations, the doc2vec embeddings haven't changed much.

```
In[36]: for i in range(200000): sess.run(train_step, feed_dict=feed_dict)
In[37]: doc_eval_200K = doc_embeddings.eval(sess)
In[38]: doc_origin
Out[38]: 
array([[-0.40346146, -0.22738123,  0.6981292 , ...,  0.02518272,
         0.6519067 ,  0.5756016 ],
       [-0.71823335,  0.9682684 , -0.47529078, ..., -0.44264603,
        -0.84275126,  0.1408112 ],
       [-0.91523314,  0.63673115,  0.33543396, ..., -0.635123  ,
         0.8932848 , -0.0469408 ],
       ...,
       [-0.95611143,  0.63165283,  0.20844555, ..., -0.95574784,
         0.803643  ,  0.8626468 ],
       [-0.87971663, -0.00883818,  0.8690052 , ..., -0.9107895 ,
         0.11327219,  0.52236867],
       [ 0.9117298 ,  0.5722585 ,  0.87356305, ..., -0.65226054,
        -0.31751704, -0.7709594 ]], dtype=float32)
In[39]: doc_eval_200K
Out[39]: 
array([[-0.40350893, -0.22704063,  0.6981595 , ...,  0.02526901,
         0.6520689 ,  0.57503295],
       [-0.71688116,  0.9653056 , -0.47172707, ..., -0.44319224,
        -0.83652633,  0.13944209],
       [-0.91519636,  0.6366936 ,  0.335434  , ..., -0.6351552 ,
         0.89335924, -0.04687748],
       ...,
       [-0.9555648 ,  0.6306714 ,  0.20914698, ..., -0.955652  ,
         0.8043847 ,  0.86161727],
       [-0.8796184 , -0.00869403,  0.8691123 , ..., -0.91070646,
         0.11326376,  0.52240765],
       [ 0.91199374,  0.57255834,  0.8732707 , ..., -0.65196055,
        -0.3172496 , -0.7709833 ]], dtype=float32)
```

When I use gensim, I can see a clear difference, but I need to use TensorFlow for my research so that I can modify the algorithm. Any advice you can give would definitely be helpful. Thank you.

nfmcclure commented 6 years ago

Hi @joonable, they do change very slowly, I agree. You can try a few things:
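
For instance, one common approach is to give the document embeddings their own, larger learning rate; a minimal sketch, assuming the recipe's `loss` and `doc_embeddings`:

```python
import tensorflow as tf

# Two optimizers sharing the same loss: the document embeddings get a much
# larger step size than the rest of the model. The learning rates here are
# values to tune, not values from the recipe.
base_opt = tf.train.GradientDescentOptimizer(learning_rate=0.001)
doc_opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

other_vars = [v for v in tf.trainable_variables() if v is not doc_embeddings]
train_step = tf.group(
    base_opt.minimize(loss, var_list=other_vars),
    doc_opt.minimize(loss, var_list=[doc_embeddings]))
```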

I hope that helps!