guillaume-chevalier / GloVe-as-a-TensorFlow-Embedding-Layer

Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.
https://www.neuraxio.com/
MIT License

Loading Tensor #2

Open suatfk opened 5 years ago

suatfk commented 5 years ago

Thanks for the great work; it inspired me a lot.

Here is the story.

I'm currently working with the tensorflow.datasets.imdb dataset. I decided to use GloVe word embeddings (300d) in my toy project, but the IMDB dataset contains only word indexes like [1, 3, 515, ...], not the words themselves; it basically comes with its own internal word index.

So I decided to convert these indexes to GloVe word-embedding indexes in order to use the embeddings. Here is the conversion, which I tried to implement entirely in TensorFlow for learning purposes:

imdb_dataset -> imdb_index_to_word_dict -> glove_word_to_index -> glove_word_embedding
[12, 325, 123, ...] -> ["the", "equal", "append"] -> [15, 645, 722, ...] -> [[...], [...]] (shape (n_words, 300))
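The conversion chain above can be sketched with plain Python dicts (the data here is made up for illustration; the real pipeline below uses TensorFlow lookup tables instead):

```python
# Toy sketch of the two-hop index conversion: IMDB id -> word -> GloVe row.
imdb_index_to_word = {12: "the", 325: "equal", 123: "append"}   # IMDB id -> word
glove_word_to_index = {"the": 15, "equal": 645, "append": 722}  # word -> GloVe row index

def imdb_to_glove(imdb_ids, unknown_index=1):
    """Map IMDB word ids to GloVe embedding row indices.

    Words missing from either mapping fall back to `unknown_index`,
    mirroring the default_value=1 used in the lookup table below.
    """
    return [glove_word_to_index.get(imdb_index_to_word.get(i, ""), unknown_index)
            for i in imdb_ids]

print(imdb_to_glove([12, 325, 123]))  # -> [15, 645, 722]
```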

Here is my code:


import tensorflow as tf
from tensorflow.keras.datasets import imdb

sess = tf.Session()

glove_word_dict, tf_embedding = glove_utils.as_tensor.load_embedding_and_dict(data_folder, glove_name,
                                                                              glove_dimension, sess)

# Build an index -> word array from the GloVe word -> index dict.
glove_dict_array = []
for key, value in glove_word_dict.items():
    glove_dict_array.insert(value, key)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

imdb_word_index_dict = imdb.get_word_index()

# Build an index -> word array from the IMDB word -> index dict,
# stripping backslashes and quotes from the words.
imdb_word_index_array = []
for key, value in imdb_word_index_dict.items():
    key = key.replace("\\", "").replace("'", "")
    imdb_word_index_array.insert(value, key)

indices = tf.placeholder(dtype=tf.int64, shape=(None,))

imdb_word_index_tf_table = tf.contrib.lookup.index_to_string_table_from_tensor(tf.constant(imdb_word_index_array),
                                                                               default_value="UNKNOWN")

glove_word_dict_tf_table = tf.contrib.lookup.index_table_from_tensor(mapping=tf.constant(glove_dict_array),
                                                                     default_value=1)

word_index_tf_indices = imdb_word_index_tf_table.lookup(indices)

glove_indices = glove_word_dict_tf_table.lookup(word_index_tf_indices)

result = tf.nn.embedding_lookup(params=tf_embedding, ids=glove_indices)

sess.run(tf.tables_initializer())

sess.run(tf.global_variables_initializer())

embedding, glove_indices_result = sess.run([result, glove_indices], feed_dict={
    indices: [4, 4, 4, 4, 4]
})

Here is the problem

Every time I ran the code block above, glove_indices_result contained the correct values, but embedding somehow returned only the default value: a bunch of zeros.
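For intuition, `tf.nn.embedding_lookup(params, ids)` is essentially row indexing, so correct indices looked up against a matrix that never got its real values will still "work" and just return the unloaded rows. A NumPy toy illustration of the symptom (not the repo's code):

```python
import numpy as np

# embedding_lookup(params, ids) is row indexing: params[ids].
vocab, dim = 5, 3
params = np.zeros((vocab, dim))          # matrix stuck at its initial value
ids = np.array([4, 4, 4])                # the indices themselves are valid
print(params[ids])                       # rows of zeros, despite good ids

params = np.arange(vocab * dim).reshape(vocab, dim)  # after a real load
print(params[ids])                       # now the lookup returns loaded rows
```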

And here is the solution

I changed this code block, which is used when loading tf_embedding (the embedding tensor):

# 1. Define the variable that will hold the embedding:
tf_embedding = tf.Variable(
    tf.constant(1.0, shape=shape),
    trainable=False,
    name="Embedding"
)

# 2. Restore the embedding from disks to TensorFlow, GPU (or CPU if GPU unavailable):

with this

# 1. Define the variable that will hold the embedding:
tf_embedding = tf.get_variable(
    name='Embedding',
    shape=shape,
    trainable=False)

# 2. Restore the embedding from disks to TensorFlow, GPU (or CPU if GPU unavailable):

and it is working like a charm now. Thanks for the great work and your time. This issue took half a day from me, so I don't want it to take up someone else's time.

Thank you.

guillaume-chevalier commented 5 years ago

Hi @suatfk, thanks for sharing! I'll wait for TensorFlow 2.0 to come out before changing this.

Maybe this problem was caused by an update, or perhaps it happens if you try to declare two tensors with the same name, which is why you would need get_variable. Or maybe the global_variables_initializer was overriding the values with zeros... interesting. I'll check that when refactoring.
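The initializer-ordering hypothesis can be illustrated with a plain-Python simulation (hypothetical stand-ins, not the TensorFlow API): if something playing the role of global_variables_initializer runs after the embedding has been loaded, it silently clobbers the loaded weights.

```python
import numpy as np

weights = {}

def initialize():                 # stand-in for global_variables_initializer
    weights["embedding"] = np.zeros((4, 2))

def load_embedding(values):       # stand-in for the restore/assign op
    weights["embedding"] = values

glove = np.ones((4, 2))

# Wrong order: load first, then initialize -> loaded values are wiped out.
load_embedding(glove)
initialize()
print(weights["embedding"].sum())  # 0.0

# Safe order: initialize first, then load -> loaded values survive.
initialize()
load_embedding(glove)
print(weights["embedding"].sum())  # 8.0
```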