nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

How do I load trained models? (20 news groups example) #62

Closed KyungB closed 4 years ago

KyungB commented 4 years ago

After running run_20newsgroups.py, I got model.ckpt, .meta, and .index files in logdir_yymmdd_nnnn, so I tried loading the model with:

from lda2vec import utils, model
import numpy as np
# Path to preprocessed data
data_path  = "data/clean_data"
# Whether or not to load saved embeddings file
load_embeds = True

# Load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

# Number of unique documents
num_docs = len(np.unique(doc_ids))
print('Num docs: %f'%num_docs)
# Number of unique words in vocabulary (int)
vocab_size = embed_matrix.shape[0] 
print('Vocab size: %f'%vocab_size)
# Embed layer dimension size
# If not loading embeds, change 128 to whatever size you want.
embed_size = embed_matrix.shape[1] if load_embeds else 128
print('Embed size: %f'%embed_size)
# Number of topics to cluster into
num_topics = 16
# Epoch that we want to "switch on" LDA loss
switch_loss_epoch = 5
# Pretrained embeddings
pretrained_embeddings = embed_matrix if load_embeds else None
# If True, save logdir, otherwise don't
save_graph = True
num_epochs = 100
batch_size = 8192 #4096

# Initialize the model
m = model(num_docs,
          vocab_size,
          num_topics,
          embedding_size=embed_size,
          pretrained_embeddings=pretrained_embeddings,
          freqs=freqs,
          batch_size = batch_size,
          save_graph_def=save_graph,
          restore=True,
          logdir='logdir_191205_2320')

# Train the model
# m.train(pivot_ids,
#         target_ids,
#         doc_ids,
#         len(pivot_ids),
#         num_epochs,
#         idx_to_word=idx_to_word,
#         switch_loss_epoch=switch_loss_epoch)

# Visualize topics with pyldavis
utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)

and I am getting the following error:

/Lda2vec-Tensorflow/tests/twenty_newsgroups
Using TensorFlow backend.
Num docs: 10722.000000
Vocab size: 5196.000000
Embed size: 300.000000
WARNING:tensorflow:From /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/lda2vec-0.16.10-py3.6.egg/lda2vec/Lda2vec.py:40: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/lda2vec-0.16.10-py3.6.egg/lda2vec/Lda2vec.py:43: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-12-09 21:12:17.346986: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX512F
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-12-09 21:12:17.378228: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2999995000 Hz
2019-12-09 21:12:17.378444: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55617ec28b80 executing computations on platform Host. Devices:
2019-12-09 21:12:17.378460: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-09 21:12:17.378600: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
WARNING:tensorflow:From /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/lda2vec-0.16.10-py3.6.egg/lda2vec/Lda2vec.py:105: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.

WARNING:tensorflow:From /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1282: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2019-12-09 21:12:17.862352: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
  File "visualize_20newsgroups.py", line 56, in <module>
    utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/lda2vec-0.16.10-py3.6.egg/lda2vec/utils.py", line 172, in generate_ldavis_data
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/lda2vec-0.16.10-py3.6.egg/lda2vec/utils.py", line 67, in prepare_topics
AssertionError: Vocabulary size did not match size of word vectors

dbl001 commented 4 years ago

The preprocessing step generates numpy matrices and pickle files with files for word-to-index, index-to-word, embedding vectors, etc.

$ ls -l data/clean_data
total 205744
-rw-r--r-- 1 davidlaxer staff    85064 Apr 8 2019 doc_lengths.npy
-rw-r--r-- 1 davidlaxer staff  8551328 Apr 8 2019 embedding_matrix.npy
-rw-r--r-- 1 davidlaxer staff    28632 Apr 8 2019 freqs.npy
-rw-r--r-- 1 davidlaxer staff    68335 Apr 8 2019 idx_to_word.pickle
-rw-r--r-- 1 davidlaxer staff 96531372 Apr 8 2019 skipgrams.txt
-rw-r--r-- 1 davidlaxer staff    68335 Apr 8 2019 word_to_idx.pickle

  1. What do you see when you run 'wc' on the following files?

(ai) MacBook-Pro:twenty_newsgroups davidlaxer$ wc data/clean_data/embedding_matrix.npy
    8461  331257 8551328 data/clean_data/embedding_matrix.npy
(ai) MacBook-Pro:twenty_newsgroups davidlaxer$ wc data/clean_data/word_to_idx.pickle
     732    3312   68335 data/clean_data/word_to_idx.pickle
(ai) MacBook-Pro:twenty_newsgroups davidlaxer$ wc data/clean_data/idx_to_word.pickle
     732    3312   68335 data/clean_data/idx_to_word.pickle

  2. What's the size of 'word_embed'? word_embed = m.sesh.run(m.w_embed.embedding)

  3. Is it the same dimension as the matrix stored in data/clean_data/embedding_matrix.npy?
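
Since .npy and .pickle files are binary, 'wc' only gives a rough hint of their contents. The sizes can also be read directly in Python. The following is a minimal sketch, not part of lda2vec: the helper names summarize_preprocessed and vocab_is_consistent are hypothetical, the file names are taken from the listing above, and it assumes that the two pickles hold dictionaries and that all four artifacts are meant to have one entry per vocabulary word.

```python
import os
import pickle

import numpy as np


def summarize_preprocessed(data_path):
    """Return the sizes of the preprocessed artifacts found in data_path."""
    embed_matrix = np.load(os.path.join(data_path, "embedding_matrix.npy"))
    freqs = np.load(os.path.join(data_path, "freqs.npy"))
    with open(os.path.join(data_path, "idx_to_word.pickle"), "rb") as f:
        idx_to_word = pickle.load(f)
    with open(os.path.join(data_path, "word_to_idx.pickle"), "rb") as f:
        word_to_idx = pickle.load(f)
    return {
        "embedding_shape": embed_matrix.shape,  # (rows, embedding size)
        "num_freqs": len(freqs),
        "idx_to_word": len(idx_to_word),
        "word_to_idx": len(word_to_idx),
    }


def vocab_is_consistent(summary):
    """True when every artifact reports the same number of vocabulary entries."""
    sizes = {
        summary["embedding_shape"][0],
        summary["num_freqs"],
        summary["idx_to_word"],
        summary["word_to_idx"],
    }
    return len(sizes) == 1
```

Calling summarize_preprocessed("data/clean_data") and comparing embedding_shape with the shape of word_embed from the restored model would show whether the data on disk and the checkpoint agree on the vocabulary size.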


nateraw commented 4 years ago

@dbl001 is correct. Your error is happening in the pyLDAvis section. It seems your model loaded fine (unless some underlying issue with the model load is causing that piece to fail). I'm taking a look now.

nateraw commented 4 years ago

@KyungB do you have the version on the master branch installed? If not, do that as a first step. Next, I'd make sure you are using the same preprocessed data as was used to train your model. The error you received comes from a simple assertion I included at the top of that function. If the model you are trying to visualize topics for was trained on different data, you will almost certainly get that error.
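
One way to confirm that the preprocessed data has not changed since training is to store a fingerprint of it next to the checkpoint. The sketch below is only an illustration and is not something lda2vec does: the function names and the data_fingerprint.json file name are hypothetical. It hashes every file in the data directory with SHA-256 at training time and later reports which files differ.

```python
import hashlib
import json
import os


def fingerprint_dir(data_path, chunk_size=1 << 20):
    """Map each regular file in data_path to its SHA-256 hex digest."""
    digests = {}
    for name in sorted(os.listdir(data_path)):
        path = os.path.join(data_path, name)
        if not os.path.isfile(path):
            continue
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so large files such as skipgrams.txt fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests[name] = h.hexdigest()
    return digests


def save_fingerprint(data_path, logdir, filename="data_fingerprint.json"):
    """Write the fingerprint of data_path into logdir (hypothetical file name)."""
    with open(os.path.join(logdir, filename), "w") as f:
        json.dump(fingerprint_dir(data_path), f, indent=2, sort_keys=True)


def changed_files(data_path, logdir, filename="data_fingerprint.json"):
    """Return names of files added, removed, or modified since the fingerprint was saved."""
    with open(os.path.join(logdir, filename)) as f:
        saved = json.load(f)
    current = fingerprint_dir(data_path)
    names = set(saved) | set(current)
    return sorted(n for n in names if saved.get(n) != current.get(n))
```

For example, save_fingerprint("data/clean_data", "logdir_191205_2320") could be called right after training, and changed_files with the same arguments before restoring the model; an empty list would mean the files are byte-for-byte the ones used for training.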

Here is a Colab notebook where we run the example, restore the model, and then launch pyLDAvis. Closing this as I was able to get it to run there.

KyungB commented 4 years ago

Thank you for your quick responses. It was because of the change in data apparently.