jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Build LSTM logic for local chatbot model #74

Closed jeff1evesque closed 5 years ago

jeff1evesque commented 5 years ago

Currently the Vagrantfile deploys a prebuilt chatbot, executed by run.py. However, we need to adjust run.py so it can optionally build our own model. This can be implemented with an optional flag that, when set to True, builds our own local chatbot model instead. Though many different solutions exist, we will initially try Keras to reduce syntax requirements.
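
One way to wire in such a flag, sketched here with argparse (the --local flag name matches the invocation shown in a later comment; the branch bodies are placeholders, not the repo's actual logic):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--local',
    action='store_true',
    help='build a local chatbot model instead of deploying the prebuilt one')
args = parser.parse_args()

if args.local:
    # hypothetical branch: build and train our own local model here
    print('building local chatbot model')
else:
    # default branch: run the prebuilt chatbot as before
    print('deploying prebuilt chatbot')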

jeff1evesque commented 5 years ago

ef6f289: TensorFlow and Keras need to be compatible versions.

jeff1evesque commented 5 years ago

The current Keras implementation generates the following trace and model:

root@development:/vagrant# python3 run.py --local
Using TensorFlow backend.
Train on 4 samples, validate on 1 samples
Epoch 1/1
2018-12-23 00:17:03.116867: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
4/4 [==============================] - 39s 10s/step - loss: 0.1385 - val_loss: 0.1588
/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py:2344: UserWarning: Layer lstm_2 was passed non-serializable keyword arguments: {'initial_state': [<tf.Tensor 'lstm_1/while/Exit_2:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'lstm_1/while/Exit_3:0' shape=(?, 256) dtype=float32>]}. They will not be included in the serialized model (and thus will be missing at deserialization time).
  str(node.arguments) + '. They will not be included '
root@development:/vagrant#
root@development:/vagrant#
root@development:/vagrant# ls -l model/
total 5116
-rwxrwxrwx 1 vagrant vagrant 5235840 Dec 23 00:17 chatbot.h5

We'll need to create logic that imports the corresponding chatbot.h5, then checks whether the model is capable of performing a prediction (a rough sketch follows the snippet below). Additionally, it's important to remember that nb_samples was scaled down by a factor of 1000 to allow development and debugging on a local machine:

[...INITIAL-CODE-OMITTED...]
    #
    # local variables
    #
    # Note: posts and comments should be the same length; nb_samples is derived from posts.
    #
    post_chars = set()
    comment_chars = set()
    post_lookup_index = {}
    post_lookup_char = {}
    comment_lookup_char = {}
    comment_lookup_index = {}
    nb_samples = int(len(posts) / 1000)
[...ENDING-CODE-OMITTED...]
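
A rough sketch of that import-and-verify step, assuming Keras' load_model and placeholder dimensions (the real values must match whatever encoding produced chatbot.h5):

import numpy as np
from keras.models import load_model

# placeholder dimensions; these must match the training-time encoding
post_maxlen = 10
vocab_size = 1865

# import the serialized model, then confirm it can produce a prediction
model = load_model('model/chatbot.h5')
sample = np.zeros((1, post_maxlen, vocab_size))
print(model.predict(sample).shape)
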
jeff1evesque commented 5 years ago

Running --train locally, over the entire --insert dataset, results in a MemoryError:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 21458
Traceback (most recent call last):
  File "run.py", line 120, in <module>
    main(op='train')
  File "run.py", line 70, in main
    model = train(posts, comments, cwd=cwd)
  File "/vagrant/chatbot/app/train.py", line 56, in train
    posts_train = create_posts(posts, vocab_size, post_maxlen, word2idx)
  File "/vagrant/chatbot/app/train.py", line 138, in create_posts
    post_idx = np.zeros(shape=(len(posts), post_maxlen, vocab_size))
MemoryError
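
The failing allocation is a dense one-hot array, so its footprint is len(posts) × post_maxlen × vocab_size × 8 bytes (np.zeros defaults to float64). A back-of-the-envelope check, with a hypothetical post count:

n_posts = 100000       # hypothetical; depends on how much --insert loaded
post_maxlen = 10       # assumed maximum words per post
vocab_size = 21458     # from the trace above
gib = n_posts * post_maxlen * vocab_size * 8 / 1024 ** 3
print('{0:.1f} GiB'.format(gib))  # ~159.9 GiB, far beyond a local VM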

If we decrease our vocab size:

$ git diff chatbot/
diff --git a/chatbot/app/train.py b/chatbot/app/train.py
index eb707da..f512eaa 100644
--- a/chatbot/app/train.py
+++ b/chatbot/app/train.py
@@ -50,7 +50,7 @@ def train(
     word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
     idx2word = {v:k for k,v in word2idx.items()}
     idx2word[0] = 'PAD'
-    vocab_size = len(word2idx) + 1
+    vocab_size = int((len(word2idx) + 1)/1000)
     print('vocabulary size: {vocab}'.format(vocab=vocab_size))

     posts_train = create_posts(posts, vocab_size, post_maxlen, word2idx)

Because word2idx still maps words to indices beyond the reduced vocab_size, we get an out-of-bounds error:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 21
i: 0, w: Your
Traceback (most recent call last):
  File "run.py", line 120, in <module>
    main(op='train')
  File "run.py", line 70, in main
    model = train(posts, comments, cwd=cwd)
  File "/vagrant/chatbot/app/train.py", line 56, in train
    posts_train = create_posts(posts, vocab_size, post_maxlen, word2idx)
  File "/vagrant/chatbot/app/train.py", line 140, in create_posts
    post = encode(posts[p], post_maxlen,vocab_size, word2idx)
  File "/vagrant/chatbot/app/train.py", line 128, in encode
    indices[i, word2idx[w]] = 1
IndexError: index 486 is out of bounds for axis 1 with size 21
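
A sketch of one way to cap the vocabulary consistently instead, keeping only the most frequent words and mapping the rest to a shared UNK index (an assumption on our part, not necessarily the fix the repo adopted):

from collections import Counter

def build_vocab(words, max_words=2000):
    # keep the max_words most frequent words; reserve 0 for PAD and 1 for UNK
    counter = Counter(words)
    word2idx = {w: i + 2 for i, (w, _) in enumerate(counter.most_common(max_words))}
    word2idx['UNK'] = 1
    idx2word = {v: k for k, v in word2idx.items()}
    idx2word[0] = 'PAD'
    vocab_size = max(idx2word) + 1  # array axis must cover the largest index
    return word2idx, idx2word, vocab_size

# encode() can then fall back to UNK for out-of-vocabulary words:
#     indices[i, word2idx.get(w, word2idx['UNK'])] = 1
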
jeff1evesque commented 5 years ago

We temporarily moved all of our sample data, except the first month, into chatbot/data2:

root@development:/vagrant# date
Thu Dec 27 04:10:13 UTC 2018
$ git status
On branch feature-74
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   chatbot/data/reddit-2005-12
        deleted:    chatbot/data/reddit-2006-01
        deleted:    chatbot/data/reddit-2006-02
        deleted:    chatbot/data/reddit-2006-03
        deleted:    chatbot/data/reddit-2006-04
        deleted:    chatbot/data/reddit-2006-05

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        chatbot/data2/

no changes added to commit (use "git add" and/or "git commit -a")

Then we removed approximately half of the existing content from chatbot/data/reddit-2005-12, and executed python3 run.py --insert followed by python3 run.py --train:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 1865
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 10, 1865)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               1020928
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 20, 128)           0
_________________________________________________________________
time_distributed_1 (TimeDist (None, 20, 1865)          240585
_________________________________________________________________
activity_regularization_1 (A (None, 20, 1865)          0
_________________________________________________________________
activation_1 (Activation)    (None, 20, 1865)          0
=================================================================
Total params: 1,261,513
Trainable params: 1,261,513
Non-trainable params: 0
_________________________________________________________________
None
Train on 54 samples, validate on 14 samples
Epoch 1/1
2018-12-27 04:12:37.075971: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
32/54 [================>.............] - ETA: 1s - loss: 26.9587 - acc: 0.0000e+00Epoch 00001: saving model to /vagrant/model/checkpoint.ckpt
54/54 [==============================] - 4s 66ms/step - loss: 23.9524 - acc: 0.0000e+00 - val_loss: 14.3032 - val_acc: 0.0000e+00
root@development:/vagrant# ls -l model/
total 29644
-rwxrwxrwx 1 vagrant vagrant 15165772 Dec 27 04:12 chatbot.h5
-rwxrwxrwx 1 vagrant vagrant 15165772 Dec 27 04:12 checkpoint.ckpt
-rwxrwxrwx 1 vagrant vagrant    18185 Dec 27 04:12 idx2word.pkl
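
With idx2word.pkl now saved alongside the weights, a prediction can be decoded back into words by taking the argmax over the vocabulary axis at each of the 20 output timesteps. A hedged sketch, assuming the same one-hot input encoding used during training:

import pickle

import numpy as np
from keras.models import load_model

model = load_model('model/chatbot.h5')
with open('model/idx2word.pkl', 'rb') as f:
    idx2word = pickle.load(f)

# hypothetical encoded post: shape (1, 10, 1865) to match input_1 above
encoded_post = np.zeros((1, 10, 1865))
prediction = model.predict(encoded_post)  # shape (1, 20, 1865)

# pick the highest-probability word at each timestep, dropping PAD
words = [idx2word[int(i)] for i in prediction[0].argmax(axis=-1)
         if idx2word[int(i)] != 'PAD']
print(' '.join(words))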