jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Choose which files to ingest #100

Closed jeff1evesque closed 5 years ago

jeff1evesque commented 5 years ago

We need the ability to select specific files, or all files, to ingest into the database. Copying files out of a directory prior to ingest is cumbersome.
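
A rough sketch of the requested behavior is below. It is only an illustration: the --files flag, the data directory, and the database/collection names are assumptions, not existing project code.

import argparse
import glob
import json
import os

from pymongo import MongoClient

parser = argparse.ArgumentParser()
parser.add_argument(
    '--files',
    nargs='*',
    help='dump files to ingest; omit to ingest every file in the data directory'
)
args = parser.parse_args()

data_dir = '/vagrant/Reddit/data'                        # assumed dump location
selected = args.files if args.files else sorted(glob.glob(os.path.join(data_dir, '*')))

collection = MongoClient()['reddit']['comments']         # assumed db/collection names
for path in selected:
    with open(path) as handle:
        # assuming one JSON object per line, as in the reddit dump files
        documents = [json.loads(line) for line in handle if line.strip()]
    if documents:
        collection.insert_many(documents)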

jeff1evesque commented 5 years ago

8be5ee7: the local LSTM model was trained using the entire reddit-2005-12 dataset:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 3764
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 10, 3764)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               1993216
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 20, 128)           0
_________________________________________________________________
time_distributed_1 (TimeDist (None, 20, 3764)          485556
_________________________________________________________________
activity_regularization_1 (A (None, 20, 3764)          0
_________________________________________________________________
activation_1 (Activation)    (None, 20, 3764)          0
=================================================================
Total params: 2,478,772
Trainable params: 2,478,772
Non-trainable params: 0
_________________________________________________________________
None
Train on 144 samples, validate on 36 samples
Epoch 1/1
2018-12-30 22:52:39.142096: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
128/144 [=========================>....] - ETA: 0s - loss: 18.0679 - acc: 3.9063e-04Epoch 00001: saving model to /vagrant/Reddit/model/checkpoint.ckpt
144/144 [==============================] - 8s 54ms/step - loss: 17.3768 - acc: 3.4722e-04 - val_loss: 14.7382 - val_acc: 0.0028
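
For reference, a minimal Keras sketch that reproduces a summary like the one above is shown below. The shapes and layer order follow the printed summary; the loss, optimizer, regularization strength, and validation split are assumptions rather than the project's confirmed settings.

from keras.models import Model
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.layers import ActivityRegularization, Activation
from keras.callbacks import ModelCheckpoint

vocab_size = 3764   # "vocabulary size" printed above
input_len = 10      # encoder timesteps, per the Input shape
output_len = 20     # decoder timesteps, per the RepeatVector shape

inputs = Input(shape=(input_len, vocab_size))            # one-hot encoded prompt
encoded = LSTM(128)(inputs)                              # 1,993,216 params
repeated = RepeatVector(output_len)(encoded)             # (None, 20, 128)
decoded = TimeDistributed(Dense(vocab_size))(repeated)   # 485,556 params
decoded = ActivityRegularization(l2=1e-4)(decoded)       # regularization strength assumed
outputs = Activation('softmax')(decoded)

model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
print(model.summary())   # prints the table above, then 'None'

# the per-epoch "saving model to .../checkpoint.ckpt" lines suggest a
# ModelCheckpoint callback roughly like this (path copied from the log):
checkpoint = ModelCheckpoint('/vagrant/Reddit/model/checkpoint.ckpt', verbose=1)
# model.fit(x, y, epochs=1, validation_split=0.2, callbacks=[checkpoint])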

The prediction seems garbled:

root@development:/vagrant# python3 run.py --local
Using TensorFlow backend.

> hey did I input enough data?
2018-12-30 22:54:07.765990: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
['sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes', 'debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu Chu', 'debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes', 'debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt debt', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql', 'CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS', 'exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes exposes', 'CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS CVS', 'sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql sql']
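
The degenerate repetition is what greedy decoding of an undertrained model tends to produce: if the softmax at every one of the 20 output timesteps is dominated by the same frequent token, the decoded reply is that token repeated. A small illustration, assuming per-timestep argmax decoding (which may differ from what run.py actually does):

import numpy as np

def greedy_decode(probs, idx2word):
    """probs: (timesteps, vocab_size) softmax output for one candidate reply."""
    return ' '.join(idx2word[int(np.argmax(row))] for row in probs)

# when every row of 'probs' peaks at the same index, the result is a single
# word repeated 'timesteps' times, e.g. 'sql sql sql ...'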
jeff1evesque commented 5 years ago

The reddit-2006-01 dataset was appended to the mongodb collection, so the single-epoch training run contained more training samples:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 8631
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 10, 8631)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               4485120
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 20, 128)           0
_________________________________________________________________
time_distributed_1 (TimeDist (None, 20, 8631)          1113399
_________________________________________________________________
activity_regularization_1 (A (None, 20, 8631)          0
_________________________________________________________________
activation_1 (Activation)    (None, 20, 8631)          0
=================================================================
Total params: 5,598,519
Trainable params: 5,598,519
Non-trainable params: 0
_________________________________________________________________
None
Train on 598 samples, validate on 150 samples
Epoch 1/1
2018-12-30 22:58:13.389857: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
576/598 [===========================>..] - ETA: 2s - loss: 11.6508 - acc: 0.0000e+00Epoch 00001: saving model to /vagrant/Reddit/model/checkpoint.ckpt
598/598 [==============================] - 65s 108ms/step - loss: 11.5469 - acc: 0.0000e+00 - val_loss: 9.2770 - val_acc: 3.3333e-04

The prediction seems to worsen:

root@development:/vagrant# python3 run.py --local
Using TensorFlow backend.

> hey did I input enough data?
2018-12-30 23:02:10.689034: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
['kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding 
kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding', 'kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding kidding']
jeff1evesque commented 5 years ago

Five epochs were used for the combined reddit-2005-12 and reddit-2006-01 data:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 8631
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 10, 8631)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               4485120
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 20, 128)           0
_________________________________________________________________
time_distributed_1 (TimeDist (None, 20, 8631)          1113399
_________________________________________________________________
activity_regularization_1 (A (None, 20, 8631)          0
_________________________________________________________________
activation_1 (Activation)    (None, 20, 8631)          0
=================================================================
Total params: 5,598,519
Trainable params: 5,598,519
Non-trainable params: 0
_________________________________________________________________
None
Train on 598 samples, validate on 150 samples
Epoch 1/5
2018-12-30 23:03:39.297038: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
576/598 [===========================>..] - ETA: 2s - loss: 11.5413 - acc: 2.6042e-04Epoch 00001: saving model to /vagrant/Reddit/model/checkpoint.ckpt
598/598 [==============================] - 59s 99ms/step - loss: 11.4308 - acc: 2.5084e-04 - val_loss: 9.3417 - val_acc: 0.0010
Epoch 2/5
576/598 [===========================>..] - ETA: 2s - loss: 8.3701 - acc: 9.5486e-04Epoch 00002: saving model to /vagrant/Reddit/model/checkpoint.ckpt
598/598 [==============================] - 65s 109ms/step - loss: 8.3517 - acc: 9.1973e-04 - val_loss: 8.4119 - val_acc: 0.0000e+00
Epoch 3/5
576/598 [===========================>..] - ETA: 2s - loss: 7.7839 - acc: 1.7361e-04Epoch 00003: saving model to /vagrant/Reddit/model/checkpoint.ckpt
598/598 [==============================] - 67s 112ms/step - loss: 7.7859 - acc: 2.5084e-04 - val_loss: 8.1504 - val_acc: 0.0000e+00
Epoch 4/5
576/598 [===========================>..] - ETA: 2s - loss: 7.6059 - acc: 8.6806e-05Epoch 00004: saving model to /vagrant/Reddit/model/checkpoint.ckpt
598/598 [==============================] - 62s 104ms/step - loss: 7.6452 - acc: 1.6722e-04 - val_loss: 8.0417 - val_acc: 0.0000e+00
Epoch 5/5
576/598 [===========================>..] - ETA: 2s - loss: 7.5787 - acc: 0.0000e+00Epoch 00005: saving model to /vagrant/Reddit/model/checkpoint.ckpt
598/598 [==============================] - 61s 102ms/step - loss: 7.5938 - acc: 0.0000e+00 - val_loss: 7.9829 - val_acc: 0.0000e+00
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7fae14501320>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable
root@development:/vagrant#
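
The TypeError above is raised while the interpreter tears down the TensorFlow session, not during training itself. A commonly used mitigation (whether it resolves this particular traceback is an assumption) is to release the Keras session explicitly once training finishes, for example:

from keras import backend as K

def finish_training(model, weights_path):
    """hypothetical helper: save weights, then drop the backend session."""
    model.save_weights(weights_path)
    K.clear_session()   # avoid relying on Session.__del__ at interpreter shutdown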

Though further investigation is needed regarding the above TypeError, the associated prediction is slightly richer in its response vocabulary:

root@development:/vagrant# python3 run.py --local
Using TensorFlow backend.

> hey did I input enough data?
2018-12-30 23:09:41.461844: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
['vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing', 'stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter', 'expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter stricter', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'investment investment investment investment investment investment investment investment investment investment investment investment investment investment investment investment investment investment investment investment', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic 
vitriolic vitriolic vitriolic vitriolic', 'stem stem stem stem stem stem stem stem stem stem stem stem stem stem stem stem stem stem stem stem', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing expressing', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated debated', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic', 'vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic vitriolic']
jeff1evesque commented 5 years ago

acfd044: the associated pickled LSTM components are approaching 100MB. Our next test will likely involve appending another dataset, as well as increasing the number of epochs, so the generated model will likely exceed GitHub's 100MB file limit. Therefore, the remaining exploration of this issue will be left as an exercise.

jeff1evesque commented 5 years ago

Appending the reddit-2006-02 dataset into our mongodb collection, then training, yields a MemoryError:

root@development:/vagrant# python3 run.py --train
Using TensorFlow backend.
vocabulary size: 15483
Traceback (most recent call last):
  File "run.py", line 176, in <module>
    main(op='train')
  File "run.py", line 96, in main
    cwd=cwd
  File "/vagrant/Reddit/app/train.py", line 66, in train
    word2idx
  File "/vagrant/Reddit/app/train.py", line 194, in create_comments
    comment_idx = np.zeros(shape=(len(comments), comment_maxlen, vocab_size))
MemoryError
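
The failing line allocates a dense one-hot array of shape (len(comments), comment_maxlen, vocab_size); a quick back-of-the-envelope check shows why that no longer fits. Here comment_maxlen and vocab_size come from the output above, while the comment count is purely an illustrative assumption:

n_comments = 3000        # assumed, for illustration only
comment_maxlen = 20
vocab_size = 15483

# np.zeros defaults to float64, i.e. 8 bytes per cell
bytes_needed = n_comments * comment_maxlen * vocab_size * 8
print('%.1f GiB' % (bytes_needed / 2 ** 30))   # roughly 6.9 GiB under these assumptions

Even with a modest comment count, the dense one-hot target grows with the vocabulary; switching to integer targets with a sparse loss (e.g. sparse_categorical_crossentropy) is one common way to shrink it, independent of moving to a larger machine.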

Therefore, we'll likely need to scale up from our VirtualBox instance to a larger compute instance.