tsudalab / ChemTS

Molecule Design using Monte Carlo Tree Search with Neural Rollout

Accuracy Goes Down And Loss Increases After Fifth Epoch #3

Closed · LRParser closed this issue 6 years ago

LRParser commented 6 years ago

Thanks for a great paper and project! Just a quick question - is it expected to see the validation loss increasing during training? Here's an excerpt of my train_RNN.py run:

(cs510-env) joe@msivr:~/ChemTS/train_RNN$ python train_RNN.py 
Using TensorFlow backend.
['\n', '&', 'C', '(', ')', 'c', '1', '2', 'o', '=', 'O', 'N', '3', 'F', '[C@@H]', 'n', '-', '#', 'S', 'Cl', '[O-]', '[C@H]', '[NH+]', '[C@]', 's', 'Br', '/', '[nH]', '[NH3+]', '4', '[NH2+]', '[C@@]', '[N+]', '[nH+]', '\\', '[S@]', '5', '[N-]', '[n+]', '[S@@]', '[S-]', '6', '7', 'I', '[n-]', 'P', '[OH+]', '[NH-]', '[P@@H]', '[P@@]', '[PH2]', '[P@]', '[P+]', '[S+]', '[o+]', '[CH2-]', '[CH-]', '[SH+]', '[O+]', '[s+]', '[PH+]', '[PH]', '8', '[S@@+]']
249456
(249456, 81, 64)
train_RNN.py:276: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(input_shape=(81, 64), activation="tanh", return_sequences=True, units=512)`
  model.add(GRU(output_dim=512, input_shape=(81,64),activation='tanh',return_sequences=True))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 81, 64)            4096      
_________________________________________________________________
gru_1 (GRU)                  (None, 81, 512)           886272    
_________________________________________________________________
dropout_1 (Dropout)          (None, 81, 512)           0         
_________________________________________________________________
gru_2 (GRU)                  (None, 81, 512)           1574400   
_________________________________________________________________
dropout_2 (Dropout)          (None, 81, 512)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 81, 64)            32832     
=================================================================
Total params: 2,497,600
Trainable params: 2,497,600
Non-trainable params: 0
_________________________________________________________________
None
/home/joe/anaconda3/envs/cs510-env/lib/python3.6/site-packages/keras/models.py:844: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.
  warnings.warn('The `nb_epoch` argument in `fit` '
Train on 224510 samples, validate on 24946 samples
Epoch 1/100
2017-10-24 20:24:31.121363: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 20:24:31.149499: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 20:24:31.149559: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 20:24:31.149574: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 20:24:31.149586: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 20:24:32.398217: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-24 20:24:32.473252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.58GiB
2017-10-24 20:24:32.490260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-24 20:24:32.490558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-24 20:24:32.533155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
224510/224510 [==============================] - 601s - loss: 1.5048 - acc: 0.5973 - val_loss: 1.0923 - val_acc: 0.6607
Epoch 2/100
224510/224510 [==============================] - 171s - loss: 1.2080 - acc: 0.6355 - val_loss: 1.1231 - val_acc: 0.6476
Epoch 3/100
224510/224510 [==============================] - 174s - loss: 1.1588 - acc: 0.6409 - val_loss: 1.0980 - val_acc: 0.6479
Epoch 4/100
224510/224510 [==============================] - 175s - loss: 1.1511 - acc: 0.6408 - val_loss: 1.0865 - val_acc: 0.6530
Epoch 5/100
224510/224510 [==============================] - 171s - loss: 1.1522 - acc: 0.6461 - val_loss: 1.0804 - val_acc: 0.6530
Epoch 6/100
224510/224510 [==============================] - 177s - loss: 1.8968 - acc: 0.5763 - val_loss: 2.2989 - val_acc: 0.5471
Epoch 7/100
224510/224510 [==============================] - 172s - loss: 2.3063 - acc: 0.5471 - val_loss: 2.3015 - val_acc: 0.5470
Epoch 8/100
224510/224510 [==============================] - 170s - loss: 2.3078 - acc: 0.5470 - val_loss: 2.3045 - val_acc: 0.5470
Epoch 9/100
224510/224510 [==============================] - 170s - loss: 2.3095 - acc: 0.5469 - val_loss: 2.3082 - val_acc: 0.5470
Epoch 10/100
224510/224510 [==============================] - 170s - loss: 2.3084 - acc: 0.5469 - val_loss: 2.3025 - val_acc: 0.5472
Epoch 11/100
224510/224510 [==============================] - 170s - loss: 2.3110 - acc: 0.5470 - val_loss: 2.3051 - val_acc: 0.5470
Epoch 12/100
224510/224510 [==============================] - 170s - loss: 2.3110 - acc: 0.5470 - val_loss: 2.3112 - val_acc: 0.5460
Epoch 13/100

I would have expected validation loss to go down during training, but perhaps I'm not letting training run long enough? Any guidance here is appreciated.
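For reference, here is roughly how I read the model from the summary above, rewritten against the Keras 2 API that the two deprecation warnings point to. This is only a sketch reconstructed from the printed layer sizes; the dropout rate, the optimizer, and the dummy data shapes are my assumptions, not the actual contents of train_RNN.py:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding, GRU, Dropout, Dense, TimeDistributed

    # Layer sizes taken from the printed summary: 64-token vocabulary,
    # sequence length 81, embedding dim 64, two 512-unit GRU layers.
    model = Sequential()
    model.add(Embedding(input_dim=64, output_dim=64, input_length=81))
    model.add(GRU(512, activation='tanh', return_sequences=True))
    model.add(Dropout(0.2))  # rate is an assumption
    model.add(GRU(512, activation='tanh', return_sequences=True))
    model.add(Dropout(0.2))
    model.add(TimeDistributed(Dense(64, activation='softmax')))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',  # optimizer is an assumption
                  metrics=['accuracy'])

    # Dummy stand-ins for the preprocessed SMILES data (shapes assumed):
    X = np.random.randint(0, 64, size=(1000, 81))              # integer token indices
    y = np.eye(64)[np.random.randint(0, 64, size=(1000, 81))]  # one-hot next-token targets

    # Keras 2 renamed `nb_epoch` to `epochs`, as the second warning notes.
    model.fit(X, y, batch_size=512, epochs=1, validation_split=0.1)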

yangxiufengsia commented 6 years ago

Hi @LRParser, thanks for reporting this. I think this is in the nature of how neural networks train: the behaviour depends heavily on how the RNN parameters are initialized. The loss increasing means the learning rate should be made smaller (the code will be updated soon). If you want faster convergence, you can also reduce the hidden dimension of the GRU layers (e.g. from 512 to 256), as sketched below.
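Concretely, the two changes look roughly like this. This is only a sketch: the exact settings in the updated train_RNN.py may differ, and the 0.0005 learning rate is just an example value smaller than the Keras default:

    from keras.models import Sequential
    from keras.layers import Embedding, GRU, Dropout, Dense, TimeDistributed
    from keras.optimizers import Adam

    model = Sequential()
    model.add(Embedding(input_dim=64, output_dim=64, input_length=81))
    # Smaller hidden state: 256 GRU units instead of 512 for faster convergence.
    model.add(GRU(256, activation='tanh', return_sequences=True))
    model.add(Dropout(0.2))
    model.add(GRU(256, activation='tanh', return_sequences=True))
    model.add(Dropout(0.2))
    model.add(TimeDistributed(Dense(64, activation='softmax')))
    # Smaller learning rate so the loss does not jump the way it did after epoch 5.
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.0005),  # example value; the Keras default is 0.001
                  metrics=['accuracy'])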

I have updated train_RNN.py; please re-run it with: python train_RNN.py. You should now obtain good accuracy and a smaller loss.