Closed stefdoerr closed 6 years ago
Stephan- The class you're trying to use is a clean re-write in progress of the network we used in the paper "fc_sqdiff_BP_Direct_EE_ChargeEncode_Update_vdw_DSF_elu_Normalize_Dropout". I can't blame you for trying to use the much better organized new code... But at the moment I think John H. who is developing it is using Spherical Harmonics rather than symmetry functions as his embedding. Yesterday I asked Kun to make sure the symmetry functions are working with this new insance, I'll also be sure he puts some commentary on this thread as he tries it out.
Best- John
Oof that's reassuring 😄 I saw it was much simpler so I thought I would start from there directly but I was actually afraid that this was the legacy code and that you moved to the other more convoluted version. Glad to hear that, haha! I'll wait for Kuns advise then.
I think the issue here is in the sorting of your self.elements and self.element_pairs variables. From the doc string for tf_symmetry_functions"
elements (tf.int32): NElements tensor containing **sorted** unique atomic numbers present
element_pairs (tf.int32): NElementPairs x 2 tensor containing **sorted** unique pairs of atomic numbers present
So your self.elements and self.element pairs variable should be sorted from lowest atomic number to highest, like so.
self.elements = [1 8]
self.element_pairs = [[1 1], [1 8], [8 8]]
I should probably document better that sorted here means lowest to highest. Though I can't recall if I have made a way to sort the elements by default somehow, or if I have just had dumb luck and never hit this issue myself. I believe this should fix your issue.
Thanks! I had missed that. Now I hit a memory error because my system contains 699 atoms. How do you scale to large systems with TensorMol?
File "<ipython-input-8-22a9f6312119>", line 1, in <module>
model.train()
File "TFBehlerParinello.py", line 607, in train
self.compute_normalization()
File "TFBehlerParinello.py", line 246, in compute_normalization
angular_cutoff, radial_rs, angular_rs, theta_s, zeta, eta)
File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 2942, in tf_symmetry_functions
[num_atoms, num_atoms, num_atoms, num_element_pairs, tf.shape(angular_rs)[0] * tf.shape(theta_s)[0]]),
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4379, in scatter_nd
"ScatterNd", indices=indices, updates=updates, shape=shape, name=name)
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[699,699,699,3,64]
[[Node: ScatterNd_55 = ScatterNd[T=DT_DOUBLE, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](DynamicPartition_4:27, DynamicPartition_3:27, ScatterNd_55/shape)]]
Only the old instance is sparse, (linear scaling CPU and memory) at this time. It's not too much work to transfer over these features from the old instance.
For the time being if you want the features from the paper, you've gotta use the old instance. I'll post here when I put sparsity in the new instance.
Best- John
So if I don't use dipole data, but only energy and forces, which class should I use for now? Is MolInstance_DirectBP_Grad_Linear
the correct one?
Which is the scalable old version to train on energy and forces? I tried now the following:
PARAMS["learning_rate"] = 0.00001
PARAMS["momentum"] = 0.95
PARAMS["max_steps"] = 201
PARAMS["batch_size"] = 50
PARAMS["test_freq"] = 5
PARAMS["tf_prec"] = "tf.float64"
PARAMS["GradScalar"] = 1
PARAMS["NeuronType"] = "relu"
PARAMS["HiddenLayers"] = [200,200,200]
dig = MolDigester(TreatedAtoms, name_="ANI1_Sym_Direct", OType_="AtomizationEnergy")
tset = TensorMolData_BP_Direct_Linear(myset, dig, order_=1, num_indis_=1, type_="mol", WithGrad_ = True) # Initialize TensorMolData that contain the training data fo
manager=TFMolManage("",tset,False,"fc_sqdiff_BP_Direct_Grad_Linear") # Initialzie a manager than manage the training of neural network.
manager.Train(maxstep=101)
But it runs out of GPU memory with batch sizes over 10 since it creates some batch_size*natoms*natoms*something
arrays
I see that over the weekend Kun added samples/training_sample.py. That would be the best place to start, you can shut off / cut out the dipole training pretty easily, by removing it from the sess.run() in the training and evaluate. A batchsize of 10 might seem too small, but will still work. Another hack would simply be to use CPU training. Tensorflow uses virtual memory just fine on CPU.
Other comments:
Sure, here is the traceback (50 batch size, 233 waters = 699 atoms in the system)
File "<ipython-input-8-e65f5b401744>", line 13, in <module>
manager.Train(maxstep=101)
File "/shared/sdoerr/Work/TensorMol/TensorMol/TFMolManage.py", line 130, in Train
self.Instances.train(self.n_train)
File "/shared/sdoerr/Work/TensorMol/TensorMol/TFMolInstance.py", line 69, in train
self.TrainPrepare(continue_training)
File "/shared/sdoerr/Work/TensorMol/TensorMol/TFMolInstanceDirect.py", line 1967, in TrainPrepare
self.Scatter_Sym, self.Sym_Index = TFSymSet_Scattered_Linear(self.xyzs_pl, self.Zs_pl, Ele, SFPr2, Rr_cut, Elep, SFPa2, zeta, eta, Ra_cut, self.Radp_pl, self.Angt_pl)
File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 2075, in TFSymSet_Scattered_Linear
GMR = tf.reshape(TFSymRSet_Linear(R, Zs, eles_, SFPsR_, eta, Rr_cut, Radp),[nmol, natom,-1], name="FinishGMR")
File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 1619, in TFSymRSet_Linear
to_reduce2 = tf.scatter_nd(ind2,Gm,tf.cast([nmol,natom,nele,natom,nsym], dtype=tf.int64))
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4379, in scatter_nd
"ScatterNd", indices=indices, updates=updates, shape=shape, name=name)
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[50,699,2,699,32]
[[Node: ScatterNd = ScatterNd[T=DT_DOUBLE, Tindices=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape_11, Reshape_7, Cast_3/_193)]]
[[Node: total_loss/_379 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2773_total_loss", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
I mean yes 10 batch size is not the problem, rather that the version it is using scales quadratically with the number of atoms in the system. Density is pretty standard water density at 300K 1atm pressure.
Ah yes the relu
I forgot to change. By the way, did you ever test elu
against softplus
? (just out of curiosity since it's very popular and you already use elu
for the DSF
kernel but I didn't see a mention as an activation function.
Would the code also be able to train on periodic data? I am using a periodic water box so I imagine that the energies and forces at the edges would trip it off if it doesn't understand the periodicity.
Sorry for the sheer amount of questions 😄
Stefan- No need to apologize! You are helping improve the robustness of the code. Honestly I am thrilled that people with your level of intellect and capability are interested in using our work.
I just sat down with Kun, and it turns out the changes we made to allow linear scaling for the periodic code, were never put back into this aperiodic code :P kun is patching the routines now. He'll post here when it's done and tested.
Best- John
I believe this was patched and fixed months ago. Closing issue.
I tried to train
TFBehlerParinello
with my own data but I get the following error:As far as I can tell, my input is shaped/formatted correctly
Inputs to
tf_symmetry_functions
also seem fine to me:Any help would be welcome!