jparkhill / TensorMol

Tensorflow + Molecules = TensorMol
http://blogs.nd.edu/parkhillgroup
GNU General Public License v3.0

TFBehlerParinello: Dimension problems with tf_symmetry_functions #15

Closed stefdoerr closed 6 years ago

stefdoerr commented 6 years ago

I tried to train TFBehlerParinello with my own data but I get the following error:

  File "<ipython-input-7-22a9f6312119>", line 1, in <module>
    model.train()
  File "TFBehlerParinello.py", line 607, in train
    self.compute_normalization()
  File "TFBehlerParinello.py", line 246, in compute_normalization
    angular_cutoff, radial_rs, angular_rs, theta_s, zeta, eta)
  File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 2926, in tf_symmetry_functions
    angular_scatter_indices = tf.concat([triples_indices, tf.expand_dims(triples_elements_indices, axis=1)], axis=1)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1099, in concat
    return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 706, in _concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [78587,4] vs. shape[1] = [27483,1]
     [[Node: concat_104 = ConcatV2[N=2, T=DT_INT32, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](concat_100, ExpandDims_222, Fill/dims/1)]]
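
For reference, tf.concat along axis=1 requires both operands to have the same number of rows, which is what the message above is complaining about. A minimal standalone reproduction of the same error class, with made-up shapes:

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.int32, shape=[None, 4])
b = tf.placeholder(tf.int32, shape=[None, 1])
c = tf.concat([a, b], axis=1)

with tf.Session() as sess:
    # Feeding 5 rows vs 3 rows triggers the same InvalidArgumentError:
    # "ConcatOp : Dimensions of inputs should match"
    sess.run(c, feed_dict={a: np.zeros((5, 4), np.int32),
                           b: np.zeros((3, 1), np.int32)})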

As far as I can tell, my input is shaped/formatted correctly:

ipdb> 
> TFBehlerParinello.py(273)compute_normalization()
    271                 gradients_list.append(batch_data[3][molecule, :num_atoms[molecule]])
    272             embedding, molecule_index = sess.run([embeddings, molecule_indices],
--> 273                                                  feed_dict={xyzs_pl: batch_data[0], Zs_pl: batch_data[1]})
    274             for element in range(len(self.elements)):
    275                 embeddings_list[element].append(embedding[element])
ipdb> batch_data[0].shape
(100, 699, 3)
ipdb> batch_data[1].shape
(100, 699)
ipdb> batch_data[0].dtype
dtype('float32')
ipdb> batch_data[1].dtype
dtype('uint8')

Inputs to tf_symmetry_functions also seem fine to me:

ipdb> self.elements
(8, 1)
ipdb> self.element_pairs
array([[8, 8],
       [8, 1],
       [1, 1]])
ipdb> self.radial_rs
array([ 0.00000000,  0.14375000,  0.28750000,  0.43125000,  0.57500000,
        0.71875000,  0.86250000,  1.00625000,  1.15000000,  1.29375000,
        1.43750000,  1.58125000,  1.72500000,  1.86875000,  2.01250000,
        2.15625000,  2.30000000,  2.44375000,  2.58750000,  2.73125000,
        2.87500000,  3.01875000,  3.16250000,  3.30625000,  3.45000000,
        3.59375000,  3.73750000,  3.88125000,  4.02500000,  4.16875000,
        4.31250000,  4.45625000])
ipdb> self.angular_rs
array([ 0.00000000,  0.38750000,  0.77500000,  1.16250000,  1.55000000,
        1.93750000,  2.32500000,  2.71250000])
ipdb> self.zeta
8.0
ipdb> self.eta
4.0
ipdb> self.theta_s
array([ 0.00000000,  0.78539816,  1.57079633,  2.35619449,  3.14159265,
        3.92699082,  4.71238898,  5.49778714])
ipdb> self.angular_cutoff
3.1
ipdb> self.radial_cutoff
4.6

Any help would be welcome!

jparkhill commented 6 years ago

Stephan- The class you're trying to use is a clean rewrite, still in progress, of the network we used in the paper, "fc_sqdiff_BP_Direct_EE_ChargeEncode_Update_vdw_DSF_elu_Normalize_Dropout". I can't blame you for trying to use the much better organized new code... But at the moment I think John H., who is developing it, is using spherical harmonics rather than symmetry functions as his embedding. Yesterday I asked Kun to make sure the symmetry functions are working with this new instance; I'll also be sure he puts some commentary on this thread as he tries it out.

Best- John

stefdoerr commented 6 years ago

Oof, that's reassuring 😄 I saw it was much simpler, so I thought I would start from there directly, but I was actually afraid that this was the legacy code and that you had moved to the other, more convoluted version. Glad to hear that, haha! I'll wait for Kun's advice then.

jeherr commented 6 years ago

I think the issue here is in the sorting of your self.elements and self.element_pairs variables. From the docstring for tf_symmetry_functions:

elements (tf.int32): NElements tensor containing **sorted** unique atomic numbers present
element_pairs (tf.int32): NElementPairs x 2 tensor containing **sorted** unique pairs of atomic numbers present

So your self.elements and self.element_pairs variables should be sorted from lowest atomic number to highest, like so:

self.elements = [1 8]
self.element_pairs = [[1 1], [1 8], [8 8]]
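
If it helps, one way to build both in the required order from an array of atomic numbers Zs (a numpy/itertools sketch, not an existing TensorMol helper):

import itertools
import numpy as np

Zs = np.array([8, 1, 1])             # e.g. a single water molecule
elements = np.unique(Zs[Zs > 0])     # drop zero padding; np.unique returns sorted values -> [1 8]
element_pairs = np.array(list(
    itertools.combinations_with_replacement(elements, 2)))   # -> [[1 1] [1 8] [8 8]]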

I should probably document better that "sorted" here means lowest to highest. I can't recall whether I made a way to sort the elements by default somehow, or if I have just had dumb luck and never hit this issue myself. I believe this should fix your issue.

stefdoerr commented 6 years ago

Thanks! I had missed that. Now I hit a memory error because my system contains 699 atoms. How do you scale to large systems with TensorMol?

  File "<ipython-input-8-22a9f6312119>", line 1, in <module>
    model.train()
  File "TFBehlerParinello.py", line 607, in train
    self.compute_normalization()
  File "TFBehlerParinello.py", line 246, in compute_normalization
    angular_cutoff, radial_rs, angular_rs, theta_s, zeta, eta)
  File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 2942, in tf_symmetry_functions
    [num_atoms, num_atoms, num_atoms, num_element_pairs, tf.shape(angular_rs)[0] * tf.shape(theta_s)[0]]),
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4379, in scatter_nd
    "ScatterNd", indices=indices, updates=updates, shape=shape, name=name)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[699,699,699,3,64]
     [[Node: ScatterNd_55 = ScatterNd[T=DT_DOUBLE, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](DynamicPartition_4:27, DynamicPartition_3:27, ScatterNd_55/shape)]]
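
For scale, a rough back-of-the-envelope estimate of that dense buffer, using the shape and DT_DOUBLE type from the message above:

# Dense [699, 699, 699, 3, 64] float64 tensor that ScatterNd tries to allocate.
entries = 699 * 699 * 699 * 3 * 64       # ~6.6e10 entries
print(entries * 8 / float(1024 ** 3))    # 8 bytes each -> ~488 GiB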

jparkhill commented 6 years ago

Only the old instance is sparse (linear-scaling CPU and memory) at this time. It's not too much work to transfer these features over from the old instance.

For the time being if you want the features from the paper, you've gotta use the old instance. I'll post here when I put sparsity in the new instance.

Best- John

stefdoerr commented 6 years ago

So if I don't use dipole data, but only energy and forces, which class should I use for now? Is MolInstance_DirectBP_Grad_Linear the correct one?

stefdoerr commented 6 years ago

Which is the scalable old version to train on energy and forces? I now tried the following:

PARAMS["learning_rate"] = 0.00001
PARAMS["momentum"] = 0.95
PARAMS["max_steps"] = 201
PARAMS["batch_size"] = 50
PARAMS["test_freq"] = 5
PARAMS["tf_prec"] = "tf.float64"
PARAMS["GradScalar"] = 1
PARAMS["NeuronType"] = "relu"
PARAMS["HiddenLayers"] = [200,200,200]
dig = MolDigester(TreatedAtoms, name_="ANI1_Sym_Direct", OType_="AtomizationEnergy")
tset = TensorMolData_BP_Direct_Linear(myset, dig, order_=1, num_indis_=1, type_="mol", WithGrad_=True) # Initialize TensorMolData that contains the training data
manager = TFMolManage("", tset, False, "fc_sqdiff_BP_Direct_Grad_Linear") # Initialize a manager that manages the training of the neural network.
manager.Train(maxstep=101)

But it runs out of GPU memory with batch sizes over 10, since it creates some batch_size * natoms * natoms * something arrays.

jparkhill commented 6 years ago

I see that over the weekend Kun added samples/training_sample.py. That would be the best place to start; you can shut off / cut out the dipole training pretty easily by removing it from the sess.run() calls in the training and evaluation. A batch size of 10 might seem too small, but it will still work. Another hack would simply be to use CPU training; TensorFlow uses virtual memory just fine on the CPU.
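
To illustrate the idea only, here is a toy TF 1.x graph with placeholder loss names (not TensorMol's actual training loop): tensors left out of the sess.run() fetch list are simply never computed, so dropping the dipole loss from the fetches (and from the total loss when it is assembled) turns that branch off.

import tensorflow as tf

x = tf.placeholder(tf.float64, shape=[None])
energy_loss = tf.reduce_mean(tf.square(x))
dipole_loss = tf.reduce_mean(tf.abs(x))   # the branch to drop
total_loss = energy_loss                  # assemble the total without the dipole term

with tf.Session() as sess:
    # Only the requested tensors are evaluated; dipole_loss is never run.
    e, t = sess.run([energy_loss, total_loss], feed_dict={x: [1.0, 2.0, 3.0]})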

Other comments:

stefdoerr commented 6 years ago

Sure, here is the traceback (batch size 50, 233 waters = 699 atoms in the system):

  File "<ipython-input-8-e65f5b401744>", line 13, in <module>
    manager.Train(maxstep=101)
  File "/shared/sdoerr/Work/TensorMol/TensorMol/TFMolManage.py", line 130, in Train
    self.Instances.train(self.n_train)
  File "/shared/sdoerr/Work/TensorMol/TensorMol/TFMolInstance.py", line 69, in train
    self.TrainPrepare(continue_training)
  File "/shared/sdoerr/Work/TensorMol/TensorMol/TFMolInstanceDirect.py", line 1967, in TrainPrepare
    self.Scatter_Sym, self.Sym_Index  = TFSymSet_Scattered_Linear(self.xyzs_pl, self.Zs_pl, Ele, SFPr2, Rr_cut, Elep, SFPa2, zeta, eta, Ra_cut, self.Radp_pl, self.Angt_pl)
  File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 2075, in TFSymSet_Scattered_Linear
    GMR = tf.reshape(TFSymRSet_Linear(R, Zs, eles_, SFPsR_, eta, Rr_cut, Radp),[nmol, natom,-1], name="FinishGMR")
  File "/shared/sdoerr/Work/TensorMol/TensorMol/RawEmbeddings.py", line 1619, in TFSymRSet_Linear
    to_reduce2 = tf.scatter_nd(ind2,Gm,tf.cast([nmol,natom,nele,natom,nsym], dtype=tf.int64))
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4379, in scatter_nd
    "ScatterNd", indices=indices, updates=updates, shape=shape, name=name)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/shared/sdoerr/Software/miniconda3/envs/tensormol/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[50,699,2,699,32]
     [[Node: ScatterNd = ScatterNd[T=DT_DOUBLE, Tindices=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape_11, Reshape_7, Cast_3/_193)]]
     [[Node: total_loss/_379 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2773_total_loss", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
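
For scale, arithmetic on that shape (assuming 8-byte DT_DOUBLE entries; note that natoms = 699 appears twice, which is the quadratic growth described below):

# Shape from the OOM above: [50, 699, 2, 699, 32], i.e. batch_size x natoms x nelements x natoms x nsym.
entries = 50 * 699 * 2 * 699 * 32
print(entries * 8 / float(1024 ** 3))   # ~11.6 GiB for this single buffer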

I mean, yes, a batch size of 10 is not the problem; rather, the version it is using scales quadratically with the number of atoms in the system. The density is pretty standard water density at 300 K and 1 atm pressure.

Ah yes, the relu; I forgot to change it. By the way, did you ever test elu against softplus? (Just out of curiosity, since it's very popular and you already use elu for the DSF kernel, but I didn't see it mentioned as an activation function.)

Would the code also be able to train on periodic data? I am using a periodic water box, so I imagine the energies and forces at the edges would trip it up if it doesn't understand the periodicity.

Sorry for the sheer amount of questions 😄

jparkhill commented 6 years ago

Stefan- No need to apologize! You are helping improve the robustness of the code. Honestly I am thrilled that people with your level of intellect and capability are interested in using our work.

I just sat down with Kun, and it turns out the changes we made to allow linear scaling in the periodic code were never put back into this aperiodic code :P Kun is patching the routines now. He'll post here when it's done and tested.

Best- John

jeherr commented 6 years ago

I believe this was patched and fixed months ago. Closing issue.