vlimant / NNLO

GNU General Public License v3.0

optimizer initialization #25

Closed vlimant closed 5 years ago

vlimant commented 5 years ago

While running:

mpirun --prefix /opt/openmpi-3.1.0 -np 7 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 Optiropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --cache /imdata/ --block-size 3

I got this traceback on rank 5:

[1,5]:     block.run()
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/optimize/process_block.py", line 157, in run
[1,5]:     fom = self.train_model()
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/optimize/process_block.py", line 127, in train_model
[1,5]:     checkpoint_interval=self.checkpoint_interval)
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 420, in __init__
[1,5]:     checkpoint=checkpoint, checkpoint_interval=checkpoint_interval)
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 165, in __init__
[1,5]:     self.make_comms(comm)
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 266, in make_comms
[1,5]:     checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 493, in __init__
[1,5]:     checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 121, in __init__
[1,5]:     self.train()
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 562, in train
[1,5]:     self.sync_with_parent()
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 503, in sync_with_parent
[1,5]:     self.compute_update()
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/util/timeline.py", line 21, in wrapped_function
[1,5]:     ret_val = function(*args, **kwargs)
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 596, in compute_update
[1,5]:     self.update = self.algo.compute_update( self.weights, self.model.get_weights() )
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/train/algo.py", line 107, in compute_update
[1,5]:     return self.optimizer.begin_compute_update(cur_weights, new_weights)
[1,5]:   File "/nfshome/vlimant/NNLO/nnlo/train/optimizer.py", line 573, in begin_compute_update
[1,5]:     self.moment[idx] += update[idx]
[1,5]: ValueError: operands could not be broadcast together with shapes (51,32) (10,32) (51,32)
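For anyone hitting the same error: the shapes in the message suggest that the moment buffer has shape (51,32) while the incoming update has shape (10,32), which is what you would see if optimizer state sized for one model were reused with a differently shaped model. The snippet below is only a minimal sketch of that failure and of one possible guard. The helper name `ensure_moments` is hypothetical; it is not taken from NNLO and is not necessarily what the fix commit does.

```python
import numpy as np


def ensure_moments(moments, weights):
    """Hypothetical helper (not NNLO code): return moment buffers whose
    shapes match `weights`, resetting any buffer that does not match."""
    if moments is None or len(moments) != len(weights):
        return [np.zeros_like(w) for w in weights]
    return [
        m if m.shape == w.shape else np.zeros_like(w)
        for m, w in zip(moments, weights)
    ]


# Reproduce the failure: a stale (51, 32) buffer and a (10, 32) update.
stale_moment = [np.zeros((51, 32))]
update = [np.ones((10, 32))]

try:
    stale_moment[0] += update[0]  # in-place add cannot broadcast these shapes
    raised = False
except ValueError:
    raised = True

# With the guard, the mismatched buffer is re-created before accumulating.
moments = ensure_moments(stale_moment, update)
moments[0] += update[0]
```

Resetting the buffer discards accumulated momentum, so this only makes sense when the model really has changed between runs.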

vlimant commented 5 years ago

This could be solved with commit 4a26d0ee0d6345071a223b313df9b23c0dc4f9e9.

vlimant commented 5 years ago

That did the trick.