vlimant / NNLO

GNU General Public License v3.0

torch model cloning issue #18

Open vlimant opened 5 years ago

vlimant commented 5 years ago

Using the model from https://github.com/vlimant/ornl-nnlo/blob/master/hls4mlJEDI.py with the torch backend, the command

mpirun -np 3 --tag-output python3 TrainingDriver.py --model hls4mlJEDI.py --loss categorical_crossentropy --epochs 1 --backend torch

fails when the model is cloned:

[1,0]<stderr>: model = copy.deepcopy(self.model)

vlimant commented 5 years ago

Full stack trace:


[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "TrainingDriver.py", line 290, in <module>
[1,0]<stderr>:    checkpoint=args.checkpoint, checkpoint_interval=args.checkpoint_interval)
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 164, in __init__
[1,0]<stderr>:    self.make_comms(comm)
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 252, in make_comms
[1,0]<stderr>:    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 661, in __init__
[1,0]<stderr>:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 117, in __init__
[1,0]<stderr>:    self.build_model()
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 964, in build_model
[1,0]<stderr>:    super(MPIMaster, self).build_model(local_session=self.threaded_validation)
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 163, in build_model
[1,0]<stderr>:    self.model = self.model_builder.build_model(local_session=local_session)
[1,0]<stderr>:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 502, in build_model
[1,0]<stderr>:    model = copy.deepcopy(self.model)
[1,0]<stderr>:  File "/usr/lib/python3.5/copy.py", line 182, in deepcopy
[1,0]<stderr>:    y = _reconstruct(x, rv, 1, memo)
[1,0]<stderr>:  File "/usr/lib/python3.5/copy.py", line 297, in _reconstruct
[1,0]<stderr>:    state = deepcopy(state, memo)
[1,0]<stderr>:  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
[1,0]<stderr>:    y = copier(x, memo)
[1,0]<stderr>:  File "/usr/lib/python3.5/copy.py", line 243, in _deepcopy_dict
[1,0]<stderr>:    y[deepcopy(key, memo)] = deepcopy(value, memo)
[1,0]<stderr>:  File "/usr/lib/python3.5/copy.py", line 166, in deepcopy
[1,0]<stderr>:    y = copier(memo)
[1,0]<stderr>:  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/variable.py", line 91, in __deepcopy__
[1,0]<stderr>:    raise RuntimeError("Only Variables created explicitly by the user "
[1,0]<stderr>:RuntimeError: Only Variables created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
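
The traceback shows `copy.deepcopy` walking the model's attribute dict and reaching a tensor whose `__deepcopy__` refuses to copy it because it is not a graph leaf. A minimal sketch of how this can happen and two possible ways around it follows. It assumes the cause is a tensor produced by an operation and stored as a plain attribute on the module; the contents of hls4mlJEDI.py are not shown in this thread, so that is a guess. `BrokenNet`, `FixedNet`, `deepcopy_fails`, and `clone_via_state_dict` are hypothetical names for illustration, not part of NNLO, and the sketch targets a recent PyTorch rather than the version in the traceback.

```python
import copy

import torch
import torch.nn as nn


class BrokenNet(nn.Module):
    """Hypothetical model that stores a non-leaf tensor as a plain attribute."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        base = torch.ones(4, requires_grad=True)
        # Result of an op on a tensor that requires grad: it has a grad_fn,
        # so it is not a graph leaf and cannot be deep-copied.
        self.scale = base * 2.0

    def forward(self, x):
        return self.fc(x * self.scale)


class FixedNet(nn.Module):
    """Same layout, but the constant is a buffer with no autograd history."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        # No requires_grad involved, so the product is a leaf tensor.
        self.register_buffer("scale", torch.ones(4) * 2.0)

    def forward(self, x):
        return self.fc(x * self.scale)


def deepcopy_fails(model):
    """Return True if copy.deepcopy(model) raises RuntimeError."""
    try:
        copy.deepcopy(model)
    except RuntimeError:
        return True
    return False


def clone_via_state_dict(model, factory):
    """Clone by building a fresh instance and copying parameters and buffers.

    `factory` is any zero-argument callable that returns a new model of the
    same architecture (an assumption of this sketch).
    """
    clone = factory()
    clone.load_state_dict(model.state_dict())
    return clone
```

If this assumption holds, either registering such constants as buffers (or detaching them) in the model definition, or replacing the `copy.deepcopy(self.model)` call with a rebuild-and-`load_state_dict` clone, should avoid the error. The second option leaves the model file untouched but needs a way to construct a fresh instance.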