Open vlimant opened 5 years ago
Full stack trace:
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "TrainingDriver.py", line 290, in <module>
[1,0]<stderr>: checkpoint=args.checkpoint, checkpoint_interval=args.checkpoint_interval)
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 164, in __init__
[1,0]<stderr>: self.make_comms(comm)
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 252, in make_comms
[1,0]<stderr>: checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 661, in __init__
[1,0]<stderr>: checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 117, in __init__
[1,0]<stderr>: self.build_model()
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 964, in build_model
[1,0]<stderr>: super(MPIMaster, self).build_model(local_session=self.threaded_validation)
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 163, in build_model
[1,0]<stderr>: self.model = self.model_builder.build_model(local_session=local_session)
[1,0]<stderr>: File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 502, in build_model
[1,0]<stderr>: model = copy.deepcopy(self.model)
[1,0]<stderr>: File "/usr/lib/python3.5/copy.py", line 182, in deepcopy
[1,0]<stderr>: y = _reconstruct(x, rv, 1, memo)
[1,0]<stderr>: File "/usr/lib/python3.5/copy.py", line 297, in _reconstruct
[1,0]<stderr>: state = deepcopy(state, memo)
[1,0]<stderr>: File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
[1,0]<stderr>: y = copier(x, memo)
[1,0]<stderr>: File "/usr/lib/python3.5/copy.py", line 243, in _deepcopy_dict
[1,0]<stderr>: y[deepcopy(key, memo)] = deepcopy(value, memo)
[1,0]<stderr>: File "/usr/lib/python3.5/copy.py", line 166, in deepcopy
[1,0]<stderr>: y = copier(memo)
[1,0]<stderr>: File "/usr/local/lib/python3.5/dist-packages/torch/autograd/variable.py", line 91, in __deepcopy__
[1,0]<stderr>: raise RuntimeError("Only Variables created explicitly by the user "
[1,0]<stderr>:RuntimeError: Only Variables created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
To reproduce, use https://github.com/vlimant/ornl-nnlo/blob/master/hls4mlJEDI.py as the model and run:
mpirun -np 3 --tag-output python3 TrainingDriver.py --model hls4mlJEDI.py --loss categorical_crossentropy --epochs 1 --backend torch
The run fails in:
[1,0]<stderr>: model = copy.deepcopy(self.model)
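
Reading the traceback, copy.deepcopy walks the model's attribute dict and reaches a Variable that is not a graph leaf, so its __deepcopy__ raises. The sketch below reproduces that mechanism without torch, as a guess at what is going on rather than a confirmed diagnosis. FakeVariable, FakeModel, last_output and safe_deepcopy are hypothetical names made up for illustration; none of them come from NNLO or from torch, and whether hls4mlJEDI.py actually stores such an attribute is an assumption.

```python
import copy


class FakeVariable:
    """Hypothetical stand-in for a torch Variable; not the real class."""

    def __init__(self, value, is_leaf=True):
        self.value = value
        self.is_leaf = is_leaf

    def __deepcopy__(self, memo):
        # Mimics the check seen in the traceback: only graph leaves can be copied.
        if not self.is_leaf:
            raise RuntimeError("Only graph leaves support the deepcopy protocol")
        return FakeVariable(copy.deepcopy(self.value, memo), is_leaf=True)


class FakeModel:
    """Hypothetical model: one parameter plus a cached forward-pass result."""

    def __init__(self):
        self.weight = FakeVariable([1.0, 2.0])  # leaf, like a parameter
        self.last_output = None                 # set by forward()

    def forward(self, x):
        # The result is computed from weight, so it is marked as a non-leaf.
        out = FakeVariable([w * x for w in self.weight.value], is_leaf=False)
        self.last_output = out  # keeping it on self is what breaks deepcopy
        return out


def safe_deepcopy(model, transient=("last_output",)):
    """Deep-copy model while the named transient attributes are set to None."""
    saved = {name: getattr(model, name) for name in transient}
    for name in transient:
        setattr(model, name, None)
    try:
        return copy.deepcopy(model)
    finally:
        # Put the cached values back on the original object.
        for name, value in saved.items():
            setattr(model, name, value)


model = FakeModel()
model.forward(3.0)

try:
    copy.deepcopy(model)
    failed = False
except RuntimeError:
    failed = True  # same failure path as in the traceback above

clone = safe_deepcopy(model)  # succeeds once the cached non-leaf is excluded
```

If this is indeed the cause, the torch-side fix would probably be to avoid keeping intermediate results as attributes of the model, or to store them detached (Tensor.detach() returns a leaf). Another option that sidesteps deepcopy entirely is to build a fresh model instance and transfer the weights with state_dict() and load_state_dict().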