Open ntraft opened 2 years ago
Update: I think this is actually an issue with PyTorch on AMD as described here, and I have reproduced it outside of Vega. I am currently going through the fixes in that thread to see if they can help, and also trying to get access to an NVIDIA card so I can see if that works.
In that case, the only issue is the one I fixed in my fork. If I can confirm it works, I will submit a pull request and close this issue.
@ntraft :+1:
I was able to get the example to run to the end. Unrelated to MPI, I have another question:
Is it running successfully? For a successful run, what kind of output should I expect?
Because I don't see any output model anywhere. The doc here says that it should output a model at tasks/<task id>/output/fully_train/model_<id>.pth, but for me this folder only contains:
$ l tasks/0405.122724.100/output/fully_train/
desc_1.json output.csv performance_1.json
I can see some .pt files in tasks/0405.122724.100/workers/nas/0/, but those appear to only be checkpoints made during the nas stage.
Here is the output from the end of my pipeline:
INFO:root:worker id [0], epoch [500/500], train step [180/196], loss [ 0.035, 0.083], lr [ 0.0010004], time pre batch [0.601s] , total mean time per batch [0.615s]
INFO:root:worker id [0], epoch [500/500], train step [190/196], loss [ 0.078, 0.083], lr [ 0.0010004], time pre batch [0.664s] , total mean time per batch [0.615s]
INFO:root:Finished the unified trainer successfully.
INFO:root:Update Success. step_name=nas, worker_id=0
INFO:root:Best values: [{'worker_id': 1, 'performance': {'accuracy': 0.8493999910354614}}]
INFO:root:Clean worker folder /gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/workers/nas.
INFO:root:------------------------------------------------
INFO:root: Step: fully_train
INFO:root:------------------------------------------------
INFO:vega.core.pipeline.train_pipe_step:init TrainPipeStep...
INFO:vega.core.pipeline.train_pipe_step:TrainPipeStep started...
INFO:root:Model was created.
/users/n/t/ntraft/miniconda3/envs/vega/lib/python3.8/site-packages/thop/vision/basic_hooks.py:92: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
kernel = torch.DoubleTensor([*(x[0].shape[2:])]) // torch.DoubleTensor(list((m.output_size,))).squeeze()
INFO:root:flops: 0.406770112 , params:3014.8920000000003
ERROR:root:Failed to run worker, id: 1, message: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
INFO:root:start evaluate process
INFO:vega.evaluator.evaluator:evalaute model without loading weights file
INFO:root:Model was created.
INFO:root:step [1/40], valid metric [[[tensor(0.0781, device='cuda:0'), tensor(0.5039, device='cuda:0')]]]
INFO:root:step [11/40], valid metric [[[tensor(0.0859, device='cuda:0'), tensor(0.4648, device='cuda:0')]]]
INFO:root:step [21/40], valid metric [[[tensor(0.1055, device='cuda:0'), tensor(0.4766, device='cuda:0')]]]
INFO:root:step [31/40], valid metric [[[tensor(0.0898, device='cuda:0'), tensor(0.4688, device='cuda:0')]]]
INFO:root:evaluator latency [106.40701539814472]
INFO:root:evaluate performance: {'accuracy': 0.1, 'accuracy_top1': 0.1, 'accuracy_top5': 0.5, 'latency': 106.40701539814472}
INFO:root:finished host evaluation, id: 1, performance: {'accuracy': 0.1, 'accuracy_top1': 0.1, 'accuracy_top5': 0.5, 'latency': 106.40701539814472}
INFO:root:------------------------------------------------
INFO:root: Pipeline end.
INFO:root:
INFO:root: task id: 0405.122724.100
INFO:root: output folder: /gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/output
INFO:root:
INFO:root: running time:
INFO:root: nas: 22:37:42 [2022-04-05 12:28:51.638879 - 2022-04-06 11:06:34.411202]
INFO:root: fully_train: 0:00:15 [2022-04-06 11:06:34.426958 - 2022-04-06 11:06:50.167165]
INFO:root:
INFO:root: result:
INFO:root: 1: {'flops': 0.406770112, 'params': 3014.8920000000003, 'accuracy': 0.1, 'accuracy_top1': 0.1, 'accuracy_top5': 0.5, 'latency': 106.40701539814472}
INFO:root:------------------------------------------------
Note that it does say "Failed to run worker", and also those numbers look suspiciously discrete to me ('accuracy_top1': 0.1, 'accuracy_top5': 0.5).
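One possible reading of those round numbers, sketched below: the log says the evaluator ran "without loading weights file", and the thread later mentions CIFAR-10, which has 10 classes. Under the assumption of balanced classes and a model that is effectively guessing, the expected top-k accuracy is k divided by the number of classes. The function name `chance_topk_accuracy` is made up for this illustration and is not part of Vega.

```python
from fractions import Fraction


def chance_topk_accuracy(num_classes: int, k: int) -> Fraction:
    """Expected top-k accuracy of a uniformly random guesser on balanced classes."""
    if not 1 <= k <= num_classes:
        raise ValueError("k must be between 1 and num_classes")
    return Fraction(k, num_classes)


# Assumption: 10 classes, as in CIFAR-10.
top1 = chance_topk_accuracy(10, 1)
top5 = chance_topk_accuracy(10, 5)
print(float(top1), float(top5))
```

If the reported accuracies sit at these values, that is consistent with evaluating an untrained model, which would match the failed training step rather than a problem in the evaluator itself.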
I also tried to resume the run with this command:
python vega/tools/run_pipeline.py examples/nas/cars/cars.yml --resume -t 0405.122724.100
But it seemed to start over from the beginning of the nas step.
I also tried re-running just the fully_train step. I did this by editing the config to delete the nas part so that I only had pipeline: [fully_train], then resuming as above. This yielded a similar error, but with a stack trace:
Traceback (most recent call last):
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/scheduler/local_master.py", line 64, in run
worker.train_process()
File "/gpfs2/scratch/ntraft/Development/vega/vega/common/wrappers.py", line 75, in wrapper
r = func(self, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 136, in train_process
self._train_loop()
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 277, in _train_loop
self._train_epoch()
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 109, in _train_epoch
train_batch_output = self.train_step(batch)
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 175, in _default_train_step
loss.backward()
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint:
enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
And the numbers were slightly different:
evaluate performance: {'accuracy': 0.10006009615384616, 'accuracy_top1': 0.10006009615384616, 'accuracy_top5': 0.5001001602564102, 'latency': 333.22229012846947}
Finally, I re-ran one more time with torch.autograd.set_detect_anomaly(True), as the error suggests, and here is the new error I get (sorry, it's a long one):
MIOpen(HIP): Warning [SQLiteBase] Unable to read system database file:gfx906_60.kdb Performance may degrade
/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/thop/vision/basic_hooks.py:92: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like
the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor')
.
kernel = torch.DoubleTensor([*(x[0].shape[2:])]) // torch.DoubleTensor(list((m.output_size,))).squeeze()
[2022-04-06 17:08:13] [INFO] flops: 0.406770112 , params:3014.8920000000003
[2022-04-06 17:08:13] [INFO] skip loading checkpoint file that do not exist, /gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/workers/fully_train/1/checkpoint.pth
[W python_anomaly_mode.cpp:104] Warning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
File "vega/tools/run_pipeline.py", line 239, in <module>
main()
File "vega/tools/run_pipeline.py", line 235, in main
vega.run(config)
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/run.py", line 43, in run
_run_pipeline()
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/run.py", line 72, in _run_pipeline
Pipeline().run()
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/pipeline.py", line 86, in run
pipestep.do()
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/train_pipe_step.py", line 56, in do
self._train_multi_models(records)
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/train_pipe_step.py", line 87, in _train_multi_models
self.train_model(trainer)
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/train_pipe_step.py", line 111, in train_model
self.master.run(trainer, evaluator)
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/scheduler/local_master.py", line 64, in run
worker.train_process()
File "/gpfs2/scratch/ntraft/Development/vega/vega/common/wrappers.py", line 75, in wrapper
r = func(self, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 136, in train_process
self._train_loop()
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 277, in _train_loop
self._train_epoch()
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 109, in _train_epoch
train_batch_output = self.train_step(batch)
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 166, in _default_train_step
output = self.model(input) if not isinstance(input, dict) else self.model(**input)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
return self.call(inputs, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/networks/super_network.py", line 104, in call
s0, s1 = s1, cell(s0, s1, weights, self.drop_path_prob)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
return self.call(inputs, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/cell.py", line 118, in call
h = op(states[inp])
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
return self.call(inputs, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/mix_ops.py", line 104, in call
x = model(x)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
return self.call(inputs, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 164, in call
output = model(output, *args, **kwargs)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 392, in forward
return super().forward(x)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 98, in forward
return F.relu(input, inplace=self.inplace)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/functional.py", line 1299, in relu
result = torch.relu(input)
(function _print_stack)
[2022-04-06 17:08:19] [ERROR] Traceback (most recent call last):
File "/gpfs2/scratch/ntraft/Development/vega/vega/core/scheduler/local_master.py", line 64, in run
worker.train_process()
File "/gpfs2/scratch/ntraft/Development/vega/vega/common/wrappers.py", line 75, in wrapper
r = func(self, *args, **kwargs)
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 136, in train_process
self._train_loop()
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 277, in _train_loop
self._train_epoch()
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 109, in _train_epoch
train_batch_output = self.train_step(batch)
File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 175, in _default_train_step
loss.backward()
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint:
the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
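For readers unfamiliar with this class of error, here is a minimal sketch that triggers the same kind of message outside of Vega. It is not Vega's code, and it does not identify which Vega operation modifies the tensor. It only shows the general mechanism: autograd keeps the ReLU output for the backward pass, so changing that output in place afterwards invalidates it. The tensor shapes and variable names are arbitrary.

```python
import torch

# Failing pattern: modify the output of ReLU in place before calling backward().
x = torch.randn(4, 3, requires_grad=True)
y = torch.relu(x)   # autograd saves this output to compute ReLU's gradient
y.add_(1.0)         # in-place edit bumps the tensor's version counter

try:
    y.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(err)      # mentions "output 0 of ReluBackward0"

# Out-of-place variant: the ReLU output is left untouched, so backward() works.
x2 = torch.randn(4, 3, requires_grad=True)
z = torch.relu(x2) + 1.0
z.sum().backward()
print(x2.grad.shape)
```

In the trace above, the forward call flagged by anomaly detection is the F.relu call in activation.py, so the tensor being modified is the output of that activation. Where the later in-place change happens is not established in this thread.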
@ntraft Hi, thank you for using our AutoML tool.
If it runs successfully, there will be four files in the directory 'tasks/<task id>/workers/fully_train/0':
checkpoint.pth desc_0.json model_0.pth performance_0.json
There is no .pth file in the 'tasks/0405.122724.100/output/fully_train/' directory because the run failed. The following error is reported, and we are analyzing this problem:
ERROR:root:Failed to run worker, id: 1, message: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
The evaluation results are not discrete. You can see the evaluation results of some intermediate steps.
INFO:root:step [1/40], valid metric [[[tensor(0.0781, device='cuda:0'), tensor(0.5039, device='cuda:0')]]]
INFO:root:step [11/40], valid metric [[[tensor(0.0859, device='cuda:0'), tensor(0.4648, device='cuda:0')]]]
INFO:root:step [21/40], valid metric [[[tensor(0.1055, device='cuda:0'), tensor(0.4766, device='cuda:0')]]]
INFO:root:step [31/40], valid metric [[[tensor(0.0898, device='cuda:0'), tensor(0.4688, device='cuda:0')]]]
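A quick way to check a run against the four files listed above is sketched here. The helper name `missing_fully_train_files` is hypothetical, and the assumption that the `_0` suffix in the file names follows the worker id is a guess based on the `desc_1.json` and `performance_1.json` files seen earlier in the thread.

```python
import tempfile
from pathlib import Path


def missing_fully_train_files(worker_dir, model_id):
    """Return the expected fully_train files that are absent from worker_dir."""
    # File names follow the list given above; the id suffix is an assumption.
    expected = [
        "checkpoint.pth",
        f"desc_{model_id}.json",
        f"model_{model_id}.pth",
        f"performance_{model_id}.json",
    ]
    root = Path(worker_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demo on a temporary folder that mimics the failed run reported earlier.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("desc_1.json", "output.csv", "performance_1.json"):
        (Path(tmp) / name).touch()
    demo_missing = missing_fully_train_files(tmp, 1)
    print(demo_missing)
```

An empty list would indicate that all four files are present; any entry ending in .pth points to a training step that did not finish.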
@ntraft Hi, can you tell me which version of torch you use, and whether you use it on an NVIDIA card?
Thanks!
As a matter of fact, I reproduced this error on both types of nodes that I have access to:
- Torch version: 1.11.0, CUDA version: 10.2, NVIDIA Tesla V100 (32GB) (originally ran here)
- Torch version: 1.10.2+rocm4.2, AMD Radeon Instinct MI50 (32GB) (re-ran the fully_train here)
@ntraft At present, the latest version of Torch supported by noah-vega is 1.5.0. You can re-run with torch==1.5.0.
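As a small illustration of how the two reported versions compare against the 1.5.0 ceiling, the sketch below parses the version strings by hand. It assumes simple dotted release numbers with an optional local suffix such as "+rocm4.2"; the helper `release_tuple` is hypothetical and is not a Vega or PyTorch API.

```python
def release_tuple(version: str):
    """Turn '1.10.2+rocm4.2' into (1, 10, 2), ignoring the local '+...' suffix."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


MAX_SUPPORTED = "1.5.0"  # taken from the comment above

results = {}
for reported in ("1.11.0", "1.10.2+rocm4.2", "1.5.0"):
    results[reported] = release_tuple(reported) > release_tuple(MAX_SUPPORTED)
    print(reported, "newer than supported:", results[reported])
```

Comparing the parts as integers matters here: as plain strings, "1.11.0" sorts before "1.5.0", which would wrongly suggest the installed version is older than the supported one.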
@ntraft
We found the following message in your log:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
You can try using an earlier version of torch. This issue will be resolved in a later version.
Seems to be working with torch 1.5.0 on NVIDIA Tesla V100.
I'd like to check with you to see if I am using the best settings for performance. By default, it looks like the fully_train step is set to run for 600 epochs (on CIFAR-10), but even after running for 48 hours I only had 393 epochs. However, I was only using 1 NVIDIA Tesla V100 (for me this gave total mean time per batch [0.760s]). If I restart the fully_train step with 2 GPUs and 4 CPUs, there is no improvement unless I use -pt true. But with -pt, now I get 0.240s. Going up to 4 GPUs only helped very slightly (0.235s). Going up to 8 GPUs, 8 CPUs didn't help at all after that (still 0.235s). I also made sure to use NUMEXPR_MAX_THREADS=<number-of-cores> each time.
Any other guidance on how to parallelize effectively (for both nas and fully_train)?
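To put the timings above side by side, here is a back-of-the-envelope sketch. It uses only the numbers quoted in this comment. The projection assumes the epoch rate observed over the first 48 hours stays constant, and the speedup figures compare per-batch times only; whether the number of batches per epoch changes with -pt is not known from this thread. The `Run` class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    label: str
    gpus: int
    sec_per_batch: float


# Per-batch times as reported above.
runs = [
    Run("1 GPU", 1, 0.760),
    Run("2 GPUs with -pt", 2, 0.240),
    Run("4 GPUs with -pt", 4, 0.235),
    Run("8 GPUs with -pt", 8, 0.235),
]


def speedup(run: Run, baseline: Run) -> float:
    """Ratio of baseline per-batch time to this run's per-batch time."""
    return baseline.sec_per_batch / run.sec_per_batch


def projected_hours(total_epochs: int, epochs_done: int, hours_spent: float) -> float:
    """Wall-clock hours for total_epochs at the observed epoch rate."""
    return total_epochs * hours_spent / epochs_done


for run in runs:
    print(f"{run.label}: {speedup(run, runs[0]):.2f}x per batch")

single_gpu_hours = projected_hours(600, 393, 48.0)
print(f"600 epochs on 1 GPU at the observed rate: {single_gpu_hours:.1f} h")
```

At the observed rate, the single-GPU run would need roughly 73 hours for all 600 epochs. The per-batch speedup flattens out after 2 GPUs, which is the plateau described above.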
@ntraft
You can try using Horovod by changing the pipe step type from TrainPipeStep to HorovodTrainStep. You only need to modify this parameter in the original configuration file, and you do not need to use the -pt parameter.
Horovod needs to be installed before running (pip3 install horovod).
fully_train:
    pipe_step:
        # type: TrainPipeStep
        type: HorovodTrainStep
To retrain a previously searched network, modify the models_folder parameter and use an absolute path:
fully_train:
    pipe_step:
        # models_folder: "{local_base_path}/output/nas/"
        models_folder: "/gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/output/nas"
This might be related: I also tried CARS on a custom dataset. The NAS step works, but the fullytrain step fails:
2022-06-11 19:55:02.401 INFO ------------------------------------------------
2022-06-11 19:55:02.402 INFO Step: fullytrain
2022-06-11 19:55:02.404 INFO ------------------------------------------------
2022-06-11 19:55:02.410 INFO init TrainPipeStep...
2022-06-11 19:55:02.411 INFO TrainPipeStep started...
2022-06-11 19:55:02.561 INFO Model was created.
2022-06-11 19:55:03.371 WARNING model statics failed, ex=conv2d() received an invalid combination of arguments - got (NoneType, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:
2022-06-11 19:55:04.665 ERROR Failed to run worker, id: 2, message: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
2022-06-11 19:55:04.674 INFO start evaluate process
2022-06-11 19:55:04.677 INFO evalaute model without loading weights file
2022-06-11 19:55:04.813 INFO Model was created.
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py", line 16, in
Please provide the versions of torch, CUDA, and Python, as well as the YAML configuration file. @peter-ch
Hello,
Thanks so much for this rich AutoML project! I enjoyed the CARS paper and I'm excited to try it out.
The basic example given here is not working properly:
vega ./nas/cars/cars.yml
The first error I got was a basic syntax error introduced in Release 1.8.2, which I fixed in my fork here.
Now, I have an error which is not as simple for me to figure out. Here is some of the output:
MIOpen(HIP): Warning [SQLiteBase] Unable to read system database file:gfx906_60.kdb Performance may degrade
MIOpen(HIP): Warning [FindWinogradSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen(HIP): Warning [FindDataDirectSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen(HIP): Warning [FindDataImplicitGemmSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen(HIP): Warning [FindFftSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen Error: /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
ERROR:root:Failed to run worker, id: 0, message: miopenStatusInternalError
INFO:root:Update Success. step_name=nas, worker_id=0
INFO:root:waiting for the workers [0] to finish
INFO:root:Best values: []
WARNING:root:Failed to dump pareto front records, report is emplty.
INFO:root:------------------------------------------------
INFO:root: Step: fully_train
INFO:root:------------------------------------------------
INFO:vega.core.pipeline.train_pipe_step:init TrainPipeStep...
INFO:vega.core.pipeline.train_pipe_step:TrainPipeStep started...
WARNING:root:Failed to dump records, report is emplty.
I'm not sure if this has anything to do with MPI, but my mpirun simply prints a help message without error.
I tried running this two ways and got the same error. One way:
$ python vega/tools/run_pipeline.py examples/nas/cars/cars.yml
Second way:
$ pip install .
$ vega examples/nas/cars/cars.yml
Hi, I'm new to ML in general. I was wondering if you got the CARS example running correctly. Did you have to change the yml in some way to get it working?
@Isaspinach I abandoned this project. I don't think I made any changes to the yml? Not sure though. I did have to make one small code change to fix an error.
You are probably better off using a more recent and well-supported method though. I don't really know if there are any stable NAS libraries out there. I feel like research in that area has been less active in recent years.