huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/

Basic CARS example not working #231

Open · ntraft opened this issue 2 years ago

ntraft commented 2 years ago

Hello,

Thanks so much for this rich AutoML project! I enjoyed the CARS paper and I'm excited to try it out.

The basic example given here is not working properly:

vega ./nas/cars/cars.yml

The first error I got was a basic syntax error introduced in Release 1.8.2, which I fixed in my fork here.

Now, I have an error which is not as simple for me to figure out. Here is some of the output:

MIOpen(HIP): Warning [SQLiteBase] Unable to read system database file:gfx906_60.kdb Performance may degrade
MIOpen(HIP): Warning [FindWinogradSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen(HIP): Warning [FindDataDirectSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen(HIP): Warning [FindDataImplicitGemmSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen(HIP): Warning [FindFftSolutions] /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
MIOpen Error: /MIOpen/src/sqlite_db.cpp:107: open memvfs: unable to open database file
ERROR:root:Failed to run worker, id: 0, message: miopenStatusInternalError
INFO:root:Update Success. step_name=nas, worker_id=0
INFO:root:waiting for the workers [0] to finish
INFO:root:Best values: []
WARNING:root:Failed to dump pareto front records, report is emplty.
INFO:root:------------------------------------------------
INFO:root:  Step: fully_train
INFO:root:------------------------------------------------
INFO:vega.core.pipeline.train_pipe_step:init TrainPipeStep...
INFO:vega.core.pipeline.train_pipe_step:TrainPipeStep started...
WARNING:root:Failed to dump records, report is emplty.

I'm not sure if this has anything to do with MPI, but my mpirun simply prints a help message without error.

I tried running this two ways, and got the same error. One way:

$  python vega/tools/run_pipeline.py examples/nas/cars/cars.yml

Second way:

$  pip install .
$  vega examples/nas/cars/cars.yml
ntraft commented 2 years ago

Update: I think this is actually an issue with PyTorch on AMD as described here, and I have reproduced it outside of Vega. I am currently going through the fixes in that thread to see if they can help, and also trying to get access to an NVIDIA card so I can see if that works.
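
For anyone who wants to check the same thing on their machine: a standalone test does not need Vega at all, since any convolution forward pass on the GPU should exercise MIOpen on ROCm builds of PyTorch. A minimal sketch of such a check (illustrative only; the exact repro in the linked thread may differ) is:

import torch
import torch.nn as nn

# ROCm builds of PyTorch still expose the GPU through the "cuda" device API.
device = torch.device("cuda")
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device)
x = torch.randn(8, 3, 32, 32, device=device)
y = conv(x)  # MIOpen kernel selection happens here; the sqlite/kdb errors surface at this point if the install is broken
print(y.shape)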

In that case, the only issue is the one I fixed in my fork. If I can confirm it works, I will submit a pull request and close this issue.

zhangjiajin commented 2 years ago

@ntraft :+1:

ntraft commented 2 years ago

I was able to get the example to run to the end. Unrelated to MPI, I have another question:

Is it running successfully? For a successful run, what kind of output should I expect?

Because I don't see any output model anywhere. The doc here says that it should output a model at tasks/<task id>/output/fully_train/model_<id>.pth, but for me this folder only contains:

$ l tasks/0405.122724.100/output/fully_train/
desc_1.json  output.csv  performance_1.json

I can see some .pt files in tasks/0405.122724.100/workers/nas/0/, but those appear to only be checkpoints made during the nas stage.

Here is the output from the end of my pipeline:

INFO:root:worker id [0], epoch [500/500], train step [180/196], loss [   0.035,    0.083], lr [   0.0010004],  time pre batch [0.601s] , total mean time per batch [0.615s]
INFO:root:worker id [0], epoch [500/500], train step [190/196], loss [   0.078,    0.083], lr [   0.0010004],  time pre batch [0.664s] , total mean time per batch [0.615s]
INFO:root:Finished the unified trainer successfully.
INFO:root:Update Success. step_name=nas, worker_id=0
INFO:root:Best values: [{'worker_id': 1, 'performance': {'accuracy': 0.8493999910354614}}]
INFO:root:Clean worker folder /gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/workers/nas.
INFO:root:------------------------------------------------
INFO:root:  Step: fully_train
INFO:root:------------------------------------------------
INFO:vega.core.pipeline.train_pipe_step:init TrainPipeStep...
INFO:vega.core.pipeline.train_pipe_step:TrainPipeStep started...
INFO:root:Model was created.
/users/n/t/ntraft/miniconda3/envs/vega/lib/python3.8/site-packages/thop/vision/basic_hooks.py:92: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  kernel = torch.DoubleTensor([*(x[0].shape[2:])]) // torch.DoubleTensor(list((m.output_size,))).squeeze()
INFO:root:flops: 0.406770112 , params:3014.8920000000003
ERROR:root:Failed to run worker, id: 1, message: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
INFO:root:start evaluate process
INFO:vega.evaluator.evaluator:evalaute model without loading weights file
INFO:root:Model was created.
INFO:root:step [1/40], valid metric [[[tensor(0.0781, device='cuda:0'), tensor(0.5039, device='cuda:0')]]]
INFO:root:step [11/40], valid metric [[[tensor(0.0859, device='cuda:0'), tensor(0.4648, device='cuda:0')]]]
INFO:root:step [21/40], valid metric [[[tensor(0.1055, device='cuda:0'), tensor(0.4766, device='cuda:0')]]]
INFO:root:step [31/40], valid metric [[[tensor(0.0898, device='cuda:0'), tensor(0.4688, device='cuda:0')]]]
INFO:root:evaluator latency [106.40701539814472]
INFO:root:evaluate performance: {'accuracy': 0.1, 'accuracy_top1': 0.1, 'accuracy_top5': 0.5, 'latency': 106.40701539814472}
INFO:root:finished host evaluation, id: 1, performance: {'accuracy': 0.1, 'accuracy_top1': 0.1, 'accuracy_top5': 0.5, 'latency': 106.40701539814472}
INFO:root:------------------------------------------------
INFO:root:  Pipeline end.
INFO:root:
INFO:root:  task id: 0405.122724.100
INFO:root:  output folder: /gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/output
INFO:root:
INFO:root:  running time:
INFO:root:               nas:  22:37:42  [2022-04-05 12:28:51.638879 - 2022-04-06 11:06:34.411202]
INFO:root:       fully_train:  0:00:15  [2022-04-06 11:06:34.426958 - 2022-04-06 11:06:50.167165]
INFO:root:
INFO:root:  result:
INFO:root:    1:  {'flops': 0.406770112, 'params': 3014.8920000000003, 'accuracy': 0.1, 'accuracy_top1': 0.1, 'accuracy_top5': 0.5, 'latency': 106.40701539814472}
INFO:root:------------------------------------------------

Note that it does say "Failed to run worker", and those final numbers look suspiciously discrete to me ('accuracy_top1': 0.1, 'accuracy_top5': 0.5), which is exactly chance level for 10-class CIFAR-10.

I also tried to resume the run with this command:

python vega/tools/run_pipeline.py examples/nas/cars/cars.yml --resume -t 0405.122724.100

But it seemed to start over from the beginning of the nas step.

I also tried re-running just the fully_train step. I did this by editing the config to delete the nas part so that I only had pipeline: [fully_train], then resuming as above. This yielded a similar error, but with a stack trace:

Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/scheduler/local_master.py", line 64, in run
    worker.train_process()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/common/wrappers.py", line 75, in wrapper
    r = func(self, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 136, in train_process
    self._train_loop()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 277, in _train_loop
    self._train_epoch()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 109, in _train_epoch
    train_batch_output = self.train_step(batch)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 175, in _default_train_step
    loss.backward()
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint:
 enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

And the numbers were slightly different:

evaluate performance: {'accuracy': 0.10006009615384616, 'accuracy_top1': 0.10006009615384616, 'accuracy_top5': 0.5001001602564102, 'latency': 333.22229012846947}

Finally, I re-ran one more time with torch.autograd.set_detect_anomaly(True) as the error suggests, and here is the new error I get (sorry, it's a long one):

MIOpen(HIP): Warning [SQLiteBase] Unable to read system database file:gfx906_60.kdb Performance may degrade
/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/thop/vision/basic_hooks.py:92: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like
 the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor')
.
  kernel = torch.DoubleTensor([*(x[0].shape[2:])]) // torch.DoubleTensor(list((m.output_size,))).squeeze()
[2022-04-06 17:08:13] [INFO] flops: 0.406770112 , params:3014.8920000000003
[2022-04-06 17:08:13] [INFO] skip loading checkpoint file that do not exist, /gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/workers/fully_train/1/checkpoint.pth
[W python_anomaly_mode.cpp:104] Warning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "vega/tools/run_pipeline.py", line 239, in <module>
    main()
  File "vega/tools/run_pipeline.py", line 235, in main
    vega.run(config)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/run.py", line 43, in run
    _run_pipeline()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/run.py", line 72, in _run_pipeline
    Pipeline().run()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/pipeline.py", line 86, in run
    pipestep.do()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/train_pipe_step.py", line 56, in do
    self._train_multi_models(records)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/train_pipe_step.py", line 87, in _train_multi_models
    self.train_model(trainer)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/pipeline/train_pipe_step.py", line 111, in train_model
    self.master.run(trainer, evaluator)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/scheduler/local_master.py", line 64, in run
    worker.train_process()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/common/wrappers.py", line 75, in wrapper
    r = func(self, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 136, in train_process
    self._train_loop()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 277, in _train_loop
    self._train_epoch()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 109, in _train_epoch
    train_batch_output = self.train_step(batch)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 166, in _default_train_step
    output = self.model(input) if not isinstance(input, dict) else self.model(**input)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/networks/super_network.py", line 104, in call
    s0, s1 = s1, cell(s0, s1, weights, self.drop_path_prob)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/cell.py", line 118, in call
    h = op(states[inp])
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/mix_ops.py", line 104, in call
    x = model(x)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 164, in call
    output = model(output, *args, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/modules/operators/functions/pytorch_fn.py", line 392, in forward
    return super().forward(x)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)
 (function _print_stack)
[2022-04-06 17:08:19] [ERROR] Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/vega/vega/core/scheduler/local_master.py", line 64, in run
    worker.train_process()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/common/wrappers.py", line 75, in wrapper
    r = func(self, *args, **kwargs)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 136, in train_process
    self._train_loop()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_base.py", line 277, in _train_loop
    self._train_epoch()
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 109, in _train_epoch
    train_batch_output = self.train_step(batch)
  File "/gpfs2/scratch/ntraft/Development/vega/vega/trainer/trainer_torch.py", line 175, in _default_train_step
    loss.backward()
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/users/n/t/ntraft/miniconda3/envs/vega-amd/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint:
 the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
hujiaxin0 commented 2 years ago

@ntraft Hi, thank you for using our AutoML tool.

If it runs successfully, there will be four files in the directory 'tasks/<task id>/workers/fully_train/0'.

 checkpoint.pth  desc_0.json  model_0.pth  performance_0.json

There is no *.pth file in the 'tasks/0405.122724.100/output/fully_train/' directory because the run failed. The following error is reported, and we are analyzing this problem.

ERROR:root:Failed to run worker, id: 1, message: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The evaluation results are not actually discrete; you can see the evaluation results of the intermediate steps below:

INFO:root:step [1/40], valid metric [[[tensor(0.0781, device='cuda:0'), tensor(0.5039, device='cuda:0')]]]
INFO:root:step [11/40], valid metric [[[tensor(0.0859, device='cuda:0'), tensor(0.4648, device='cuda:0')]]]
INFO:root:step [21/40], valid metric [[[tensor(0.1055, device='cuda:0'), tensor(0.4766, device='cuda:0')]]]
INFO:root:step [31/40], valid metric [[[tensor(0.0898, device='cuda:0'), tensor(0.4688, device='cuda:0')]]]
hujiaxin0 commented 2 years ago

@ntraft Hi, can you tell me which version of torch you are using, and whether you are running it on an NVIDIA card?

ntraft commented 2 years ago

Thanks!

As a matter of fact, I reproduced this error on both types of nodes that I have access to:

  • Torch version: 1.11.0, CUDA version: 10.2, NVIDIA Tesla V100 (32GB) (originally ran here)
  • Torch version: 1.10.2+rocm4.2, AMD Radeon Instinct MI50 (32GB) (re-ran the fully_train here)

hujiaxin0 commented 2 years ago

@ntraft At present, the latest version of Torch supported by noah-vega is 1.5.0; you can re-run with torch==1.5.0.

zhangjiajin commented 2 years ago

@ntraft

We found the following message in your log:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

You can try using an earlier version of torch. This issue will be resolved in a later release.
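
For readers hitting this outside of this example: the mechanism is generic PyTorch autograd behaviour rather than anything CARS-specific. A saved activation (here the output of a ReLU) is overwritten by a later in-place operation, so the backward pass refuses to use it. A minimal, self-contained sketch that raises the same class of RuntimeError (illustrative only; this is not Vega code, and the exact in-place op inside the CARS cells may differ) is:

import torch

def forward_backward(use_inplace_add):
    x = torch.randn(4, 3, requires_grad=True)
    h = torch.relu(x)      # ReluBackward0 saves its output for the backward pass
    if use_inplace_add:
        h += 1.0           # in-place: bumps the version of the tensor ReLU saved
    else:
        h = h + 1.0        # out-of-place: the saved output is left untouched
    h.sum().backward()

try:
    forward_backward(use_inplace_add=True)
except RuntimeError as e:
    print("fails like the issue above:", e)

forward_backward(use_inplace_add=False)
print("out-of-place variant backpropagates fine")

Downgrading torch presumably sidesteps a code path that newer releases make stricter or in-place; at the source level, the usual fix is to make the offending operation out-of-place (for example inplace=False activations).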

ntraft commented 2 years ago

Seems to be working with torch 1.5.0 on NVIDIA Tesla V100.

I'd like to check with you to see if I am using the best settings for performance. By default, it looks like the fully_train step is set to run for 600 epochs (on CIFAR-10), but even after running for 48 hours I had only completed 393 epochs. However, I was only using 1 NVIDIA Tesla V100 (for me this gave total mean time per batch [0.760s]). If I restart the fully_train step with 2 GPUs and 4 CPUs, there is no improvement unless I use -pt true; with -pt, I get 0.240s. Going up to 4 GPUs only helped very slightly (0.235s), and going up to 8 GPUs and 8 CPUs didn't help at all after that (still 0.235s). I also made sure to set NUMEXPR_MAX_THREADS=<number-of-cores> each time.

Any other guidance on how to parallelize effectively (for both nas and fully_train)?

zhangjiajin commented 2 years ago

@ntraft

You can try using Horovod by changing the pipe_step type from TrainPipeStep to HorovodTrainStep. You only need to modify this parameter in the original configuration file, and you do not need to use the -pt parameter. Horovod needs to be installed before running: `pip3 install horovod`.

fully_train:
    pipe_step:
        # type: TrainPipeStep
        type: HorovodTrainStep

To retrain a previously searched network, modify the models_folder parameter and use an absolute path.

fully_train:
    pipe_step:
        # models_folder: "{local_base_path}/output/nas/"
        models_folder: "/gpfs2/scratch/ntraft/Development/vega/tasks/0405.122724.100/output/nas"
peter-ch commented 2 years ago

This might be related: I also tried CARS on a custom dataset, and the NAS step works, but the fullytrain step fails:

2022-06-11 19:55:02.401 INFO ------------------------------------------------
2022-06-11 19:55:02.402 INFO Step: fullytrain
2022-06-11 19:55:02.404 INFO ------------------------------------------------
2022-06-11 19:55:02.410 INFO init TrainPipeStep...
2022-06-11 19:55:02.411 INFO TrainPipeStep started...
2022-06-11 19:55:02.561 INFO Model was created.
2022-06-11 19:55:03.371 WARNING model statics failed, ex=conv2d() received an invalid combination of arguments - got (NoneType, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:

2022-06-11 19:55:04.665 ERROR Failed to run worker, id: 2, message: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [96, 144, 8, 8]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
2022-06-11 19:55:04.674 INFO start evaluate process
2022-06-11 19:55:04.677 INFO evalaute model without loading weights file
2022-06-11 19:55:04.813 INFO Model was created.
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py", line 846, in launch_instance
    app.start()
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
    self._run_once()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
    handle._run()
  File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.7/dist-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 661, in <lambda>
    self.io_loop.add_callback(lambda: self._handle_events(self.socket, 0))
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 577, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 606, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 556, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    vega.run("my.yml")
  File "/usr/local/lib/python3.7/dist-packages/vega/core/run.py", line 43, in run
    _run_pipeline()
  File "/usr/local/lib/python3.7/dist-packages/vega/core/run.py", line 72, in _run_pipeline
    Pipeline().run()
  File "/usr/local/lib/python3.7/dist-packages/vega/core/pipeline/pipeline.py", line 86, in run
    pipestep.do()
  File "/usr/local/lib/python3.7/dist-packages/vega/core/pipeline/train_pipe_step.py", line 56, in do
    self._train_multi_models(records)
  File "/usr/local/lib/python3.7/dist-packages/vega/core/pipeline/train_pipe_step.py", line 87, in _train_multi_models
    self.train_model(trainer)
  File "/usr/local/lib/python3.7/dist-packages/vega/core/pipeline/train_pipe_step.py", line 111, in train_model
    self.master.run(trainer, evaluator)
  File "/usr/local/lib/python3.7/dist-packages/vega/core/scheduler/local_master.py", line 64, in run
    worker.train_process()
  File "/usr/local/lib/python3.7/dist-packages/vega/common/wrappers.py", line 75, in wrapper
    r = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/trainer/trainer_base.py", line 136, in train_process
    self._train_loop()
  File "/usr/local/lib/python3.7/dist-packages/vega/trainer/trainer_base.py", line 277, in _train_loop
    self._train_epoch()
  File "/usr/local/lib/python3.7/dist-packages/vega/trainer/trainer_torch.py", line 117, in _train_epoch
    train_batch_output = self.train_step(batch)
  File "/usr/local/lib/python3.7/dist-packages/vega/trainer/trainer_torch.py", line 174, in _default_train_step
    output = self.model(input) if not isinstance(input, dict) else self.model(**input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/networks/super_network.py", line 104, in call
    s0, s1 = s1, cell(s0, s1, weights, self.drop_path_prob)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/cell.py", line 118, in call
    h = op(states[inp])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/mix_ops.py", line 104, in call
    x = model(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/functions/pytorch_fn.py", line 183, in forward
    return self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/functions/pytorch_fn.py", line 164, in call
    output = model(output, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/vega/modules/operators/functions/pytorch_fn.py", line 392, in forward
    return super().forward(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1442, in relu
    result = torch.relu(input)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

zhangjiajin commented 2 years ago

Please provide the versions of torch, CUDA, and Python, as well as the YAML configuration file. @peter-ch

Isaspinach commented 2 months ago

Hi, I'm new to ML in general. I was wondering whether you got the CARS example running correctly. Did you have to change the yml in some way to get it working?

ntraft commented 1 month ago

@Isaspinach I abandoned this project. I don't think I made any changes to the yml? Not sure though. I did have to make one small code change to fix an error.

You are probably better off using a more recent and well-supported method though. I don't really know if there are any stable NAS libraries out there. I feel like research in that area has been less active in recent years.