microsoft / DeepSpeedExamples

Example models using DeepSpeed

[Error] AutoTune: `connect to host localhost port 22: Connection refused` #894

Open wqw547243068 opened 7 months ago

wqw547243068 commented 7 months ago

Error log:

localhost: ssh: connect to host localhost port 22: Connection refused
pdsh@mla****-worker: localhost: ssh exited with exit code 255
[2024-04-20 17:29:09,147] [INFO] [scheduler.py:430:clean_up] Done cleaning up exp_id = 0 on the following workers: localhost
[2024-04-20 17:29:09,147] [INFO] [scheduler.py:393:run_experiment] Done running exp_id = 0, exp_name = profile_model_info, with resource = localhost:0
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:55<00:00, 55.37s/it]
[2024-04-20 17:29:14,155] [ERROR] [autotuner.py:699:model_info_profile_run] The model is not runnable with DeepSpeed with error = unrecognized arguments: eyJ0cmFpbl9iYXRjaF9zaXplIjogMjU2LCAidHJhaW5fbWljcm9fYmF0Y2hfc2l6ZV9wZXJfZ3B1IjogMSwgIm9wdGltaXplciI6IHsidHlwZSI6ICJBZGFtIiwgInBhcmFtcyI6IHsibHIiOiAwLjAwMSwgImJldGFzIjogWzAuOSwgMC45OTldLCAiZXBzIjogMWUtMDh9fSwgInN0ZXBzX3Blcl9wcmludCI6IDEwLCAid2FsbF9jbG9ja19icmVha2Rvd24iOiBmYWxzZSwgIm5ub2RlIjogMSwgImF1dG90dW5pbmciOiB7ImVuYWJsZWQiOiB0cnVlLCAibW9kZWxfaW5mb19wYXRoIjogImF1dG90dW5pbmdfcmVzdWx0cy9wcm9maWxlX21vZGVsX2luZm8vbW9kZWxfaW5mby5qc29uIiwgIm1vZGVsX2luZm8iOiB7InByb2ZpbGUiOiB0cnVlfSwgIm1ldHJpY19wYXRoIjogImF1dG90dW5pbmdfcmVzdWx0cy9wcm9maWxlX21vZGVsX2luZm8vbWV0cmljcy5qc29uIn0sICJ6ZXJvX29wdGltaXphdGlvbiI6IHsic3RhZ2UiOiAzfSwgIm1lbW9yeV9icmVha19kb3duIjogZmFsc2V9 --per_device_train_batch_size 1
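
Two separate failures show up in this log. The first, `connect to host localhost port 22: Connection refused`, means the launcher's pdsh runner tried to SSH into localhost and nothing was listening on port 22: the autotuner schedules each experiment through pdsh over SSH even on a single machine, so the node needs a running sshd and passwordless SSH to itself. A typical fix, assuming a Debian/Ubuntu-style system (package and service names may differ on yours):

# install and start an SSH server
sudo apt-get install -y openssh-server
sudo service ssh start
# set up passwordless SSH from the node to itself
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# verify: should print the hostname without prompting for a password
ssh -o BatchMode=yes localhost hostname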

Code:

deepspeed --autotuning tune train.py -p 1 --steps=200 --deepspeed ds_config.json
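
The second failure is the `unrecognized arguments` error above: when the autotuner re-launches this command, it appears to swap the `ds_config.json` argument for a base64-encoded copy of the experiment's config and to append the flags listed under `arg_mappings`. Decoding the blob from the log bears this out; it is the profiling config, with `train_micro_batch_size_per_gpu` forced to 1 and ZeRO stage 3 enabled. A small helper to inspect it (the script name is made up):

# decode_blob.py -- pretty-print the autotuner's base64-encoded config
# usage: python decode_blob.py <blob-from-the-error-log>
import base64, json, sys

blob = sys.argv[1]
blob += "=" * (-len(blob) % 4)  # restore any stripped base64 padding
print(json.dumps(json.loads(base64.b64decode(blob)), indent=2))

Since train.py rejects both the encoded config and `--per_device_train_batch_size`, the remaining fix is in the script's argument parsing; see the sketch after the config below.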

ds_config.json

{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 8,

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  },

  "steps_per_print": 10,
  "wall_clock_breakdown": false,

  "nnode": 1,

  "autotuning": {
    "enabled": true,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--per_device_train_batch_size",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  }
}
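
The config itself points at the cause: `arg_mappings` promises the autotuner that train.py understands `--per_device_train_batch_size` and `--gradient_accumulation_steps`, yet the run dies with those flags unrecognized. A minimal argparse setup that would accept everything the autotuner passes, assuming train.py builds its own parser (a sketch, not the actual script; defaults are illustrative):

# hypothetical sketch of train.py's argument handling
import argparse

parser = argparse.ArgumentParser()
# flags from the original launch command (their real meanings live in train.py)
parser.add_argument("-p", type=int, default=1)
parser.add_argument("--steps", type=int, default=200)
# flags the autotuner injects via arg_mappings -- the error shows
# --per_device_train_batch_size is currently undefined in train.py
parser.add_argument("--per_device_train_batch_size", type=int, default=8)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
# --deepspeed must take a value: the autotuner replaces the ds_config.json
# path after it with the base64-encoded config seen in the log
parser.add_argument("--deepspeed", type=str, default=None)
args = parser.parse_args()

If train.py instead uses DeepSpeed's own `deepspeed.add_config_arguments(parser)`, note that that helper defines `--deepspeed` as a boolean flag and takes the config through `--deepspeed_config`, which would equally explain the encoded config ending up in the unrecognized list.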