Open wqw547243068 opened 7 months ago
Error message:
localhost: ssh: connect to host localhost port 22: Connection refused pdsh@mla****-worker: localhost: ssh exited with exit code 255 [2024-04-20 17:29:09,147] [INFO] [scheduler.py:430:clean_up] Done cleaning up exp_id = 0 on the following workers: localhost [2024-04-20 17:29:09,147] [INFO] [scheduler.py:393:run_experiment] Done running exp_id = 0, exp_name = profile_model_info, with resource = localhost:0 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:55<00:00, 55.37s/it] [2024-04-20 17:29:14,155] [ERROR] [autotuner.py:699:model_info_profile_run] The model is not runnable with DeepSpeed with error = unrecognized arguments: eyJ0cmFpbl9iYXRjaF9zaXplIjogMjU2LCAidHJhaW5fbWljcm9fYmF0Y2hfc2l6ZV9wZXJfZ3B1IjogMSwgIm9wdGltaXplciI6IHsidHlwZSI6ICJBZGFtIiwgInBhcmFtcyI6IHsibHIiOiAwLjAwMSwgImJldGFzIjogWzAuOSwgMC45OTldLCAiZXBzIjogMWUtMDh9fSwgInN0ZXBzX3Blcl9wcmludCI6IDEwLCAid2FsbF9jbG9ja19icmVha2Rvd24iOiBmYWxzZSwgIm5ub2RlIjogMSwgImF1dG90dW5pbmciOiB7ImVuYWJsZWQiOiB0cnVlLCAibW9kZWxfaW5mb19wYXRoIjogImF1dG90dW5pbmdfcmVzdWx0cy9wcm9maWxlX21vZGVsX2luZm8vbW9kZWxfaW5mby5qc29uIiwgIm1vZGVsX2luZm8iOiB7InByb2ZpbGUiOiB0cnVlfSwgIm1ldHJpY19wYXRoIjogImF1dG90dW5pbmdfcmVzdWx0cy9wcm9maWxlX21vZGVsX2luZm8vbWV0cmljcy5qc29uIn0sICJ6ZXJvX29wdGltaXphdGlvbiI6IHsic3RhZ2UiOiAzfSwgIm1lbW9yeV9icmVha19kb3duIjogZmFsc2V9 --per_device_train_batch_size 1
Code:
run.sh
deepspeed --autotuning tune train.py -p 1 --steps=200 --deepspeed ds_config.json
ds_config.json
{ "train_batch_size" : 256, "train_micro_batch_size_per_gpu" : 8, "optimizer": { "type": "Adam", "params": { "lr": 0.001, "betas": [ 0.9, 0.999 ], "eps": 1e-8 } }, "steps_per_print" : 10, "wall_clock_breakdown" : false, "nnode":1, "autotuning": { "enabled": true, "arg_mappings": { "train_micro_batch_size_per_gpu": "--per_device_train_batch_size", "gradient_accumulation_steps": "--gradient_accumulation_steps" } } }
Error message:
Code:
run.sh
ds_config.json