microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] offloading section in config file never carried to autotuner #3379

Open cxxz opened 1 year ago

cxxz commented 1 year ago

Describe the bug

I tried to enable offloading in the zero2_auto.json file with the following lines:

  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    ...
  }
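For context, a minimal sketch of how this config looks when combined with an autotuning section (the "autotuning" field names follow the DeepSpeed autotuning README; the values here are illustrative, not copied from my actual zero2_auto.json):

  {
    "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      }
    },
    "autotuning": {
      "enabled": true
    }
  }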

It works fine for normal runs without the --autotune flag. However, once I use deepspeed --autotune, none of the automatically generated .json files contain the offload_optimizer section; for example, a sample ds_config.json produced by the autotuner omits it entirely.

This contradicts what is stated in the README: "Currently, the DeepSpeed Autotuner does not tune offloading behaviors but instead uses the values defined in the offload section of the DeepSpeed configuration file." [https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/autotuning#offloading-and-nvme]

To Reproduce

git clone https://github.com/cxxz/llama_deepspeed_autotune.git
cd llama_deepspeed_autotune
./run_autotune_llama_4A100.sh

Expected behavior

All ds_config.json files generated during the search should include the offload section.
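One way to check this after an autotuning run (the autotuning_exps/ directory name and the per-experiment JSON layout are assumptions about the autotuner defaults, not confirmed from my logs):

  # list generated DeepSpeed configs that do NOT contain the offload section
  grep -rL "offload_optimizer" --include="*.json" autotuning_exps/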

System info (please complete the following information):

cli99 commented 1 year ago

@cxxz, can you try the latest transformers and accelerate libraries? I cannot reproduce the error on my end; the offload section is included in my test. Thanks.

cxxz commented 1 year ago

Thank you for responding to my request. I installed the development versions with pip install git+https://github.com/huggingface/transformers and pip install git+https://github.com/huggingface/accelerate, as confirmed by pip freeze. However, after rerunning run_autotune_llama_4A100.sh, the offload section was still not carried over to the ds_config.json files in any of the attempts. The complete log is documented in the repository. Any hint on what settings might have gone wrong?
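For reference, the installed versions can be confirmed with something like this (grep pattern illustrative; the actual output is in the log committed to the repository):

  # show the relevant packages among the installed versions
  pip freeze | grep -iE "transformers|accelerate|deepspeed"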