microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[doc] configuring `offload_*` param sections #1005

Open stas00 opened 3 years ago

stas00 commented 3 years ago

https://github.com/microsoft/DeepSpeed/issues/998 tackles the aio param section, but we still have no user guide for the new "offload_optimizer" and "offload_param" sections. We have:

            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 4,
                "fast_init": false
            },
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 5,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            }

Other than device, nvme_path, and pin_memory, which are fairly self-explanatory, the remaining options have very terse descriptions, and a user will have no idea how to configure them. Let's write a guide on how these values should be chosen.
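For readers landing here without context, this is a minimal sketch of where the two sections sit in a full config file. The values are copied from the snippet above; the enclosing "zero_optimization" block with "stage": 3, the helper name write_config, and the file name ds_config.json are assumptions made for illustration, not a recommended setup.

```python
import json

# Values copied from the snippet in this issue. The enclosing
# "zero_optimization" block and "stage": 3 are assumptions of this sketch.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 4,
            "fast_init": False,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9,
        },
    }
}


def write_config(config, path="ds_config.json"):
    """Serialize the config dict to a JSON file and return the path (hypothetical helper)."""
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return path
```

Calling write_config(ds_config) produces a JSON file that can then be extended with the remaining sections (optimizer, fp16, aio, and so on) that a real run would need.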

I copied the descriptions and defaults that already exist and tried to ask the right questions, so if you could answer those I think that would be a great start.

Thank you!

Optimizer

buffer_count
Q: Why "at least"? Is it more efficient to make it bigger?
Q: What is the impact on the memory footprint (CPU/NVMe)?

fast_init
Q: Why is it false by default?

Param

buffer_count
Q: Why 5? How does it correlate with the other params?

buffer_size
Q: How do we arrive at this number, and how does it correlate with the other config params?
Q: What is the impact on the memory footprint (CPU/NVMe)?

max_in_cpu
Q: How do we arrive at this number, and how does it correlate with the other config params?
Q: What is the impact on the memory footprint (CPU/NVMe)?
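Until these questions are answered, one way to start reasoning about the footprint is a back-of-the-envelope estimate. The sketch below rests on assumptions that are not confirmed in this thread: that buffer_size and max_in_cpu are counted in elements rather than bytes, that the param buffer pool holds buffer_count buffers of buffer_size elements each, and that each element takes 2 bytes (fp16). The function name offload_param_cpu_bytes is hypothetical, and whether the totals apply per process or per node is left open.

```python
def offload_param_cpu_bytes(buffer_count, buffer_size, max_in_cpu, bytes_per_element=2):
    """Rough CPU-side byte estimate for the offload_param section.

    Assumes (unverified) that sizes are element counts and that the
    buffer pool is buffer_count * buffer_size elements in total.
    """
    buffer_pool = int(buffer_count) * int(buffer_size) * bytes_per_element
    resident = int(max_in_cpu) * bytes_per_element
    return {"buffer_pool_bytes": buffer_pool, "max_in_cpu_bytes": resident}


# Values taken from the offload_param snippet in this issue.
estimate = offload_param_cpu_bytes(buffer_count=5, buffer_size=1e8, max_in_cpu=1e9)
for name, nbytes in estimate.items():
    print(f"{name}: {nbytes / 2**30:.2f} GiB")
```

If the assumptions hold, raising buffer_count or buffer_size scales the first number linearly, which would be consistent with reports of large values exhausting host memory.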

fahadh4ilyas commented 1 year ago

Is there any guidance on how to set those params? I keep getting OOM, but I don't understand the parameters in the config. The documentation didn't help much either.

koesnow commented 1 year ago

I am also looking for guidelines on setting those params. They do seem to have a meaningful impact on performance on my server setup, but setting those values too high kills the whole system.

xuanyaoming commented 9 months ago

> I am also looking for guidelines on setting those params. They do seem to have a meaningful impact on performance on my server setup, but setting those values too high kills the whole system.

Same issue here. Even reading the paper doesn't help at all. Is there any documentation yet that explains what these params do?