Open stas00 opened 3 years ago
Is there any way to set those params? I keep getting OOM, but I don't understand the parameters in the config. The documentation also didn't help much.
I am also looking for guidelines to set those params. They do seem to have a meaningful impact on performance on my server setup, but setting those values too high kills the whole system.
Same issue here. Even reading the paper doesn't help at all. Is there documentation explaining what these params do yet?
https://github.com/microsoft/DeepSpeed/issues/998 tackles the `aio` param section, but we still have no user guide for the new `offload_optimizer` and `offload_param` sections. We have: other than `device`, `nvme_path` and `pin_memory`, which are pretty obvious, the rest have super-terse descriptions and a user will have no idea how to configure those. Let's write a guide to how these values should be chosen.

I copied the descriptions and defaults that already exist and tried to ask the right questions, so if you could answer those I think that would be a great start.
Thank you!
Optimizer

`buffer_count`: default `4`: Number of buffers in buffer pool for optimizer state offloading to NVMe. This should be at least the number of states maintained per parameter by the optimizer. For example, Adam optimizer has 4 states (parameter, gradient, momentum, and variance).
Q: why "at least" - is it more efficient to have it bigger?
Q: what's the impact on memory footprint (CPU/NVMe)?

`fast_init`: default `false`. Enable fast optimizer initialization when offloading to NVMe.
Q: why is it false by default?
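For orientation, here is a minimal sketch of where these keys sit in a ZeRO stage 3 config, written as a Python dict of the kind that can be passed as the DeepSpeed config. The values are the defaults quoted above. The `/local_nvme` path is a placeholder, and `check_buffer_count` is a hypothetical helper that only encodes the "at least the number of states" rule from the description; it is not part of DeepSpeed.

```python
import json

# The four per-parameter states listed in the description for Adam.
ADAM_STATES = ("parameter", "gradient", "momentum", "variance")


def check_buffer_count(buffer_count, states=ADAM_STATES):
    """Hypothetical check: buffer_count must cover every optimizer state."""
    return buffer_count >= len(states)


ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path, adjust to your system
            "pin_memory": True,
            "buffer_count": 4,   # documented default
            "fast_init": False,  # documented default
        },
    },
}

opt = ds_config["zero_optimization"]["offload_optimizer"]
if not check_buffer_count(opt["buffer_count"]):
    raise ValueError("buffer_count is smaller than the number of optimizer states")

print(json.dumps(ds_config, indent=2))
```

Whether a larger `buffer_count` actually helps throughput is exactly the open question above; the sketch only guards against going below the documented minimum.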
Param

`buffer_count`: default `5`: Number of buffers in buffer pool for parameter offloading to NVMe.
Q: why 5, what are the correlations to other params?

`buffer_size`: default `1e8`: Size of buffers in buffer pool for parameter offloading to NVMe.
Q: how do we get to this number and how it correlates with other config params?
Q: what's the impact on memory footprint (CPU/NVMe)?

`max_in_cpu`: default `1e9`: Number of parameter elements to maintain in CPU memory when offloading to NVMe is enabled.
Q: how do we get to this number and how it correlates with other config params?
Q: what's the impact on memory footprint (CPU/NVMe)?
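To make the memory-footprint questions concrete, here is a rough back-of-the-envelope sketch. It rests on assumptions the descriptions above do not confirm: that the parameter buffer pool occupies roughly `buffer_count * buffer_size` elements of CPU memory, that `max_in_cpu` elements are held in CPU memory in addition to the pool, and that each element is 2 bytes (fp16). The function name and its `bytes_per_element` argument are illustrative only.

```python
def estimate_param_offload_cpu_bytes(
    buffer_count=5,           # documented default
    buffer_size=int(1e8),     # documented default, in elements
    max_in_cpu=int(1e9),      # documented default, in elements
    bytes_per_element=2,      # assumption: fp16 parameters
):
    """Rough CPU memory estimate under the assumptions stated above."""
    pool = buffer_count * buffer_size * bytes_per_element
    resident = max_in_cpu * bytes_per_element
    return {"buffer_pool": pool, "max_in_cpu": resident, "total": pool + resident}


est = estimate_param_offload_cpu_bytes()
for name, nbytes in est.items():
    print(f"{name}: {nbytes / 1e9:.1f} GB")
```

If these assumptions hold, the defaults would cost a few GB of CPU memory per process, and doubling `buffer_size` would double the pool's share. The maintainers' answers to the questions above should replace this estimate once available.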