andre-bauer opened this issue 1 year ago
The `max_in_cpu` flag is activated only for training, not for inference. That means if I have 300G of RAM for a 301G model, there is no way to offload only the 1G of params to NVMe for inference 🤔? I have to offload the full 301G 😱?
@andre-bauer, unfortunately, you are correct. For inference, we currently don't support splitting the offload over DRAM and NVMe. In theory, `max_in_cpu` could be ported over to the inference code path, but we just have not had the bandwidth to do so :(.

An alternative would be to partially offload 300G to DRAM while keeping 1G in HBM. You do this by combining `model_persistence_threshold` and `param_persistence_threshold`. You should set `model_persistence_threshold` to the model partition size to pin in HBM (e.g., 1e9) and set `param_persistence_threshold` greater than the largest layer size (e.g., 2e8 for opt-66b).
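For concreteness, the workaround above might look like the following ZeRO-3 config sketch. This is an assumption about how the two thresholds map onto the JSON config (the `stage3_*` key names and the `offload_param` section are my reading of the ZeRO-3 config schema, not taken from this thread), and the threshold values are the illustrative ones from the comment, not tuned:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_model_persistence_threshold": 1e9,
    "stage3_param_persistence_threshold": 2e8
  }
}
```

With this shape, parameters below `stage3_param_persistence_threshold` (and up to `stage3_model_persistence_threshold` in aggregate) stay resident in HBM, while the rest is offloaded to DRAM.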
@tjruwase Hi, I'm a new developer here. I'm interested in contributing to this issue. Is this feature still needed and do you think this will be a good first issue? Thanks
@mimiliaogo, yes this would be a good and useful first issue. Thanks!
**Describe the bug**
I evaluate OPT-66B with ZeRO-3 and set offloading to `nvme`, which works fine, but I also increased `max_in_cpu` to 100G. In float16 I would expect that up to 200GB of host memory is used, but I get a usage of only ~16GB of my available 300GB. When setting the device in this case to `"cpu"` instead of `"nvme"`, I get the same behavior for models that exceed 300GB, like bloom-176B.

Am I missing something? How can you use nvme and cpu properly?

**ds_config**
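The original `ds_config` is not reproduced in this thread; a minimal sketch of what an NVMe param-offload section with `max_in_cpu` typically looks like in a ZeRO-3 config is shown below (the `nvme_path` and the 100G value are illustrative assumptions, not the reporter's actual settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "max_in_cpu": 1e11,
      "pin_memory": true
    }
  }
}
```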
**ds_report output**