Emerald01 opened this issue 1 year ago
@Emerald01 kernel injection should be supported for the OPT 175B variant. However, I don't currently have access to the model in order to test that. You may need to enable load_with_sys_mem
in order to load the model without seeing an OOM error (https://github.com/microsoft/DeepSpeed-MII/blob/9ec2f12baf87e950ea48b290c7c3b2b9c59549cd/mii/config.py#L40). However, this would still require that your system have a very large amount of system memory available.
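As a concrete sketch of what enabling that flag looks like: the config below mirrors the `mii.deploy` usage shown later in this thread, with `load_with_sys_mem` added; `"path/to/weights"` is a placeholder for a local checkpoint directory, and the call is wrapped in a function since it requires DeepSpeed-MII and the weights to actually run.

```python
# Sketch: MII config with load_with_sys_mem enabled, so weights are
# staged through (plentiful) system RAM instead of GPU memory.
mii_configs = {
    "tensor_parallel": 8,
    "dtype": "fp16",
    "load_with_sys_mem": True,  # flag from mii/config.py linked above
}

def deploy_opt():
    # Requires deepspeed-mii installed; wrapped so the sketch imports cleanly.
    import mii
    mii.deploy(
        task="text-generation",
        model="path/to/weights",   # placeholder: local OPT checkpoint dir
        deployment_name="opt",
        mii_config=mii_configs,
    )
```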
I think the proper solution is to enable loading the larger OPT models with meta tensors. This is possible with DeepSpeed-Inference, but we don't have this enabled in MII currently.
Additionally, the configs for ZeRO are the same as in DeepSpeed (we pass the config dict directly to DeepSpeed in MII). You can find an explanation of these configs here: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
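To make "passed directly" concrete, here is a sketch of how a ZeRO dict travels through `mii.deploy`. The `enable_deepspeed`, `enable_deepspeed_zero`, and `ds_config` keyword names follow MII's bundled ZeRO example script and should be treated as assumptions for other versions; the deploy call is wrapped in a function since it needs DeepSpeed-MII installed.

```python
# Sketch: MII forwards this dict unmodified to DeepSpeed, so any key from
# the DeepSpeed config-json docs is valid here.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
    },
}

def deploy_with_zero():
    # Requires deepspeed-mii; not executed here.
    import mii
    mii.deploy(
        task="text-generation",
        model="facebook/opt-66b",
        deployment_name="opt-zero",
        mii_config={"dtype": "fp16"},
        enable_deepspeed=False,      # assumed flag: skip kernel injection
        enable_deepspeed_zero=True,  # assumed flag: use the ZeRO path
        ds_config=ds_config,
    )
```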
Update on this: meta tensor support with OPT is now possible in MII with #199
Here's an example with OPT-66B:
```python
import mii

mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "meta_tensor": True}
mii.deploy(task="text-generation", model="facebook/opt-66b", deployment_name="opt", mii_config=mii_configs)
```
If you have the 175B weights in a local directory, you can provide that directory path for model="path/to/weights"
and it should work. Please let me know if you run into any problems.
Good Day,
Following up on the conversation above. I have been learning to use DeepSpeed-MII, specifically its ZeRO capabilities. I have been trying to serve gpt-neox-20b on a 3090 with either CPU offload or NVMe offload enabled, but I keep hitting OOM errors like the one below. Is what I am trying to do even feasible? Our system has 256 GB of RAM. See my ds_config below the error message.
```
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.69 GiB total capacity; 22.91 GiB already allocated; 41.75 MiB free; 22.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-06 23:47:47,324] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 172184
[2023-07-06 23:47:47,325] [ERROR] [launch.py:321:sigkill_handler] ['/home/darth/mldev/ngc/inference/GPT-NeoX/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'EleutherAI/gpt-neox-20b', '--model-path', '/home/darth/mldev/huggingface/hub', '--port', '50050', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswXSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0=', '--ds-zero', '--ds-config', '/home/darth/mldev/huggingface/temp_config.json'] exits with return code = 1
```
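As a side note, the `--config` argument in that launcher command is just base64-encoded JSON, so it can be decoded to check exactly what settings the server received:

```python
import base64
import json

# The --config value from the launcher log above (base64-encoded JSON).
encoded = "eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswXSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0="

config = json.loads(base64.b64decode(encoded))
print(config["load_with_sys_mem"])           # True: weights staged via system RAM
print(config["replace_with_kernel_inject"])  # True: kernel injection enabled
```

Decoding it shows `replace_with_kernel_inject` is true alongside the `--ds-zero` flag, which may be worth double-checking against the intended ZeRO offload path.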
```python
# model_hidden_size must be defined before building the config;
# for EleutherAI/gpt-neox-20b the hidden size is 6144.
model_hidden_size = 6144

ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 16,
        "pin_memory": True,
        "thread_count": 8,
        "single_submit": False,
        "overlap_events": True
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu"
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    }
}
```
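Regarding feasibility, a back-of-the-envelope estimate of the fp16 weight footprint (my numbers, not from the thread) shows why offload is mandatory on this hardware:

```python
# Back-of-the-envelope memory estimate for gpt-neox-20b in fp16.
params = 20e9                 # ~20B parameters
bytes_per_param = 2           # fp16
weights_gb = params * bytes_per_param / 1e9

gpu_gb = 24                   # RTX 3090
cpu_gb = 256                  # system RAM on this machine

print(f"fp16 weights: ~{weights_gb:.0f} GB")                          # ~40 GB
print(f"fits on one 3090 without offload: {weights_gb < gpu_gb}")     # False
print(f"fits in system RAM with CPU offload: {weights_gb < cpu_gb}")  # True
```

So ZeRO-3 with `offload_param` to CPU is a plausible direction for this hardware; the OOM above suggests the full model is still being placed on the GPU rather than offloaded.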
According to the support list, it seems that DeepSpeed-MII only supports OPT up to 66B. What does that mean for the 175B model? Does it mean there is no kernel injection available for the 175B model?
If so, I guess splitting the model across multiple devices is not available, and a single GPU cannot load it because it is too large. By the way, if that is the case, I would like to know whether ZeRO-Inference is a good alternative. I see this example for gpt2, https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/text-generation-zero-example.py, but it has many very granular configuration options deciding the GPU blocks or bucket sizes, and they have no docstrings explaining them, so I am not sure whether that config applies generically to, for example, OPT-175B.
Thank you!
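On the ZeRO-Inference question above: the same fp16 sizing arithmetic applied to OPT-175B (a rough sketch, my numbers) indicates which offload target would be required:

```python
# fp16 memory footprint of OPT-175B weights, to judge offload targets.
params = 175e9
weights_gb = params * 2 / 1e9   # fp16 = 2 bytes/param

print(f"~{weights_gb:.0f} GB of fp16 weights")  # ~350 GB
# 350 GB exceeds both a single GPU (tens of GB) and 256 GB of system RAM,
# so ZeRO-Inference would need NVMe offload (offload_param device "nvme")
# or enough aggregate memory across many GPUs/hosts.
```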