microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

serving OPT 175B #149

Open Emerald01 opened 1 year ago

Emerald01 commented 1 year ago

According to the support list, it seems that DS MII only supports OPT up to 66B. What does that mean for the 175B model? Does it mean there is no kernel injection available for the 175B model?

If so, I guess splitting the model across multiple devices is not available, and a single GPU cannot load it because it is too large. BTW, if that is the case, I would like to know whether ZeRO-Inference is a good alternative. I see this example for gpt2: https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/text-generation-zero-example.py, but it has many very granular configuration options that control things like GPU blocks or bucket sizes, with no docstrings to explain them, so I am not sure whether that config generalizes to, say, OPT-175B.

Thank you!

mrwyattii commented 1 year ago

@Emerald01 kernel injection should be supported for the OPT 175B variant. However, I don't currently have access to the model in order to test that. You may need to enable load_with_sys_mem in order to load the model without seeing an OOM error (https://github.com/microsoft/DeepSpeed-MII/blob/9ec2f12baf87e950ea48b290c7c3b2b9c59549cd/mii/config.py#L40). Note that this still requires your system to have a very large amount of system memory available.
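A minimal sketch of what enabling that flag looks like, assuming the mii.deploy API shown later in this thread (the model name here is only a stand-in):

import mii

# load_with_sys_mem stages the checkpoint in host RAM before moving it to the GPUs,
# which avoids GPU OOM while loading but requires enough system memory for the full model.
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "load_with_sys_mem": True}
mii.deploy(task="text-generation",
           model="facebook/opt-66b",   # or a local path to the 175B weights
           deployment_name="opt",
           mii_config=mii_configs)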

I think the proper solution is to enable loading the larger OPT models with meta tensors. This is possible with DeepSpeed-Inference, but we don't have this enabled in MII currently.
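For illustration only (not the MII API), this is roughly what meta tensor loading looks like with DeepSpeed-Inference directly; the checkpoint json path is hypothetical and argument names may vary across DeepSpeed versions:

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-66b")

# Build the model on the meta device so no real weights are materialized yet.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# init_inference then streams the real weights in from a checkpoint description
# file, so the full model never has to fit in one process's memory at load time.
model = deepspeed.init_inference(model,
                                 mp_size=8,
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=True,
                                 checkpoint="path/to/checkpoints.json")  # hypothetical path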

Additionally, the configs for ZeRO are the same as in DeepSpeed (we pass the config dict directly to DeepSpeed in MII). You can find an explanation of these configs here: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
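In other words, the ds_config handed to MII is an ordinary DeepSpeed config dict. A minimal sketch of the ZeRO-3 parameter-offload portion (values illustrative, keys from the linked docs); in the linked zero example this dict is passed to mii.deploy through its ds_config argument:

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,               # ZeRO stage 3: partition parameters across ranks
        "offload_param": {
            "device": "cpu",      # keep parameters in host memory, fetch to GPU on demand
            "pin_memory": True,
        },
    },
}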

mrwyattii commented 1 year ago

Update on this: meta tensor support with OPT is now possible in MII with #199

Here's an example with OPT-66B:

import mii

# Deploy OPT-66B across 8 GPUs in fp16, loading the checkpoint via meta tensors
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "meta_tensor": True}
mii.deploy(task="text-generation",
           model="facebook/opt-66b",
           deployment_name="opt",
           mii_config=mii_configs)

If you have the 175B weights in a local directory, you can pass that directory path as model="path/to/weights" and it should work. Please let me know if you run into any problems.
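Not from the original reply, but for completeness, a sketch of querying such a deployment once it is up, using MII's query handle API of that era (the prompt and generation kwargs are illustrative):

import mii

# Connect to the "opt" deployment created above and send a generation request.
generator = mii.mii_query_handle("opt")
result = generator.query({"query": ["DeepSpeed is"]}, do_sample=True, max_new_tokens=64)
print(result)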

moussaba commented 1 year ago

Good Day,

Following up on the conversation above: I have been learning to use deepspeed-mii, specifically its ZeRO capabilities. I have been trying to serve gpt-neox-20b on a 3090 with either CPU offload or NVMe offload enabled, but I keep hitting OOM errors like the one below. Is what I am trying to do even feasible? Our system has 256GB of RAM. See my ds_config below the error message.

return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.69 GiB total capacity; 22.91 GiB already allocated; 41.75 MiB free; 22.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-06 23:47:47,324] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 172184
[2023-07-06 23:47:47,325] [ERROR] [launch.py:321:sigkill_handler] ['/home/darth/mldev/ngc/inference/GPT-NeoX/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'EleutherAI/gpt-neox-20b', '--model-path', '/home/darth/mldev/huggingface/hub', '--port', '50050', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswXSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0=', '--ds-zero', '--ds-config', '/home/darth/mldev/huggingface/temp_config.json'] exits with return code = 1
model_hidden_size = 6144  # assumed hidden size of GPT-NeoX-20B; not defined in the original snippet
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 16,
        "pin_memory": True,
        "thread_count": 8,
        "single_submit": False,
        "overlap_events": True
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    }
}
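For reference, a hedged sketch (not from the original comment) of how the offload_param section above could target NVMe instead of CPU, using keys documented in the DeepSpeed config reference; the path and buffer sizes are illustrative, not tuned:

nvme_offload_param = {
    "device": "nvme",
    "nvme_path": "/local_nvme",   # hypothetical mount point for a fast local NVMe drive
    "pin_memory": True,
    "buffer_count": 5,
    "buffer_size": 1e8,
    "max_in_cpu": 1e9,
}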