huggingface / transformers-bloom-inference

Fast Inference Solutions for BLOOM
Apache License 2.0
556 stars 111 forks

How to build a web API for DeepSpeed inference #52

Closed vamsikrishnav closed 1 year ago

vamsikrishnav commented 1 year ago

Hi Mayank,

Really nice to see your work here; I appreciate what you are doing for the community. I have a question for you. Based on your code, I want to build a minimal API server using Sanic. When I use the bloom-ds-inference.py script, it runs well. However, when I add some API-related code using Sanic, I see that the server spawns automatically on all the GPUs. How do I get around this? Is there a specific approach I need to take here? Your help and input would be highly appreciated.

Regards, Vamsi

mayank31398 commented 1 year ago

So, I am not familiar with Sanic, @vamsikrishnav, but you can always use the inference_server module in this repo :) The provided Dockerfile runs a 560m model, and I have tested it with bloom-176b too.

The Makefile has the commands to host various models: https://github.com/huggingface/transformers-bloom-inference/blob/b14224d180f365c30ce0e1ba053087eac7f4ee5c/Makefile#L13-L23 Please look at the Dockerfile for the dependencies. Also note that ds-inference is currently incompatible with PyTorch 1.13; please use torch==1.12.1.

This is the launch command: https://github.com/huggingface/transformers-bloom-inference/blob/b14224d180f365c30ce0e1ba053087eac7f4ee5c/Dockerfile#L63-L68 You can change it to use the bloom-176b target.

mayank31398 commented 1 year ago

But to answer your question, it's a bit of work to do what you are looking for. The code I have written splits it into two parts: a front-end HTTP server that the user talks to, and a back-end gRPC server running on 8 different processes (one per GPU). The front-end server forwards each query it receives from the user to all 8 gRPC hosts.

This is where the subprocesses for the 8 gRPC servers are launched: https://github.com/huggingface/transformers-bloom-inference/blob/b14224d180f365c30ce0e1ba053087eac7f4ee5c/inference_server/model_handler/deployment.py#L110-L135
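
Schematically, the fan-out looks something like this (a minimal sketch using plain HTTP instead of gRPC, with hypothetical ports and endpoint, just to illustrate the pattern; the real code uses the gRPC stubs generated by make gen-proto):

from concurrent.futures import ThreadPoolExecutor

import requests

# One worker process per GPU; the port numbers here are hypothetical.
WORKER_PORTS = [50950 + i for i in range(8)]


def fan_out(payload: dict) -> dict:
    def call(port: int) -> dict:
        # Every tensor-parallel rank must receive the same request so the
        # collective ops inside the sharded model stay in sync.
        return requests.post(f"http://127.0.0.1:{port}/generate", json=payload).json()

    with ThreadPoolExecutor(max_workers=len(WORKER_PORTS)) as pool:
        results = list(pool.map(call, WORKER_PORTS))

    # All ranks produce the same output; return one copy.
    return results[0]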

mayank31398 commented 1 year ago

A UI is also launched. Take a look at the README 🤗

vamsikrishnav commented 1 year ago

The Docker build is failing for the UI... it seems the Dockerfile needs to be updated?

vamsikrishnav commented 1 year ago

Thanks for your input. I will definitely take a look!

mayank31398 commented 1 year ago

Can you post the error, @vamsikrishnav?

vamsikrishnav commented 1 year ago

Sorry, the docker run fails, not the build:

Cloning into 'transformers-bloom-inference'...
make: *** No rule to make target 'ui'.  Stop.

ERROR conda.cli.main_run:execute(47): `conda run /bin/bash -c git clone https://github.com/huggingface/transformers-bloom-inference.git &&     cd transformers-bloom-inference &&     make gen-proto &&     make ui  &&     make microsoft-bloom-176b` failed. (See above for error)
mkdir -p inference_server/model_handler/grpc_utils/pb
python -m grpc_tools.protoc -Iinference_server/model_handler/grpc_utils/proto --python_out=inference_server/model_handler/grpc_utils/pb --grpc_python_out=inference_server/model_handler/grpc_utils/pb inference_server/model_handler/grpc_utils/proto/generation.proto
find inference_server/model_handler/grpc_utils/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;
inference_server/model_handler/grpc_utils/pb/generation_pb2.py
inference_server/model_handler/grpc_utils/pb/__init__.py
inference_server/model_handler/grpc_utils/pb/generation_pb2_grpc.py
touch inference_server/model_handler/grpc_utils/__init__.py
touch inference_server/model_handler/grpc_utils/pb/__init__.py
rm -rf inference_server/model_handler/grpc_utils/pb/*.py-e
mayank31398 commented 1 year ago

Can you try bloom-176b instead of microsoft-bloom-176b? Those weights load a bit more slowly but should work. Not sure; this looks like a bug in the Makefile.

Also, I recommend downloading the model first via: https://github.com/huggingface/transformers-bloom-inference/blob/main/inference_server/download_model.py

python -m inference_server.download_model --model_name bigscience/bloom --model_class AutoModelForCausalLM
mayank31398 commented 1 year ago

nvm @vamsikrishnav, I just realized that I had accidentally removed the ui target from the Makefile. I have pushed a small fix :) Let me know if it works.

vamsikrishnav commented 1 year ago

> So, I am not familiar with Sanic, @vamsikrishnav, but you can always use the inference_server module in this repo :) The provided Dockerfile runs a 560m model, and I have tested it with bloom-176b too.
>
> The Makefile has the commands to host various models.
>
> https://github.com/huggingface/transformers-bloom-inference/blob/b14224d180f365c30ce0e1ba053087eac7f4ee5c/Makefile#L13-L23
>
> Please look at the Dockerfile for the dependencies. Also note that ds-inference is currently incompatible with PyTorch 1.13; please use torch==1.12.1. This is the launch command:
>
> https://github.com/huggingface/transformers-bloom-inference/blob/b14224d180f365c30ce0e1ba053087eac7f4ee5c/Dockerfile#L63-L68
>
> You can change it to use the bloom-176b target.

We want to use the DeepSpeed int8 model because our hardware has 8x 40GB A100s. fp16 won't work for us, since we can't load the entire model in memory (176B parameters at 2 bytes each is roughly 352 GB, more than our 8 × 40 GB = 320 GB total, whereas int8 at 1 byte per parameter needs roughly 176 GB).

vamsikrishnav commented 1 year ago

How do I run it if I have already pre-downloaded the model?

mayank31398 commented 1 year ago

You will need to modify the Makefile: change MODEL_NAME in the bloom-176b target and pass in the local path there.

vamsikrishnav commented 1 year ago

You mean for the ui: section in the Makefile, or for the MODEL_NAME under each model type?

I have changed the Makefile like so:

bloom-176b-int8:
    TOKENIZERS_PARALLELISM=false \
    MODEL_NAME='/mnt/disk1/bloom-deepspeed-inference-int8' \
    MODEL_CLASS=AutoModelForCausalLM \
    DEPLOYMENT_FRAMEWORK=ds_inference \
    DTYPE=int8 \
    MAX_INPUT_LENGTH=2048 \
    MAX_BATCH_SIZE=4 \
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    gunicorn -t 0 -w 1 -b 127.0.0.1:5000 inference_server.server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s'

Now I get this error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/lib/python3.8/runpy.py:194 in _run_module_as_main                                           │
│                                                                                                  │
│   191 │   main_globals = sys.modules["__main__"].__dict__                                        │
│   192 │   if alter_argv:                                                                         │
│   193 │   │   sys.argv[0] = mod_spec.origin                                                      │
│ ❱ 194 │   return _run_code(code, main_globals, None,                                             │
│   195 │   │   │   │   │    "__main__", mod_spec)                                                 │
│   196                                                                                            │
│   197 def run_module(mod_name, init_globals=None,                                                │
│                                                                                                  │
│ /usr/lib/python3.8/runpy.py:87 in _run_code                                                      │
│                                                                                                  │
│    84 │   │   │   │   │      __loader__ = loader,                                                │
│    85 │   │   │   │   │      __package__ = pkg_name,                                             │
│    86 │   │   │   │   │      __spec__ = mod_spec)                                                │
│ ❱  87 │   exec(code, run_globals)                                                                │
│    88 │   return run_globals                                                                     │
│    89                                                                                            │
│    90 def _run_module_code(code, init_globals=None,                                              │
│                                                                                                  │
│ /mnt/disk1/transformers-bloom-inference/inference_server/model_handler/launch.py:34 in <module>  │
│                                                                                                  │
│   31                                                                                             │
│   32                                                                                             │
│   33 if __name__ == "__main__":                                                                  │
│ ❱ 34 │   main()                                                                                  │
│   35                                                                                             │
│                                                                                                  │
│ /mnt/disk1/transformers-bloom-inference/inference_server/model_handler/launch.py:29 in main      │
│                                                                                                  │
│   26 def main():                                                                                 │
│   27 │   args = get_args()                                                                       │
│   28 │   start_inference_engine(args.deployment_framework)                                       │
│ ❱ 29 │   model = get_model_class(args.deployment_framework)(args)                                │
│   30 │   serve(model, args.ports[dist.get_rank()])                                               │
│   31                                                                                             │
│   32                                                                                             │
│                                                                                                  │
│ /mnt/disk1/transformers-bloom-inference/inference_server/models/ds_inference.py:33 in __init__   │
│                                                                                                  │
│    30 │   │   │   )                                                                              │
│    31 │   │   self.model = self.model.eval()                                                     │
│    32 │   │                                                                                      │
│ ❱  33 │   │   downloaded_model_path = get_model_path(args.model_name)                            │
│    34 │   │                                                                                      │
│    35 │   │   if args.dtype in [torch.float16, torch.int8]:                                      │
│    36 │   │   │   # We currently support the weights provided by microsoft (which are            │
│                                                                                                  │
│ /mnt/disk1/transformers-bloom-inference/inference_server/models/ds_inference.py:96 in            │
│ get_model_path                                                                                   │
│                                                                                                  │
│    93 │   config_file = "config.json"                                                            │
│    94 │                                                                                          │
│    95 │   # will fall back to HUGGINGFACE_HUB_CACHE                                              │
│ ❱  96 │   config_path = try_to_load_from_cache(model_name, config_file, cache_dir=os.getenv("T   │
│    97 │                                                                                          │
│    98 │   if config_path is not None:                                                            │
│    99 │   │   return os.path.dirname(config_path)                                                │
│                                                                                                  │
│ /mnt/disk1/inference/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:114 in     │
│ _inner_fn                                                                                        │
│                                                                                                  │
│   111 │   │   │   kwargs.items(),  # Kwargs values                                               │
│   112 │   │   ):                                                                                 │
│   113 │   │   │   if arg_name == "repo_id":                                                      │
│ ❱ 114 │   │   │   │   validate_repo_id(arg_value)                                                │
│   115 │   │   │                                                                                  │
│   116 │   │   │   elif arg_name == "token" and arg_value is not None:                            │
│   117 │   │   │   │   has_token = True                                                           │
│                                                                                                  │
│ /mnt/disk1/inference/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:166 in     │
│ validate_repo_id                                                                                 │
│                                                                                                  │
│   163 │   │   )                                                                                  │
│   164 │                                                                                          │
│   165 │   if repo_id.count("/") > 1:                                                             │
│ ❱ 166 │   │   raise HFValidationError(                                                           │
│   167 │   │   │   "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"            │
│   168 │   │   │   f" '{repo_id}'. Use `repo_type` argument if needed."                           │
│   169 │   │   )                                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/disk1/bloom-deepspeed-inference-int8'. Use `repo_type` argument if 
needed.
mayank31398 commented 1 year ago

I see. I think this is a problem with the int8 weights; those don't follow the HF repo structure. Can you list the contents of /mnt/disk1/bloom-deepspeed-inference-int8?

For the UI, you can use model_name = bigscience/bloom. The UI only uses the model name to check whether the model is encoder-decoder or decoder-only.

vamsikrishnav commented 1 year ago

That folder is just a git clone of the model repo. Interestingly, I was able to successfully run the script https://github.com/huggingface/transformers-bloom-inference/blob/main/bloom-inference-scripts/bloom-ds-inference.py with a little modification, as a standalone script that returns a single inference, and it worked.

ubuntu@ip-172-31-33-15:/mnt/disk1/bloom-deepspeed-inference-int8$ ll
total 217328948
drwxrwxr-x  3 ubuntu ubuntu       4096 Feb 17 08:01 ./
drwxrwxrwx 10 root   root         4096 Feb 17 08:16 ../
drwxrwxr-x  9 ubuntu ubuntu       4096 Feb 10 17:55 .git/
-rw-rw-r--  1 ubuntu ubuntu       1389 Feb 10 17:36 .gitattributes
-rw-rw-r--  1 ubuntu ubuntu        726 Feb 10 17:36 README.md
-rw-rw-r--  1 ubuntu ubuntu        568 Feb 10 17:36 config.json
-rw-rw-r--  1 ubuntu ubuntu        854 Feb 10 17:36 ds_inference_config.json
-rw-rw-r--  1 ubuntu ubuntu 7193348595 Feb 10 17:41 non-tp.pt
-rw-rw-r--  1 ubuntu ubuntu         85 Feb 10 17:36 special_tokens_map.json
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmp0fe0u99c
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmp42c2525w
-rw-------  1 ubuntu ubuntu 3271557120 Feb 17 07:55 tmp589vy9l0
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmp5can5q4u
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmp5mds417b
-rw-------  1 ubuntu ubuntu 3187671040 Feb 17 07:55 tmp872amqxs
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmp94sda1x9
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmp_t4hvi9l
-rw-------  1 ubuntu ubuntu 2202009600 Feb 17 08:01 tmp_v7ahou2
-rw-------  1 ubuntu ubuntu 3229614080 Feb 17 07:55 tmpawwtxh3j
-rw-------  1 ubuntu ubuntu 2820669440 Feb 17 07:55 tmpbqvs42nm
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpd1vrk1k1
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpdbapn_yk
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmpg1kvdcdf
-rw-------  1 ubuntu ubuntu 3250585600 Feb 17 07:55 tmpgilxn7aj
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmpgkg4sht0
-rw-------  1 ubuntu ubuntu 2967470080 Feb 17 07:55 tmpgn3s3cpt
-rw-------  1 ubuntu ubuntu 2726297600 Feb 17 07:55 tmpgrmyq1ym
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmpgu3pd764
-rw-------  1 ubuntu ubuntu 2757754880 Feb 17 08:01 tmpguvypme4
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmph550uarl
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpidv2g8wy
-rw-------  1 ubuntu ubuntu 2243952640 Feb 17 08:01 tmpjbdgg7vs
-rw-------  1 ubuntu ubuntu 2789212160 Feb 17 08:01 tmpkeslzspe
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmplk4bt41p
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpmd_6txu_
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpngbqw8bs
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmpnima_h1d
-rw-------  1 ubuntu ubuntu 2359296000 Feb 17 08:01 tmpo8s0bq96
-rw-------  1 ubuntu ubuntu 2579496960 Feb 17 08:01 tmpojhr3_ht
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmppzffzj8t
-rw-------  1 ubuntu ubuntu 2904555520 Feb 17 07:55 tmpqf5413ni
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpr_b8ezbu
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpt5p_4jcf
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpuwwon_cn
-rw-------  1 ubuntu ubuntu          0 Feb 17 08:01 tmpvrq6oy0i
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpws9deatw
-rw-------  1 ubuntu ubuntu          0 Feb 17 07:55 tmpwz8wozwp
-rw-------  1 ubuntu ubuntu  891289600 Feb 17 08:01 tmpy641ba34
-rw-------  1 ubuntu ubuntu 2453667840 Feb 17 08:01 tmpz5oj7lbv
-rw-rw-r--  1 ubuntu ubuntu   14500438 Feb 10 17:47 tokenizer.json
-rw-rw-r--  1 ubuntu ubuntu        222 Feb 10 17:36 tokenizer_config.json
-rw-rw-r--  1 ubuntu ubuntu 5139992291 Feb 10 17:50 tp_00_00.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:39 tp_00_01.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:39 tp_00_02.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:39 tp_00_03.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:39 tp_00_04.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:40 tp_00_05.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:42 tp_00_06.pt
-rw-rw-r--  1 ubuntu ubuntu 4728670699 Feb 10 17:50 tp_00_07.pt
-rw-rw-r--  1 ubuntu ubuntu 5139992291 Feb 10 17:46 tp_01_00.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:42 tp_01_01.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:41 tp_01_02.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:44 tp_01_03.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:46 tp_01_04.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:44 tp_01_05.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:46 tp_01_06.pt
-rw-rw-r--  1 ubuntu ubuntu 4728670699 Feb 10 17:48 tp_01_07.pt
-rw-rw-r--  1 ubuntu ubuntu 5139992291 Feb 10 17:48 tp_02_00.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:45 tp_02_01.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:44 tp_02_02.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:45 tp_02_03.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:42 tp_02_04.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:47 tp_02_05.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:44 tp_02_06.pt
-rw-rw-r--  1 ubuntu ubuntu 4728670699 Feb 10 17:48 tp_02_07.pt
-rw-rw-r--  1 ubuntu ubuntu 5139992291 Feb 10 17:47 tp_03_00.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:39 tp_03_01.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:46 tp_03_02.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:40 tp_03_03.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:42 tp_03_04.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:41 tp_03_05.pt
-rw-rw-r--  1 ubuntu ubuntu 5551123341 Feb 10 17:44 tp_03_06.pt
-rw-rw-r--  1 ubuntu ubuntu 4728670699 Feb 10 17:48 tp_03_07.pt
-rw-rw-r--  1 ubuntu ubuntu          1 Feb 16 06:03 version.txt
mayank31398 commented 1 year ago

@vamsikrishnav, can you try this branch: mayank/fix-path-for-int8? Let me know if it works. I will try to merge it today if it does :)
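
The idea of the fix is to short-circuit the hub-cache lookup in get_model_path when the model name is already a local directory, roughly like this (a sketch of the intent, not the exact diff):

import os

from huggingface_hub import try_to_load_from_cache


def get_model_path(model_name: str) -> str:
    # A local clone of the weights (e.g. /mnt/disk1/bloom-deepspeed-inference-int8)
    # is used as-is, so the path never reaches validate_repo_id.
    if os.path.isdir(model_name):
        return model_name

    # Otherwise fall back to the HF hub cache, as before.
    config_path = try_to_load_from_cache(model_name, "config.json")
    if config_path is not None:
        return os.path.dirname(config_path)

    raise ValueError(f"could not find config.json for {model_name}")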

vamsikrishnav commented 1 year ago

OK, I need one more piece of information. The UI binds locally to 127.0.0.1, and I am testing on a cloud server. How can I access the UI remotely? Alternatively, I would like to make curl requests; do you have a sample payload for an inference call?

mayank31398 commented 1 year ago

I have never used curl, but you can use server_request.py for running requests. Also, the UI takes the following arguments, which you can adjust: https://github.com/huggingface/transformers-bloom-inference/blob/0619d9acbaecf49849be3ec7717379198368f2c6/ui.py#L20-L24

server_request.py should ping the server_host and server_port.
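
For a sample payload, something like this should work in Python (the endpoint path and field names here are my recollection of what server_request.py sends, so double-check them against that script):

import requests

# Assumes the server is bound the way the Makefile targets bind gunicorn.
url = "http://127.0.0.1:5000/generate/"

payload = {
    "text": ["DeepSpeed is a", "BLOOM is a"],  # one request can carry a batch of prompts
    "max_new_tokens": 40,
}

response = requests.post(url, json=payload)
print(response.json())

A curl equivalent would POST the same JSON body with a Content-Type: application/json header.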

mayank31398 commented 1 year ago

Also, please note that queries from different users are not batched together in this repository; all queries are processed sequentially, one after the other. However, a single user can send a batch of prompts in one request (as in the example above) :)

vamsikrishnav commented 1 year ago

I just tried it; the server now loads in memory. I will try a few queries and update you on my progress. Thanks a lot, Mayank. Can I add you on LinkedIn?

mayank31398 commented 1 year ago

Sure

mayank31398 commented 1 year ago

Should we close this?