livepeer / ai-worker

https://livepeer.ai
MIT License

Add LLM Pipeline #137

Closed: kyriediculous closed this PR 1 day ago

ad-astra-video commented 1 week ago

@rickstaa I have reviewed this and confirmed it works. The code needed to be rebased to pick up the code-gen updates from recent SDK releases. @kyriediculous can update this PR, or we can move to the other PR.

Some brief research showed there are other implementations for serving LLM pipelines, which was also briefly discussed with @kyriediculous. We settled on researching and testing alternative implementations if the need arises from user feedback. The LLM SPE will continue to support and enhance this pipeline to suit the network's requirements for the LLM pipeline as the network evolves.

Notes from review/testing:

There were only a couple of small changes I made in addition to the changes needed to rebase this PR:

  1. Moved check_torch_cuda.py to the dev folder, since it only provides a helper to check the CUDA version (see the first sketch after this list).
  2. Fixed the logic for returning containers for managed containers. For streamed responses, the container was returned right after the streamed response was started. This allowed another request to land on the same GPU and could significantly slow down the first request that was still processing. I would suggest we start with one request in flight per GPU for managed containers (see the second sketch after this list) and target a future enhancement that raises this limit once thorough testing and documentation of multiple requests in flight on one GPU can be completed.
    • Note: external containers are not limited to one request in flight at a time. External containers are expected to have their own load-balancing logic and return a 500 error when overloaded (see the last sketch after this list). Also, external containers slow down in tokens/second as each concurrent request is added; I saw connections closing or timing out when overloading a GPU too much while testing locally with 5 concurrent requests on a 3080.
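
For reference on item 1, a helper like check_torch_cuda.py usually only needs to report what PyTorch was built against and what the runtime sees. A minimal sketch (the exact contents of the moved script are not shown in this thread):

```python
# dev/check_torch_cuda.py -- minimal sketch of a CUDA sanity-check helper.
# The actual script in this PR may differ; this only illustrates the idea.
import torch

if __name__ == "__main__":
    print(f"PyTorch version: {torch.__version__}")
    print(f"Built with CUDA: {torch.version.cuda}")
    print(f"CUDA available:  {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```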
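
To illustrate the streamed-response fix in item 2, here is a minimal sketch of the intended ordering. The `borrow_container`/`return_container` helpers are hypothetical placeholders, not the repo's actual manager API; the point is that the container is only handed back to the pool after the stream is exhausted, so a second request cannot land on the same GPU mid-stream.

```python
# Sketch only: `pool.borrow_container` and `pool.return_container` are
# hypothetical placeholders for the managed-container pool, not the real API.
async def stream_llm_response(pool, request):
    container = await pool.borrow_container()  # one request in flight per GPU
    try:
        async for token in container.generate_stream(request):
            yield token
    finally:
        # Return the container only after the stream has fully completed
        # (or errored), not right after the stream is started.
        await pool.return_container(container)
```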
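
The expectation for external containers in the note above could be met with a simple concurrency cap that sheds load with an error response once the GPU is saturated. A hedged FastAPI sketch; the limit of 2, the endpoint path, and `run_generation` are illustrative, not part of this PR:

```python
# Sketch of load shedding for an *external* LLM container (not code from this PR).
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_IN_FLIGHT = 2                      # illustrative limit; tune per GPU/model
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.post("/llm")
async def llm(request: dict):
    if _slots.locked():                # all slots taken: refuse instead of queueing
        raise HTTPException(status_code=500, detail="GPU overloaded")
    async with _slots:
        return await run_generation(request)   # hypothetical generation call
```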
kyriediculous commented 4 days ago

All comments have been addressed and the commit history has been cleaned up.

2024-09-27 13:32:48,339 INFO:     Started server process [1]
2024-09-27 13:32:48,339 INFO:     Waiting for application startup.
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Local model path: /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Directory contents: ['snapshots', 'refs', 'blobs']
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Using fp16/bf16 precision
2024-09-27 13:32:55,798 - app.pipelines.llm - INFO - Max memory configuration: {0: '23GiB', 1: '23GiB', 'cpu': '26GiB'}
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  6.13it/s]
2024-09-27 13:33:04,805 - app.pipelines.llm - INFO - Model loaded and distributed. Device map: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/pydantic/_internal/_fields.py:160: UserWarning: Field "model_id" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
2024-09-27 13:33:04,869 - app.main - INFO - Started up with pipeline LLMPipeline(model_id=meta-llama/Meta-Llama-3.1-8B-Instruct)
2024-09-27 13:33:04,869 INFO:     Application startup complete.
2024-09-27 13:33:04,870 INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
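
The startup log above shows the checkpoint being sharded across two GPUs via a max-memory map and an automatic device map. In `transformers`, that kind of placement typically comes from `from_pretrained` with `device_map="auto"` and a `max_memory` dict, roughly as sketched below; the model ID and memory limits are taken from the log, the rest is an assumption about how the pipeline is wired, not the PR's exact code.

```python
# Sketch of multi-GPU loading consistent with the log above; not the PR's exact code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
max_memory = {0: "23GiB", 1: "23GiB", "cpu": "26GiB"}   # values from the log

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # "fp16/bf16 precision" per the log
    device_map="auto",            # lets accelerate split layers across GPUs
    max_memory=max_memory,
)
print(model.hf_device_map)        # e.g. {'model.embed_tokens': 0, ..., 'lm_head': 1}
```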
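
The `model_id` warning in the log comes from Pydantic v2's protected `model_` namespace. As the warning itself suggests, it can be silenced on the affected models by clearing the protected namespaces; a small sketch (the model name here is illustrative, not taken from the PR):

```python
# Silencing the Pydantic v2 "protected namespace" warning for fields named model_*.
from pydantic import BaseModel, ConfigDict

class LLMRequest(BaseModel):      # illustrative model name, not from the PR
    model_config = ConfigDict(protected_namespaces=())
    model_id: str
    prompt: str
```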