michaelfeil / infinity

Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali
https://michaelfeil.github.io/infinity/
MIT License

[Doc] Documentation on how to run infinity on AWS Inf2 #408

Open marcomarinodev opened 1 month ago

marcomarinodev commented 1 month ago

Feature request

Hello, I would like to know whether there is any configuration I need to make to run Infinity as a Docker container on an AWS inf2 instance. I tried the following command, but the models run on the CPU and do not use the Neuron accelerators.

sudo docker run -p 7997:7997 \
              -v /bin/data:/data \
              --privileged \
              -d --restart=always \
              michaelf34/infinity:0.0.52-fa \
              v2 \
              --port 7997 \
              --model-id sentence-transformers/all-MiniLM-L6-v2 \
              --model-id Alibaba-NLP/gte-Qwen2-1.5B-instruct
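Separately from the device question, a quick way to confirm the served models respond is to query the REST endpoint. A minimal client sketch (assumptions: Infinity's OpenAI-style POST /embeddings route and the default port 7997 configured above; build_payload and embed are helper names chosen here, not part of Infinity):

```python
# Minimal client sketch. Assumes an OpenAI-style POST /embeddings endpoint
# on port 7997, matching the docker run command above.
import json
import urllib.request


def build_payload(texts: list[str], model: str) -> bytes:
    """Build the JSON request body for the /embeddings endpoint."""
    return json.dumps({"model": model, "input": texts}).encode()


def embed(texts: list[str], model: str, base_url: str = "http://localhost:7997"):
    """Send an embedding request to a running Infinity server."""
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=build_payload(texts, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running server):
# result = embed(["hello world"], "sentence-transformers/all-MiniLM-L6-v2")
```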

Motivation

The embedding models do not take advantage of the available Neuron accelerators; they fall back to the CPU instead.

Your contribution

I can test it on my own ec2 inf2 instances and contribute to any improvements

tsensei commented 1 month ago

@marcomarinodev You'll need to mount your accelerator with --gpus all, but first make sure the NVIDIA Container Toolkit is installed and configured.

tsensei commented 1 month ago

Correction: the NVIDIA Container Toolkit applies if you are using NVIDIA GPUs. With AWS Neuron, look into this instead: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/index.html
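For reference, a hedged sketch of how Neuron devices are typically passed to a container (the device path and host-driver requirement follow the AWS Neuron container docs linked above; the image tag is illustrative, and this is not something Infinity itself documents):

```shell
# Sketch only: Neuron devices are character devices on the host, so they are
# passed with --device rather than --gpus (which is NVIDIA-specific).
# The host also needs the Neuron driver (aws-neuronx-dkms) installed.
docker run -p 7997:7997 \
  --device=/dev/neuron0 \
  michaelf34/infinity:0.0.52 \
  v2 --port 7997 \
  --model-id sentence-transformers/all-MiniLM-L6-v2
```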

marcomarinodev commented 1 month ago

I tried adding --gpus all, but I get the following error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

It looks like the Infinity Docker image needs to be compliant with AWS Deep Learning Containers, though: even adding --device=/dev/neuron0 didn't work, because I see this in the Infinity logs:

sentence_transformers.SentenceTransformer                              
INFO: Use pytorch device_name: cpu

Also, if I try to run neuron-ls inside the container, the command is not found. Therefore I was wondering if you have the code for running the benchmarks on AWS Inferentia.

tsensei commented 1 month ago

I don't know much about AWS machines, but on my NVIDIA T4 Azure machine I had to make sure the NVIDIA driver for Ubuntu, the CUDA toolkit, cuDNN, and the NVIDIA Container Toolkit were all installed, if that helps.
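On NVIDIA machines, that checklist can be verified with two commands (sketch; the CUDA image tag is illustrative):

```shell
# 1. Verify the host driver is working:
nvidia-smi

# 2. Verify the container toolkit can expose the GPU inside a container;
#    if this fails with the libnvidia-ml.so.1 error seen below, the toolkit
#    or driver is missing on the host.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```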

marcomarinodev commented 1 month ago

I think @michaelfeil can help here

michaelfeil commented 1 month ago

@marcomarinodev A word of warning: I have not used Neuron in the last 4 months.

Playbook:

@jimburtoft from AWS provided some initial guidance for me to better integrate Inferentia.

jimburtoft commented 1 month ago

@marcomarinodev You should use the Hugging Face AMI from the marketplace because it has all the drivers and libraries installed. The 10/8/24 version includes Neuron SDK 2.20. There is no charge for the image, just the instance.

In order to run a model on Inferentia, it needs to be compiled. Hugging Face does this inline for some models, but not these. I pre-compiled https://huggingface.co/aws-neuron/all-MiniLM-L6-v2-neuron for SDK 2.20, so you should be able to deploy it directly from HF.

If that works, other models can be compiled using the instructions in the model card. If the compilation process fails, support may need to be added to some of the Neuron libraries.

If you really want to make a docker file, you would need to install the Neuron libraries AND make sure the host image has the drivers installed. See https://github.com/huggingface/optimum-neuron/blob/018296c824ebae87cb00cc23f75b4493a5d9114e/text-generation-inference/Dockerfile#L92 for an example.

marcomarinodev commented 1 month ago

So, in order to have a model available in Infinity, should I first compile it so that it becomes compatible with the Neuron architecture?

jimburtoft commented 1 month ago

For the most part, yes. There are some edge cases if you are using the Hugging Face Optimum Neuron library. But, if you can't compile it with the "optimum-cli export neuron" command, it won't run on Neuron in Infinity.
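For context, the compile step looks roughly like this (the export subcommand is from the optimum-neuron CLI mentioned above; the batch size, sequence length, and output directory are illustrative values, and some models need an explicit --task flag):

```shell
# On a Neuron instance (e.g. inf2) with the Neuron SDK installed:
pip install optimum-neuron

# Compile the model ahead of time with fixed input shapes; Neuron requires
# static batch size and sequence length at compile time.
optimum-cli export neuron \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --batch_size 1 \
  --sequence_length 128 \
  all-MiniLM-L6-v2-neuron/
```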

marcomarinodev commented 1 month ago

> @marcomarinodev A word of warning: I have not used Neuron in the last 4 months.
>
> Playbook:
>
> @jimburtoft from AWS provided some initial guidance for me to better integrate Inferentia.
>
>   • Is there a way to build a dockerfile.

I tried your suggestion, but the --engine neuron option is missing. When I try to run infinity_emb v2 --model-id sentence-transformers/all-MiniLM-L6-v2 --engine neuron I get:

Invalid value for '--engine': 'neuron' is not one of 'torch', 'ctranslate2', 'optimum', 'debugengine'. 

Any suggestions?

marcomarinodev commented 1 month ago

Hi @michaelfeil, any thoughts on --engine neuron not being available?

michaelfeil commented 1 month ago

@marcomarinodev Just added the engine to the cli, main branch only.

# using the AMI with torch installed
git clone https://github.com/michaelfeil/infinity
cd infinity/libs/infinity_emb
# install pip deps without overwriting the existing neuron installation
pip install . --no-deps 
pip install uvicorn fastapi orjson typer hf_transfer rich posthog huggingface_hub prometheus-fastapi-instrumentator  
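After a --no-deps install it is worth confirming that the Neuron stack is still importable before launching the server; optimum.neuron is the module the ImportError further down complains about. A small sketch (the module list is an assumption about what the Neuron engine needs):

```python
# Sanity check: confirm the Neuron stack survived the --no-deps install.
import importlib.util


def available(mod: str) -> bool:
    """True if `mod` (including any parent packages) can be imported."""
    try:
        return importlib.util.find_spec(mod) is not None
    except ModuleNotFoundError:
        # A missing parent package raises instead of returning None.
        return False


for mod in ("optimum.neuron", "torch_neuronx", "infinity_emb"):
    print(f"{mod}: {'ok' if available(mod) else 'MISSING'}")
```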

Run command

infinity_emb v2 --engine neuron --model-id BAAI/bge-small-en-v1.5
infinity_emb v2 --engine neuron
INFO:     Started server process [2287105]
INFO:     Waiting for application startup.
INFO     2024-10-18 10:49:20,247 infinity_emb INFO: model=`michaelfeil/bge-small-en-v1.5` selected, using engine=`neuron` and      select_model.py:68
         device=`None`                                                                                                                               
ERROR:    Traceback (most recent call 
marcomarinodev commented 1 month ago

@michaelfeil I executed your commands and probably got the same error as yours (inf2.8xlarge with Amazon Linux 2):

[ec2-user@ip-XX-XXX-XXX-XXXinfinity_emb]$ infinity_emb v2 --engine neuron --model-id sentence-transformers/all-MiniLM-L6-v2
INFO:     Started server process [3214]
INFO:     Waiting for application startup.
INFO     2024-10-21 10:04:54,812 infinity_emb INFO: Creating 1engines: engines=['sentence-transformers/all-MiniLM-L6-v2']            infinity_server.py:88
INFO     2024-10-21 10:04:54,815 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`.  telemetry.py:30
INFO     2024-10-21 10:04:54,820 infinity_emb INFO: model=`sentence-transformers/all-MiniLM-L6-v2` selected, using engine=`neuron` and device=`None`  select_model.py:64
ERROR:    Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.12/site-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/local/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/infinity_server.py", line 92, in lifespan
    app.engine_array = AsyncEngineArray.from_args(engine_args_list)  # type: ignore
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/engine.py", line 289, in from_args
    return cls(engines=tuple(engines))
                       ^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/engine.py", line 68, in from_args
    engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/engine.py", line 55, in __init__
    self._model, self._min_inference_t, self._max_inference_t = select_model(self._engine_args)
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/inference/select_model.py", line 72, in select_model
    loaded_engine = unloaded_engine.value(engine_args=engine_args)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/transformer/embedder/neuron.py", line 81, in __init__
    CHECK_OPTIMUM_NEURON.mark_required()
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/_optional_imports.py", line 46, in mark_required
    self._raise_error()
  File "/home/ec2-user/.local/lib/python3.12/site-packages/infinity_emb/_optional_imports.py", line 57, in _raise_error
    raise ImportError(msg)
ImportError: optimum.neuron is not available. install via `pip install infinity-emb[neuronx]`

ERROR:    Application startup failed. Exiting.

Then I checked whether infinity-emb was installed:

[ec2-user@ip-XX-XXX-XXX-XXXinfinity_emb]$ pip3.12 install infinity-emb[neuronx]
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Requirement already satisfied: infinity-emb[neuronx] in /home/ec2-user/.local/lib/python3.12/site-packages (0.0.66)
WARNING: infinity-emb 0.0.66 does not provide the extra 'neuronx'
Requirement already satisfied: hf_transfer>=0.1.5 in /home/ec2-user/.local/lib/python3.12/site-packages (from infinity-emb[neuronx]) (0.1.8)
Requirement already satisfied: huggingface_hub in /home/ec2-user/.local/lib/python3.12/site-packages (from infinity-emb[neuronx]) (0.26.0)
Requirement already satisfied: numpy<2,>=1.20.0 in /home/ec2-user/.local/lib/python3.12/site-packages (from infinity-emb[neuronx]) (1.26.4)
Requirement already satisfied: filelock in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (3.16.1)
Requirement already satisfied: fsspec>=2023.5.0 in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (2024.10.0)
Requirement already satisfied: packaging>=20.9 in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (6.0.2)
Requirement already satisfied: requests in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (2.32.3)
Requirement already satisfied: tqdm>=4.42.1 in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (4.66.5)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/ec2-user/.local/lib/python3.12/site-packages (from huggingface_hub->infinity-emb[neuronx]) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/ec2-user/.local/lib/python3.12/site-packages (from requests->huggingface_hub->infinity-emb[neuronx]) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /home/ec2-user/.local/lib/python3.12/site-packages (from requests->huggingface_hub->infinity-emb[neuronx]) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /home/ec2-user/.local/lib/python3.12/site-packages (from requests->huggingface_hub->infinity-emb[neuronx]) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /home/ec2-user/.local/lib/python3.12/site-packages (from requests->huggingface_hub->infinity-emb[neuronx]) (2024.8.30)
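The WARNING: infinity-emb 0.0.66 does not provide the extra 'neuronx' line can be cross-checked directly, since a package's declared extras live in its installed metadata. A small standard-library sketch (declared_extras is a helper name chosen here):

```python
# Sketch: list the extras a distribution actually declares. pip warns when a
# requested extra (like "neuronx") is not in this list, and then installs the
# base package without it.
from importlib.metadata import metadata


def declared_extras(dist: str) -> list[str]:
    """Return the extras declared by an installed distribution."""
    return metadata(dist).get_all("Provides-Extra") or []


# e.g. declared_extras("infinity-emb") on the failing machine would show
# whether "neuronx" is a published extra of the installed wheel.
```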
michaelfeil commented 1 month ago

@marcomarinodev `pip install infinity-emb[neuronx]` was auto-generated; it's currently not an option, and installing it via pip would be a complicated setup anyway. It seems like you did not use the commands above to install, since transformers-neuronx is missing on your AMI; it's there by default. Maybe you created a venv, or overwrote the existing transformers-neuronx installation?