huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Inference compile cache script #504

Closed philschmid closed 4 months ago

philschmid commented 4 months ago

What does this PR do?

This PR adds a script under tools/ to easily add LLMs to the public cache. It leverages optimum-cli behind the scenes and checks that the version matches. There are currently three ways to use it:

Single compilation:

huggingface-cli login --token hf_xxx # access to cache repo
python tools/auto_fill_inference_cache.py --hf_model_id "HuggingFaceH4/zephyr-7b-beta" --batch_size 1 --sequence_length 2048 --num_cores 2 --auto_cast_type fp16
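
Since the script leverages optimum-cli behind the scenes, the entry above corresponds roughly to a manual export like the following (a sketch; the output directory name is made up and exact flag names may vary across optimum-neuron versions):

optimum-cli export neuron \
  --model HuggingFaceH4/zephyr-7b-beta \
  --batch_size 1 \
  --sequence_length 2048 \
  --num_cores 2 \
  --auto_cast_type fp16 \
  zephyr-7b-beta-neuron/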

File-based compilation:

python tools/auto_fill_inference_cache.py --config_file test.json

with a config file such as:

{
  "openai-community/gpt2": [
    {
      "batch_size": 1,
      "sequence_length": 1024,
      "num_cores": 1,
      "auto_cast_type": "fp16"
    }
  ],
  "meta-llama/Llama-2-7b-chat-hf": [
    {
      "batch_size": 1,
      "sequence_length": 4096,
      "num_cores": 2,
      "auto_cast_type": "fp16"
    },
    {
      "batch_size": 1,
      "sequence_length": 4096,
      "num_cores": 8,
      "auto_cast_type": "fp16"
    }
  ]
}
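
The file simply maps each model id to a list of compilation configurations. As a rough, hypothetical illustration of how such a file can be consumed (this is not the actual implementation; it just re-invokes the single-compilation mode once per entry):

import json
import subprocess

# Load the config file and compile every (model, configuration) pair by
# delegating to the script in single-compilation mode.
with open("test.json") as f:
    configs = json.load(f)

for model_id, entries in configs.items():
    for entry in entries:
        subprocess.run(
            [
                "python", "tools/auto_fill_inference_cache.py",
                "--hf_model_id", model_id,
                "--batch_size", str(entry["batch_size"]),
                "--sequence_length", str(entry["sequence_length"]),
                "--num_cores", str(entry["num_cores"]),
                "--auto_cast_type", entry["auto_cast_type"],
            ],
            check=True,
        )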

Remote file-based config:

python tools/auto_fill_inference_cache.py --config_file https://huggingface.co/aws-neuron/optimum-neuron-cache/raw/main/inference-cache-config/gpt2.json
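
Passing a URL only changes how the config is loaded. A hypothetical helper sketching the idea (load_config and its behaviour are assumptions, not the script's actual code):

import json
import requests

def load_config(config_file: str) -> dict:
    # Hypothetical: accept either a local path or a remote URL for --config_file.
    if config_file.startswith(("http://", "https://")):
        response = requests.get(config_file, timeout=30)
        response.raise_for_status()
        return response.json()
    with open(config_file) as f:
        return json.load(f)

# Example with the remote gpt2 config used above.
config = load_config(
    "https://huggingface.co/aws-neuron/optimum-neuron-cache/raw/main/inference-cache-config/gpt2.json"
)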

The configs can be found in the aws-neuron/optimum-neuron-cache repository under inference-cache-config.
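
To check which configs already exist there, the repo can be listed with huggingface_hub (a small sketch, assuming access to the public repo):

from huggingface_hub import list_repo_files

# List the JSON config files stored under inference-cache-config/.
config_files = [
    f for f in list_repo_files("aws-neuron/optimum-neuron-cache")
    if f.startswith("inference-cache-config/")
]
print(config_files)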

philschmid commented 4 months ago

Tested with

python tools/auto_fill_inference_cache.py --config_file https://huggingface.co/aws-neuron/optimum-neuron-cache/raw/main/inference-cache-config/gpt2.json