What does this PR do?
This PR adds a `tools` script to easily add LLMs to the public cache. It leverages the `optimum-cli` behind the scenes and checks that the version matches. There are currently two ways to use it:
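The version check guards against pushing artifacts built with a mismatched toolchain into the shared cache. As a minimal sketch of what such a guard could look like (the expected version value and the exact comparison are assumptions, not taken from this PR):

```python
# Hypothetical version guard; the real script's check may differ.
from importlib.metadata import version, PackageNotFoundError

EXPECTED_OPTIMUM_NEURON = "0.0.20"  # placeholder value, not from this PR

try:
    installed = version("optimum-neuron")
except PackageNotFoundError as e:
    raise RuntimeError("optimum-neuron is not installed") from e

if installed != EXPECTED_OPTIMUM_NEURON:
    raise RuntimeError(
        f"Found optimum-neuron {installed}, expected {EXPECTED_OPTIMUM_NEURON}; "
        "compiled artifacts would not match the public cache."
    )
```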
Single compilation:
```bash
huggingface-cli login --token hf_xxx  # access to cache repo
python tools/auto_fill_inference_cache.py --hf_model_id "HuggingFaceH4/zephyr-7b-beta" --batch_size 1 --sequence_length 2048 --num_cores 2 --auto_cast_type fp16
```
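Under the hood, a single compilation amounts to exporting the model with static shapes through `optimum-cli`. A rough sketch of that call, assuming the script shells out to `optimum-cli export neuron` (the exact flags it forwards and the output directory name are assumptions):

```python
# Sketch: forward the CLI arguments to an optimum-cli neuron export.
import subprocess

def compile_model(hf_model_id: str, batch_size: int, sequence_length: int,
                  num_cores: int, auto_cast_type: str, output_dir: str) -> None:
    subprocess.run(
        [
            "optimum-cli", "export", "neuron",
            "--model", hf_model_id,
            "--batch_size", str(batch_size),
            "--sequence_length", str(sequence_length),
            "--num_cores", str(num_cores),
            "--auto_cast_type", auto_cast_type,
            output_dir,
        ],
        check=True,  # surface compilation failures immediately
    )

compile_model("HuggingFaceH4/zephyr-7b-beta", 1, 2048, 2, "fp16", "zephyr-7b-beta-neuron")
```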
File-based compilation:
```bash
python tools/auto_fill_inference_cache.py --config_file test.json
```
with a config file like:
{ "openai-community/gpt2": [ { "batch_size": 1, "sequence_length": 1024, "num_cores": 1, "auto_cast_type": "fp16" } ], "meta-llama/Llama-2-7b-chat-hf": [ { "batch_size": 1, "sequence_length": 4096, "num_cores": 2, "auto_cast_type": "fp16" }, { "batch_size": 1, "sequence_length": 4096, "num_cores": 8, "auto_cast_type": "fp16" } ],
Remote file-based config:
```bash
python tools/auto_fill_inference_cache.py --config_file https://huggingface.co/aws-neuron/optimum-neuron-cache/raw/main/inference-cache-config/gpt2.json
```
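Accepting a URL for `--config_file` only needs a small dispatch between remote and local sources. A sketch of that logic, assuming plain HTTP fetching (the real script may resolve the file differently, e.g. via `huggingface_hub`):

```python
# Load the JSON config from a URL or a local path.
import json
import urllib.request

def load_config(config_file: str) -> dict:
    if config_file.startswith(("http://", "https://")):
        with urllib.request.urlopen(config_file) as resp:
            return json.load(resp)
    with open(config_file) as f:
        return json.load(f)

config = load_config(
    "https://huggingface.co/aws-neuron/optimum-neuron-cache/raw/main/inference-cache-config/gpt2.json"
)
```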
The configs can be found in the [aws-neuron/optimum-neuron-cache](https://huggingface.co/aws-neuron/optimum-neuron-cache) repository under `inference-cache-config`.
Tested with