michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License

HF_HOME not respected #194

Closed WinsonSou closed 2 months ago

WinsonSou commented 3 months ago

System Info

root@infinity-embeddings-deployment-7b9f45cfcc-vrj9j:/app/.cache# printenv
KUBERNETES_SERVICE_PORT_HTTPS=443
NVIDIA_VISIBLE_DEVICES=GPU-71058c1d-14e4-c507-8f62-2c67e4d8b154
KUBERNETES_SERVICE_PORT=443
INFINITY_EMBEDDINGS_SERVICE_SERVICE_HOST=172.19.41.16
PYTHONUNBUFFERED=1
HOSTNAME=infinity-embeddings-deployment-7b9f45cfcc-vrj9j
VLLM_SERVICE_PORT=tcp://172.19.47.119:8000
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
VLLM_SERVICE_PORT_8000_TCP_PROTO=tcp
HUGGING_FACE_HUB_TOKEN=xxxxx
PWD=/app/.cache
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_CUDA_CUDART_VERSION=12.1.55-1
HOME=/root
KUBERNETES_PORT_443_TCP=tcp://172.19.0.1:443
VLLM_SERVICE_SERVICE_HOST=172.19.47.119
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
SENTENCE_TRANSFORMERS_HOME=/app/.cache/torch
CUDA_VERSION=12.1.0
POETRY_VIRTUALENVS_IN_PROJECT=true
PIP_DEFAULT_TIMEOUT=100
EXTRAS=all
POETRY_NO_INTERACTION=1
**HF_HOME=/root/.cache/huggingface**
TERM=xterm
PIP_DISABLE_PIP_VERSION_CHECK=on
PYTHON=python3.11
INFINITY_EMBEDDINGS_SERVICE_PORT=tcp://172.19.41.16:8000
SHLVL=1
NVARCH=x86_64
KUBERNETES_PORT_443_TCP_PROTO=tcp
INFINITY_EMBEDDINGS_SERVICE_PORT_8000_TCP_PORT=8000
VLLM_SERVICE_PORT_8000_TCP_ADDR=172.19.47.119
INFINITY_EMBEDDINGS_SERVICE_SERVICE_PORT=8000
KUBERNETES_PORT_443_TCP_ADDR=172.19.0.1
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
VLLM_SERVICE_PORT_8000_TCP_PORT=8000
KUBERNETES_SERVICE_HOST=172.19.0.1
KUBERNETES_PORT=tcp://172.19.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
INFINITY_EMBEDDINGS_SERVICE_PORT_8000_TCP_ADDR=172.19.41.16
VLLM_SERVICE_PORT_8000_TCP=tcp://172.19.47.119:8000
PATH=/app/.venv/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
INFINITY_EMBEDDINGS_SERVICE_PORT_8000_TCP_PROTO=tcp
INFINITY_EMBEDDINGS_SERVICE_PORT_8000_TCP=tcp://172.19.41.16:8000
PIP_NO_CACHE_DIR=off
VLLM_SERVICE_SERVICE_PORT=8000
_=/usr/bin/printenv
OLDPWD=/app

root@infinity-embeddings-deployment-7b9f45cfcc-vrj9j:/app/.cache# pwd
**/app/.cache**
root@infinity-embeddings-deployment-7b9f45cfcc-vrj9j:/app/.cache# du -h
4.0K    ./torch/models--Salesforce--SFR-Embedding-Mistral/snapshots/938c560d1c236aa563b2dbdf084f28ab28bccb11/1_Pooling
24K ./torch/models--Salesforce--SFR-Embedding-Mistral/snapshots/938c560d1c236aa563b2dbdf084f28ab28bccb11
28K ./torch/models--Salesforce--SFR-Embedding-Mistral/snapshots
8.0K    ./torch/models--Salesforce--SFR-Embedding-Mistral/refs
14G ./torch/models--Salesforce--SFR-Embedding-Mistral/blobs
4.0K    ./torch/models--Salesforce--SFR-Embedding-Mistral/.no_exist/938c560d1c236aa563b2dbdf084f28ab28bccb11
8.0K    ./torch/models--Salesforce--SFR-Embedding-Mistral/.no_exist
14G ./torch/models--Salesforce--SFR-Embedding-Mistral
4.0K    ./torch/.locks/models--Salesforce--SFR-Embedding-Mistral
8.0K    ./torch/.locks
14G ./torch
14G

I set HF_HOME to /root/.cache/huggingface.

However, the model is still downloaded to /app/.cache/torch, which is the directory SENTENCE_TRANSFORMERS_HOME points to.

Reproduction

Run the Docker image with the HF_HOME environment variable set.

Expected behavior

HF_HOME is respected: the model is downloaded under the directory it points to.
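To reproduce the check above without reading `printenv` and `du` output by hand, a small diagnostic along the following lines could be run inside the container. It is only a sketch: the helper names `find_model_dirs` and `report_caches` are made up for illustration, and the script assumes that downloaded models sit in `models--<org>--<name>` directories, as they do in the `du` listing above.

```python
import os
from pathlib import Path
from typing import Mapping

# Environment variables that can influence where model files are cached.
CACHE_ENV_VARS = ("HF_HOME", "HF_HUB_CACHE", "SENTENCE_TRANSFORMERS_HOME")


def find_model_dirs(root: Path) -> list[Path]:
    """Return every `models--*` directory at or below `root`, ignoring lock dirs."""
    if not root.is_dir():
        return []
    return sorted(
        p
        for p in root.rglob("models--*")
        if p.is_dir() and ".locks" not in p.parts
    )


def report_caches(env: Mapping[str, str]) -> dict[str, list[Path]]:
    """Map each cache variable that is set to the model directories found under it."""
    return {
        name: find_model_dirs(Path(env[name]))
        for name in CACHE_ENV_VARS
        if env.get(name)
    }


if __name__ == "__main__":
    for name, model_dirs in report_caches(os.environ).items():
        print(f"{name}={os.environ[name]}: {len(model_dirs)} model dir(s)")
        for model_dir in model_dirs:
            print(f"  {model_dir}")
```

With the environment shown in this report, such a script would be expected to list the Salesforce/SFR-Embedding-Mistral directory under SENTENCE_TRANSFORMERS_HOME and nothing under HF_HOME.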

WinsonSou commented 3 months ago

The workaround is to mount a volume at /app/.cache/torch.

michaelfeil commented 3 months ago

Thanks for posting the workaround with the issue.

It looks like you are doing nothing wrong here; HF_HOME simply was not respected.

WinsonSou commented 3 months ago

Nope, I'm just running the default Docker image:

docker run -it --gpus all -p $port:$port michaelf34/infinity:latest --model-name-or-path Salesforce/SFR-Embedding-Mistral --port $port --env HF_HOME /root/.cache/huggingface -v /modelcache:/root/.cache/huggingface

michaelfeil commented 2 months ago

Okay, potentially it's because I set a default in the Dockerfile: ENV SENTENCE_TRANSFORMERS_HOME=app/.cache. I'll address it in #195.

Planning to close this issue then! Thanks for making me aware of this!
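The suspected cause, a baked-in SENTENCE_TRANSFORMERS_HOME shadowing HF_HOME, can be sketched as a simple precedence rule. The function below is hypothetical and is not taken from infinity or sentence-transformers; it only models the behaviour observed in this thread, under the assumption that a more specific cache variable wins over HF_HOME. The `hub` subdirectory reflects the default Hugging Face Hub cache layout.

```python
import posixpath
from typing import Mapping, Optional


def resolve_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Hypothetical resolution order that mirrors the behaviour in this thread.

    An explicit SENTENCE_TRANSFORMERS_HOME wins; HF_HOME is consulted
    only when that variable is unset or empty.
    """
    explicit = env.get("SENTENCE_TRANSFORMERS_HOME")
    if explicit:
        return explicit
    hf_home = env.get("HF_HOME")
    if hf_home:
        # By default, the Hub cache lives in the "hub" subdirectory of HF_HOME.
        return posixpath.join(hf_home, "hub")
    return None


# Values taken from the printenv output in the report.
reported = {
    "SENTENCE_TRANSFORMERS_HOME": "/app/.cache/torch",
    "HF_HOME": "/root/.cache/huggingface",
}
print(resolve_cache_dir(reported))  # /app/.cache/torch

# Without the Dockerfile default, HF_HOME decides the location.
print(resolve_cache_dir({"HF_HOME": "/root/.cache/huggingface"}))
# /root/.cache/huggingface/hub
```

Under this model, setting HF_HOME alone cannot move the cache while the image default is present, which matches the report; either the default has to go or the volume has to be mounted at the path the default points to.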

michaelfeil commented 2 months ago

This is working now. Enjoy! Please make sure to pin the version, and test expected behaviour as you upgrade.

docker run -it --gpus all -e HF_HOME=/root/.cache/huggingface -v ./modelcache:/root/.cache michaelf34/infinity:0.0.32

WinsonSou commented 2 months ago

Thank you Michael!