rm-jeremyduplessis opened 2 weeks ago
Thank you for opening the issue @rm-jeremyduplessis. We are working with the Google team to get an ETA. Will report back here.
Hi here @rm-jeremyduplessis, TEI 1.4 was just released for GPU only (CPU is still under review). I'll go and update the references within the repository now; in the meantime, feel free to use the following container URI:
You should be able to provide the `artifact_uri` pointing to the Google Cloud Storage (GCS) bucket that contains the model, as shown below:
```python
from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://path/to/model/files",
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204",
    serving_container_ports=[8080],
)
model.wait()
```
> [!NOTE]
> I updated your code snippet to use `artifact_uri` instead, which is essentially the same as using `AIP_STORAGE_URI`, with the minor difference that this way the `AIP_STORAGE_URI` environment variable is set internally by the `python-aiplatform` SDK, which is the recommended way; although you can easily bypass that by setting the environment variable yourself, as you did above. So feel free to go with whatever works better for your use case, while keeping in mind that using `artifact_uri` is the recommended approach!
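For completeness, here is a hedged sketch of the bypass mentioned in the note above: setting `AIP_STORAGE_URI` yourself rather than passing `artifact_uri`. The `serving_container_environment_variables` parameter comes from the `python-aiplatform` SDK; the bucket path is a placeholder.

```python
# Hypothetical sketch (not from the thread verbatim): instead of passing
# `artifact_uri`, hand AIP_STORAGE_URI to the serving container yourself.

def build_env(storage_uri):
    # Environment variables passed to the serving container at startup
    return {"AIP_STORAGE_URI": storage_uri}

def upload_model(storage_uri):
    # Requires google-cloud-aiplatform and GCP credentials; import kept local
    # so build_env stays importable offline.
    from google.cloud import aiplatform
    return aiplatform.Model.upload(
        display_name="my-model",
        serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204",
        serving_container_environment_variables=build_env(storage_uri),
        serving_container_ports=[8080],
    )

# Usage (on GCP):
#   model = upload_model("gs://path/to/model/files")
#   model.wait()
```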
Hi @alvarobartt,
Thanks so much for this!
I've been able to deploy `intfloat/e5-small` (no fine-tuning) to a Vertex AI Endpoint to test like so:
```python
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
    sync=True,
)
model.wait()
```
And while it did work initially, I've found what I think is a bug. I'm not sure exactly what's causing it, but if you call the model with a single input string a number of times, it eventually starts returning `null` values. You can try this locally like so:
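A minimal reproduction sketch (mine, not the original reporter's snippet), assuming a TEI container listening locally on port 8080 and the `{"inputs": ...}` payload on `/embed`; the loop count and input text are arbitrary:

```python
def has_nulls(embedding):
    # JSON null values decode to Python None
    return any(value is None for value in embedding)

def first_null_request(url="http://localhost:8080/embed", n=100):
    # Requires `requests` and a running TEI container; import kept local so the
    # helper above stays importable offline.
    import requests
    for i in range(n):
        response = requests.post(url, json={"inputs": "hello world"})
        response.raise_for_status()
        embedding = response.json()[0]  # /embed returns a list of embeddings
        if has_nulls(embedding):
            return i  # index of the first request that came back with nulls
    return None

# Usage (with a TEI container running locally):
#   print(first_null_request())
```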
Then once it starts returning `null` vectors, that's all it returns. This also happens for batches after a number of calls.
I checked the endpoint metrics like CPU usage etc. and I can confirm it's not an OOM error or anything like that -- have you experienced this before?
UPDATE: I've tried this with a number of models locally (with a T4) and the same thing happens with various other embedding models on TEI 1.4, but it doesn't appear to happen when using TEI 1.2 with the same models (e.g. `intfloat/e5-large-v2`).
Thanks for reporting @rm-jeremyduplessis, and for the detailed issue; we'll investigate this and come back to you. Did you experience the same with the `ghcr.io/huggingface/text-embeddings-inference` container, i.e. the one published from the https://github.com/huggingface/text-embeddings-inference repository? Thanks again!
@alvarobartt I just tested with `ghcr.io/huggingface/text-embeddings-inference:turing-1.4` and `intfloat/e5-large-v2` on an NVIDIA T4, and it appears to work fine (no `null` vectors). I notice there are different images for the Turing/Ampere architectures; is this the case with these new GCP images, or does the configuration happen under the hood at runtime?
Hmm, that's odd @rm-jeremyduplessis. I just tried with the same model on the Hugging Face DLC for TEI on Vertex AI, and these are the results of 32 consecutive requests:
I've noticed that in your code you're using `requests` instead of sending those requests via the Vertex AI SDK for Python; is there any reason for that? Here you can find the attached Jupyter Notebook to reproduce on your end. I guess the settings are the same, so the only difference may be in the inference code.
P.S. Also @rm-jeremyduplessis, note that by default Vertex AI sends the prediction requests to the `/predict` route, even though underneath it calls `/embed`, so you should ideally use `/predict` when working on Vertex AI.
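To illustrate, a sketch of sending the request through the Vertex AI SDK, which routes it to `/predict` for you. The endpoint resource name is a placeholder, and the `{"inputs": ...}` instance shape is my assumption based on TEI's `/embed` schema:

```python
def build_instances(texts):
    # One instance per input text, in the shape the TEI route is assumed to expect
    return [{"inputs": text} for text in texts]

def embed(endpoint_name, texts):
    # Requires google-cloud-aiplatform and GCP credentials; import kept local
    # so build_instances stays importable offline.
    from google.cloud import aiplatform
    endpoint = aiplatform.Endpoint(endpoint_name)
    return endpoint.predict(instances=build_instances(texts)).predictions

# Usage (on GCP, against a deployed endpoint):
#   vectors = embed(
#       "projects/my-project/locations/us-central1/endpoints/1234567890",
#       ["hello world"],
#   )
```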
When deploying an HF model to Vertex AI, I would like to download a fine-tuned model from GCS, instead of from the HF Hub, like so:
I would expect this to be supported since the entrypoint script logic should handle this: https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/tei/cpu/1.4.0/entrypoint.sh
Will this be supported when v1.4 is released? And when will that be?