huggingface / Google-Cloud-Containers

Including Hugging Face Deep Learning Containers for Google Cloud
Apache License 2.0

Download model files from GCS (Instead of HF Hub) #73

Open rm-jeremyduplessis opened 2 weeks ago

rm-jeremyduplessis commented 2 weeks ago

When deploying an HF model to Vertex AI, I would like to download a fine-tuned model from GCS, instead of from HF Hub, like so:

import os

from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name="my-model",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "AIP_STORAGE_URI": "gs://path/to/model/files",
    },
    serving_container_ports=[8080],
)
model.wait()

I would expect this to be supported since the entrypoint script logic should handle this: https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/tei/cpu/1.4.0/entrypoint.sh
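
For reference, the relevant entrypoint behavior is conceptually: if AIP_STORAGE_URI is set, download the model artifacts from that GCS path into a local directory before starting TEI. Below is a rough Python sketch of that idea (the actual entrypoint is a shell script, and this sketch assumes the google-cloud-storage client; it only illustrates the expected behavior):

import os

from google.cloud import storage

def download_model_from_gcs(aip_storage_uri: str, target_dir: str = "/tmp/model") -> str:
    """Download every object under a gs://bucket/prefix URI into target_dir."""
    bucket_name, _, prefix = aip_storage_uri.removeprefix("gs://").partition("/")
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip directory placeholders
            continue
        local_path = os.path.join(target_dir, os.path.relpath(blob.name, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
    return target_dir

# The container would then launch TEI pointing at the downloaded directory
if os.getenv("AIP_STORAGE_URI"):
    model_dir = download_model_from_gcs(os.environ["AIP_STORAGE_URI"])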

Will this be supported when v1.4 is released? And when is that release expected?

philschmid commented 2 weeks ago

Thank you for opening the issue @rm-jeremyduplessis. We are working with the Google team to get an ETA and will report back here.

alvarobartt commented 2 weeks ago

Hi @rm-jeremyduplessis, TEI 1.4 was just released for GPU only (the CPU image is still under review). I'll update the references within the repository now; in the meantime, feel free to use the following container URI:

us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204

You should be able to provide the artifact_uri pointing to the Google Cloud Storage (GCS) bucket that contains the model, as shown below:

from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://path/to/model/files",
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204",
    serving_container_ports=[8080],
)
model.wait()

[!NOTE] I updated your code snippet to use artifact_uri instead, which is essentially the same as using AIP_STORAGE_URI; the minor difference is that this way the AIP_STORAGE_URI environment variable is set internally by the python-aiplatform SDK, which is the recommended way, even though you can easily bypass that by setting the environment variable yourself as you did above. So feel free to go with whatever works better for your use case, while keeping in mind that using artifact_uri is the recommended approach!

rm-jeremyduplessis commented 2 weeks ago

Hi @alvarobartt,

Thanks so much for this!

I've been able to deploy intfloat/e5-small (no fine-tuning) to a Vertex AI Endpoint to test, like so:

# assumes `endpoint` was created beforehand, e.g. via aiplatform.Endpoint.create(display_name="my-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
    sync=True
)
model.wait()

And while it did work initially, I've found what I think is a bug. I'm not sure exactly what's causing it, but if you call the model with a single input string a number of times, it eventually starts returning null values. You can try this locally like so:

[Screenshot (2024-08-30 14:47): local snippet reproducing the null vectors]
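
Roughly, that local check looks like the following sketch, assuming a TEI container already listening on localhost:8080 and using TEI's /embed route (the exact code in the screenshot may differ):

import requests

# Repeatedly embed the same sentence against a locally running TEI container
# and flag when the returned vector collapses to nulls/zeros.
url = "http://localhost:8080/embed"  # assumed local port mapping for the container
payload = {"inputs": "This is a test sentence."}

for i in range(100):
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    embedding = response.json()[0]  # /embed returns one vector per input
    if all(value is None or value == 0.0 for value in embedding):
        print(f"Request {i}: received a null/zero vector")
    else:
        print(f"Request {i}: OK (dim={len(embedding)})")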

Then, once it starts returning null vectors, that's all it returns. This also happens for batches after a number of calls.

I checked the endpoint metrics like CPU usage etc. and I can confirm it's not an OOM error or anything like that -- have you experienced this before?

UPDATE: I've tried this with a number of models locally (on a T4) and the same thing happens with various other embedding models on TEI 1.4, but it doesn't appear to happen when using TEI 1.2 with the same models (e.g. intfloat/e5-large-v2).

alvarobartt commented 2 weeks ago

Thanks for reporting this @rm-jeremyduplessis, and for the detailed issue; we'll investigate and come back to you. Did you experience the same with the ghcr.io/huggingface/text-embeddings-inference container, i.e. the one published from the https://github.com/huggingface/text-embeddings-inference repository? Thanks again!

rm-jeremyduplessis commented 1 week ago

@alvarobartt I just tested ghcr.io/huggingface/text-embeddings-inference:turing-1.4 with intfloat/e5-large-v2 on an NVIDIA T4 and it appears to work fine (no null vectors). I notice there are different images for the Turing and Ampere architectures; is this also the case with these new GCP images, or does the configuration happen under the hood at runtime?

alvarobartt commented 1 week ago

Hmm, that's odd @rm-jeremyduplessis. I just tried with the same model on the Hugging Face DLC for TEI on Vertex AI, and these are the results of 32 consecutive requests:

[Image: results of the 32 consecutive requests]

What I've seen is that in your code you're using requests instead of sending the requests via the Vertex AI SDK for Python; is there any reason for that? You can find the Jupyter Notebook attached to reproduce on your end. I guess the settings are the same, so the only difference may be in the inference code.

[Attachment: Jupyter Notebook to reproduce]
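
For reference, sending the requests via the Vertex AI SDK for Python looks roughly like the sketch below; the endpoint resource name is hypothetical, and the {"inputs": ...} instance format is assumed here for the TEI DLC:

from google.cloud import aiplatform

# Hypothetical endpoint resource name from the deployment above
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

# The TEI DLC on Vertex AI is assumed to accept {"inputs": ...} instances;
# Vertex AI forwards the request to the container's /predict route under the hood.
prediction = endpoint.predict(instances=[{"inputs": "This is a test sentence."}])
print(prediction.predictions)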

alvarobartt commented 1 week ago

P.S. Also @rm-jeremyduplessis, note that by default Vertex AI sends the prediction requests to the /predict route, even though underneath it calls /embed; so you should ideally use /predict instead when working on Vertex AI.
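
To make the route difference concrete when testing the container directly, the two routes take different payload shapes. The sketch below assumes the /predict route follows Vertex AI's {"instances": [...]} convention and returns a {"predictions": [...]} body; the exact shapes may differ:

import requests

base = "http://localhost:8080"
text = "This is a test sentence."

# TEI's native route: plain TEI payload, returns one embedding per input.
tei_response = requests.post(f"{base}/embed", json={"inputs": text}, timeout=30)

# Vertex AI-style route: payload wrapped in "instances", response wrapped in "predictions" (assumed).
vertex_response = requests.post(
    f"{base}/predict", json={"instances": [{"inputs": text}]}, timeout=30
)

print(tei_response.json()[0][:5])                     # first values of the /embed vector
print(vertex_response.json()["predictions"][0][:5])   # same embedding via /predict (assumed shape)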