opendatahub-io / caikit-tgis-serving

[Bug] TGIS container fails to run on a FIPS cluster #130

Closed bdattoma closed 7 months ago

bdattoma commented 11 months ago

When deploying an LLM using the new Caikit+TGIS architecture introduced with #107, the TGIS container (i.e., transformer-container) fails to start if the cluster has FIPS cryptography enabled.

These are the two errors I got in the container logs:

There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.

fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE

Note: TRANSFORMERS_CACHE is actually set in the ServingRuntime.

This was found on an OpenShift 4.13.18 cluster with RHODS 2.1.2 (aka 1.32.2) and KServe 0.11 installed.
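The cache-folder warning quoted above can be checked separately from the FIPS failure. The following is a minimal sketch, not part of the runtime: the helper names `resolve_cache_dir` and `cache_dir_is_writable` are hypothetical, and the fallback path is an assumption that mirrors the default location shown in the log message. It reports which directory would be used and whether the container user can write to it.

```python
import os
import tempfile


def resolve_cache_dir(env=None):
    """Return the directory named by TRANSFORMERS_CACHE, or a fallback.

    The fallback (~/.cache/huggingface/hub) is an assumption based on the
    path printed in the warning, not something read from the library.
    """
    env = os.environ if env is None else env
    fallback = os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "hub")
    return env.get("TRANSFORMERS_CACHE") or fallback


def cache_dir_is_writable(path):
    """Return True if a temporary file can be created inside `path`."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating and discarding a temp file is the most direct write test.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


if __name__ == "__main__":
    cache_dir = resolve_cache_dir()
    state = "writable" if cache_dir_is_writable(cache_dir) else "NOT writable"
    print(f"{cache_dir}: {state}")
```

Running this inside the failing container (for example from a debug shell) would show whether the variable set in the ServingRuntime is actually visible to the process and points at a writable location.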

heyselbi commented 10 months ago

What needs to be done (more like notes to self):

bmcfeeters commented 8 months ago

As another data point, I have hit this issue with a FIPS-enabled OpenShift 4.13.12 cluster and Red Hat OpenShift Data Science operator 2.5.0.

It appears it is the tokenizer Python module that is causing the crash. From a debug container I see the same issue as noted here from huggingface.

Unfortunately, I have to redeploy my entire cluster now to make progress since FIPS cannot be disabled after OpenShift is fully deployed and running.
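For anyone reproducing this from a debug container, it can help to confirm first that the node really is in FIPS mode. A minimal sketch, assuming the standard Linux kernel flag file `/proc/sys/crypto/fips_enabled` is readable from inside the container (the function name `fips_enabled` is hypothetical):

```python
from pathlib import Path

# Kernel flag file: contains "1" when the kernel is running in FIPS mode.
FIPS_FLAG = "/proc/sys/crypto/fips_enabled"


def fips_enabled(flag_path=FIPS_FLAG):
    """Return True if the flag file exists and contains "1".

    A missing or unreadable file is treated as "not enabled", which is an
    assumption of this sketch rather than a guarantee about the host.
    """
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:
        return False


if __name__ == "__main__":
    print("FIPS mode:", "enabled" if fips_enabled() else "disabled or not reported")
```

This only reports the kernel-level setting. It does not exercise the OpenSSL self-test that fails in the logs above.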

bdattoma commented 8 months ago

@bmcfeeters thanks for sharing your case. This issue should be solved in the latest image versions of the runtime, which are going to be shipped with operator 2.6.0.

dtrifiro commented 7 months ago

Fixed in #171 (see this change). The failure was due to the way the virtualenv was being prepared in the Dockerfile.