kubeflow / fairing

Python SDK for building, training, and deploying ML models
Apache License 2.0
337 stars 144 forks

Kubeflow Fairing TrainJob creates an image with Root user and fairing job pod will not execute on AKS which has policy to not allow Docker containers running as Root user #525

Open pshah16 opened 4 years ago

pshah16 commented 4 years ago

/kind bug

What steps did you take and what happened:

I am running the simple fairing example shown here with the Microsoft Azure backend.


    from kubeflow import fairing
    from kubeflow.fairing import TrainJob
    from kubeflow.fairing.backends import KubeflowAzureBackend
    from kubeflow.fairing.kubernetes.utils import get_resource_mutator

    class Trainer(object):
        def train(self):
            print("hello world!")

    from kubeflow.fairing.builders.cluster.azurestorage_context import StorageContextSource
    BuildContext = StorageContextSource(
        region=AZURE_REGION,
        resource_group_name=AZURE_RESOURCE_GROUP,
        storage_account_name=AZURE_STORAGE_ACCOUNT
    )
    job = TrainJob(Trainer,
                   input_files=['ames_dataset/train.csv', "requirements.txt"],
                   docker_registry=DOCKER_REGISTRY,
                   base_docker_image=None,
                   backend=KubeflowAzureBackend(build_context_source=BuildContext))
    job.submit()


When the job.submit() command executes, I get the following messages (no errors). After that, the command never finishes and nothing further happens.

    [I 200722 19:15:28 azure:156] Creating secret 'storage-credentials-5a318d6e' in namespace 'pshah'
    [W 200722 19:15:29 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
    [W 200722 19:15:29 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
    [W 200722 19:15:29 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
    [W 200722 19:15:31 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...

When I checked the status of the fairing job using kubectl, I noticed the following:

    state:
      waiting:
        message: container has runAsNonRoot and image will run as root
        reason: CreateContainerConfigError
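To illustrate how this state could be surfaced instead of being hidden behind the repeated "Waiting for ... to start" messages, here is a minimal sketch. It is not Fairing code. It assumes the pod is available as a plain dict shaped like the output of `kubectl get pod <name> -o json`; `find_waiting_problem`, `wait_for_start`, `get_pod`, and the set of reasons treated as fatal are all hypothetical names and choices made for this example.

```python
import time

# Waiting reasons treated as fatal in this sketch (an assumption, not exhaustive).
FATAL_WAITING_REASONS = {"CreateContainerConfigError"}


def find_waiting_problem(pod):
    """Return (container_name, reason, message) for the first container that is
    waiting with a fatal reason, or None if there is no such container."""
    status = pod.get("status") or {}
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for container_status in statuses:
        waiting = (container_status.get("state") or {}).get("waiting")
        if waiting and waiting.get("reason") in FATAL_WAITING_REASONS:
            return (
                container_status.get("name"),
                waiting["reason"],
                waiting.get("message", ""),
            )
    return None


def wait_for_start(get_pod, timeout_s=300.0, poll_s=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_pod() until the pod leaves the Pending phase.

    Raises RuntimeError as soon as a fatal waiting reason is seen, and
    TimeoutError if the pod is still Pending after timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        pod = get_pod()
        problem = find_waiting_problem(pod)
        if problem is not None:
            name, reason, message = problem
            raise RuntimeError(
                f"container {name!r} cannot start: {reason}: {message}"
            )
        if (pod.get("status") or {}).get("phase") != "Pending":
            return pod
        if clock() >= deadline:
            raise TimeoutError(f"pod still Pending after {timeout_s} seconds")
        sleep(poll_s)
```

With the status shown above, `wait_for_start` would raise a `RuntimeError` that carries the `CreateContainerConfigError` reason and the "container has runAsNonRoot and image will run as root" message, rather than polling indefinitely.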

I checked with our cluster team, and they confirmed that our AKS cluster has a policy that does not allow Docker containers to run as the root user. As a result, the pod is scheduled but never starts. The image that fairing builds uses the root user by default.

What did you expect to happen:

The error should have been clearly displayed when executing the TrainJob.submit() command, rather than leaving the command waiting forever. Also, Kubeflow fairing commands (including TrainJob.submit()) need some setting through which we can specify a non-root user for the Docker image that fairing builds, pushes to the registry, and executes on AKS.
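As a rough illustration of the kind of setting requested here, the sketch below adds a pod-level `securityContext` with a non-zero `runAsUser` to a pod spec, which is the standard Kubernetes way to request a non-root UID. It works on a plain dict shaped like a pod manifest's `spec`. `with_non_root_user` and the default UID of 1000 are hypothetical, and this is not an existing Fairing option; whether the image and builder that Fairing produces actually work under a non-root UID is not confirmed here.

```python
import copy


def with_non_root_user(pod_spec, uid=1000):
    """Return a copy of pod_spec whose pod-level securityContext requests a
    non-root UID. The input dict is left unmodified."""
    if uid == 0:
        raise ValueError("uid 0 is root; choose a non-zero uid")
    patched = copy.deepcopy(pod_spec)
    security_context = patched.setdefault("securityContext", {})
    security_context["runAsUser"] = uid      # UID the containers run as
    security_context["runAsNonRoot"] = True  # kubelet refuses to start root containers
    return patched
```

A container-level `securityContext` takes precedence over the pod-level one, so a real implementation would also need to check each container entry. If something like this were wired into Fairing, it would presumably have to apply to the `fairing-builder` pod as well as the training pod, since the builder pod is the one stuck in the log above.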

Anything else you would like to add:

How can I run the fairing TrainJob.submit() command successfully if my cluster has a policy that does not allow Docker images that run as the root user?

Environment:

NOTE: If you are using fairing from master, please provide us the git commit hash.

issue-label-bot[bot] commented 4 years ago

Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.