Closed farisfirenze closed 1 year ago
The problem is with the image that you have created. It is not with Katib. Did you use GPU drivers in the image?
I am able to execute "nvidia-smi" in the image and get the correct output. For this to happen, shouldn't the drivers be installed in the image? Just to be sure, can you provide me with details on how to use GPU drivers in the image?
You can use Nvidia NGC containers based on your framework https://catalog.ngc.nvidia.com/containers
I have tried using the Nvidia NGC containers, as shown in the Dockerfile below:
FROM nvcr.io/nvidia/tensorflow:22.06-tf2-py3
RUN mkdir -p /opt/trainer
RUN pip show tensorflow
RUN pip install pandas
RUN pip install scikit-learn
RUN pip install google-cloud-storage
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"
COPY *.py /opt/trainer/
ENTRYPOINT ["python", "/opt/trainer/task.py"]
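As a sanity check beyond nvidia-smi, something like the following could be run inside the built image to confirm that TensorFlow itself sees the GPU (a minimal sketch; visible_gpus is just an illustrative helper name, not part of the image above):

```python
# Hypothetical sanity check to run inside the built image: confirms that
# TensorFlow can see the GPU, which is a stronger test than nvidia-smi alone.
def visible_gpus():
    """Return the GPU devices TensorFlow can see ([] if TF is absent)."""
    try:
        import tensorflow as tf  # provided by the NGC base image
    except ImportError:
        return []
    return tf.config.list_physical_devices("GPU")

if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

If this prints an empty list while nvidia-smi works, the problem is usually scheduling (no GPU allocated to the pod) rather than the image itself.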
PS: I have pulled the same image into both containers of my pipeline, but I am still getting this problem.
Also, I have a question. I am setting the GPU limit on my pipeline component using .set_gpu_limit(1), as shown below.
hp_tune = dsl.ContainerOp(
    name='hp-tune-katib',
    image=hyper_image_uri,
    command=["python3", "/hp_tune/task.py"],
    arguments=[
        '--experiment_name', experiment_name,
        '--experiment_namespace', experiment_namespace,
        '--experiment_timeout_minutes', experiment_timeout_minutes,
        '--delete_after_done', True,
        '--hyper_image_uri', hyper_image_uri_train,
        '--time_loc', time_loc,
        '--model_uri', model_uri
    ],
    file_outputs={'best-params': '/output.txt'}
).set_gpu_limit(1)
and the ARGO_CONTAINER is showing nvidia.com/gpu: 1.
So my question is: do I need to specify a GPU request in my Katib trial spec as well, like below?
trial_spec = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "metadata": {
                        "annotations": {
                            "sidecar.istio.io/inject": "false"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": args.hyper_image_uri,
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--model_uri=" + args.model_uri,
                                    "--batch_size=${trialParameters.batchSize}",
                                    "--learning_rate=${trialParameters.learningRate}"
                                ],
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "metadata": {
                        "annotations": {
                            "sidecar.istio.io/inject": "false"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": args.hyper_image_uri,
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--model_uri=" + args.model_uri,
                                    "--batch_size=${trialParameters.batchSize}",
                                    "--learning_rate=${trialParameters.learningRate}"
                                ],
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ],
                                "resources": {
                                    "limits": {
                                        "nvidia.com/gpu": 1
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}
Also, I kindly request you to help me solve this GPU usage problem.
I haven't tried a GPU limit with Pipelines.
The easiest way is to check the experiment YAML using kubectl. The trial spec needs a GPU limit if the trial pod needs to access a GPU.
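If the trial spec is built as a Python dict, patching in the GPU limit can be done programmatically. A minimal sketch (add_gpu_limit is an illustrative helper name, not a Katib API):

```python
# Sketch: add an nvidia.com/gpu limit to every container of a given TFJob
# replica in a trial-spec dict, mirroring the "resources.limits" block above.
def add_gpu_limit(trial_spec, replica="Worker", count=1):
    """Add an nvidia.com/gpu limit to each container of the given replica."""
    containers = (trial_spec["spec"]["tfReplicaSpecs"][replica]
                  ["template"]["spec"]["containers"])
    for c in containers:
        limits = c.setdefault("resources", {}).setdefault("limits", {})
        limits["nvidia.com/gpu"] = count
    return trial_spec

# Skeleton spec for illustration (real specs carry image, command, etc.).
spec = {"spec": {"tfReplicaSpecs": {"Worker": {"template": {"spec": {
    "containers": [{"name": "tensorflow"}]}}}}}}
add_gpu_limit(spec)
```

After the call, the worker container carries {"nvidia.com/gpu": 1} under resources.limits, which is what the scheduler uses to place the trial pod on a GPU node.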
This is what happens when I specify GPU request in the trial spec but not in the pipeline component.
This step is in Pending state with this message: Unschedulable: 0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate.
This is my kubectl describe node
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia$ kubectl describe node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Name: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-highmem-8
beta.kubernetes.io/os=linux
cloud.google.com/gke-accelerator=nvidia-tesla-k80
cloud.google.com/gke-boot-disk=pd-standard
cloud.google.com/gke-container-runtime=containerd
cloud.google.com/gke-cpu-scaling-level=8
cloud.google.com/gke-max-pods-per-node=110
cloud.google.com/gke-nodepool=gpu-pool1
cloud.google.com/gke-os-distribution=cos
cloud.google.com/machine-family=n1
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-a
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
kubernetes.io/os=linux
node.kubernetes.io/instance-type=n1-highmem-8
topology.gke.io/zone=us-central1-a
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-a
Annotations: container.googleapis.com/instance_id: 609271750101604849
csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/prj-vertex-ai/zones/us-central1-a/instances/gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j"}
node.alpha.kubernetes.io/ttl: 0
node.gke.io/last-applied-node-labels:
cloud.google.com/gke-accelerator=nvidia-tesla-k80,cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-container-runtime=contai...
node.gke.io/last-applied-node-taints: nvidia.com/gpu=present:NoSchedule
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 15 Jul 2022 08:37:52 +0000
Taints: nvidia.com/gpu=present:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
AcquireTime: <unset>
RenewTime: Fri, 15 Jul 2022 08:52:28 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
CorruptDockerOverlay2 False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentContainerdRestart containerd is functioning properly
KernelDeadlock False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Fri, 15 Jul 2022 08:37:52 +0000 Fri, 15 Jul 2022 08:37:52 +0000 RouteCreated NodeController create implicit route
MemoryPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:52 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.14
ExternalIP: 34.171.4.196
InternalDNS: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j.us-central1-a.c.prj-vertex-ai.internal
Hostname: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j.us-central1-a.c.prj-vertex-ai.internal
Capacity:
attachable-volumes-gce-pd: 127
cpu: 8
ephemeral-storage: 98868448Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 53477620Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 7910m
ephemeral-storage: 47093746742
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 48425204Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 27109359572b62f3c535daadb9e9c398
System UUID: 27109359-572b-62f3-c535-daadb9e9c398
Boot ID: cb1e0e37-2556-4f81-b0a8-b93a5105f484
Kernel Version: 5.10.90+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.4
Kubelet Version: v1.22.8-gke.202
Kube-Proxy Version: v1.22.8-gke.202
PodCIDR: 10.8.1.0/24
PodCIDRs: 10.8.1.0/24
ProviderID: gce://prj-vertex-ai/us-central1-a/gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentbit-gke-kjmds 100m (1%) 0 (0%) 200Mi (0%) 500Mi (1%) 14m
kube-system gke-metrics-agent-zqm94 3m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 14m
kube-system kube-proxy-gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j 100m (1%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system nvidia-driver-installer-hw2lx 150m (1%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system nvidia-gpu-device-plugin-ln587 50m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 14m
kube-system pdcsi-node-2nlmc 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 14m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 413m (5%) 0 (0%)
memory 320Mi (0%) 700Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-gce-pd 0 0
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 14m kube-proxy
Normal Starting 14m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 14m kubelet Updated Node Allocatable limit across pods
Warning InvalidDiskCapacity 14m kubelet invalid capacity 0 on image filesystem
Normal NodeReady 14m kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeReady
Warning ContainerdStart 14m (x2 over 14m) systemd-monitor Starting containerd container runtime...
Warning DockerStart 14m (x3 over 14m) systemd-monitor Starting Docker Application Container Engine...
Warning KubeletStart 14m systemd-monitor Started Kubernetes kubelet.
Any idea how I can add a toleration for this taint and make the pod allocate a GPU?
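For what it's worth, GKE typically adds this toleration automatically for pods that request nvidia.com/gpu in resources.limits. If it does need to be added by hand in a dict-based spec, a minimal sketch (tolerate_gpu_taint is an illustrative name):

```python
# Sketch: append a toleration for the nvidia.com/gpu=present:NoSchedule taint
# to a pod-spec dict, so the trial pod can land on the tainted GPU node.
def tolerate_gpu_taint(pod_spec):
    """Add a toleration matching the nvidia.com/gpu:NoSchedule taint."""
    pod_spec.setdefault("tolerations", []).append({
        "key": "nvidia.com/gpu",
        "operator": "Exists",
        "effect": "NoSchedule",
    })
    return pod_spec

# Skeleton pod spec for illustration.
pod_spec = {"containers": [{"name": "tensorflow"}]}
tolerate_gpu_taint(pod_spec)
```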
This is my pod yaml
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia/hp_tune$ kubectl get pod dbpedia-exp-8-g4pvh4fc-worker-0 -o yaml -n kubeflow
apiVersion: v1
items:
- apiVersion: v1
kind: Pod
metadata:
annotations:
sidecar.istio.io/inject: "false"
creationTimestamp: "2022-07-15T09:57:26Z"
labels:
group-name: kubeflow.org
job-name: dbpedia-exp-8-g4pvh4fc
replica-index: "0"
replica-type: worker
training.kubeflow.org/job-name: dbpedia-exp-8-g4pvh4fc
training.kubeflow.org/job-role: master
training.kubeflow.org/operator-name: tfjob-controller
training.kubeflow.org/replica-index: "0"
training.kubeflow.org/replica-type: worker
name: dbpedia-exp-8-g4pvh4fc-worker-0
namespace: kubeflow
ownerReferences:
- apiVersion: kubeflow.org/v1
blockOwnerDeletion: true
controller: true
kind: TFJob
name: dbpedia-exp-8-g4pvh4fc
uid: 7401591a-e7f3-4036-823e-b63437fed795
resourceVersion: "39305"
uid: 5b974f29-4379-41ff-90dd-b51c6d04d189
spec:
containers:
- args:
- python /opt/trainer/task.py --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
--batch_size=32 --learning_rate=0.004570666890885507 1>/var/log/katib/metrics.log
2>&1 && echo completed > /var/log/katib/$$$$.pid
command:
- sh
- -c
env:
- name: TF_CONFIG
value: '{"cluster":{"ps":["dbpedia-exp-8-g4pvh4fc-ps-0.kubeflow.svc:2222"],"worker":["dbpedia-exp-8-g4pvh4fc-worker-0.kubeflow.svc:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}'
image: gcr.io/........./hptunekatib:v14
imagePullPolicy: IfNotPresent
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xvtgc
readOnly: true
- mountPath: /var/log/katib
name: metrics-volume
- args:
- -t
- dbpedia-exp-8-g4pvh4fc
- -m
- accuracy
- -o-type
- maximize
- -s-db
- katib-db-manager.kubeflow:6789
- -path
- /var/log/katib/metrics.log
image: docker.io/kubeflowkatib/file-metrics-collector:v0.13.0
imagePullPolicy: IfNotPresent
name: metrics-logger-and-collector
resources:
limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/log/katib
name: metrics-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xvtgc
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
shareProcessNamespace: true
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: example-key
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- name: kube-api-access-xvtgc
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
- emptyDir: {}
name: metrics-volume
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-07-15T09:57:26Z"
message: '0/2 nodes are available: 2 Insufficient nvidia.com/gpu.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
and this is my katib experiment yaml
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia/hp_tune$ kubectl get experiment dbpedia-exp-8 -o yaml -n kubeflow
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
creationTimestamp: "2022-07-15T09:57:05Z"
finalizers:
- update-prometheus-metrics
generation: 1
name: dbpedia-exp-8
namespace: kubeflow
resourceVersion: "39293"
uid: ded49060-e00e-4b57-8fd1-f40af2ec162e
spec:
algorithm:
algorithmName: random
maxFailedTrialCount: 2
maxTrialCount: 2
metricsCollectorSpec:
collector:
kind: StdOut
objective:
metricStrategies:
- name: accuracy
value: max
objectiveMetricName: accuracy
type: maximize
parallelTrialCount: 1
parameters:
- feasibleSpace:
list:
- "32"
- "42"
- "52"
- "62"
- "64"
name: batch_size
parameterType: discrete
- feasibleSpace:
max: "0.005"
min: "0.001"
name: learning_rate
parameterType: double
resumePolicy: LongRunning
trialTemplate:
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
primaryContainerName: tensorflow
primaryPodLabels:
training.kubeflow.org/job-role: master
successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
trialParameters:
- description: batch size
name: batchSize
reference: batch_size
- description: Learning rate
name: learningRate
reference: learning_rate
trialSpec:
apiVersion: kubeflow.org/v1
kind: TFJob
spec:
tfReplicaSpecs:
PS:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python
- /opt/trainer/task.py
- --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
- --batch_size=${trialParameters.batchSize}
- --learning_rate=${trialParameters.learningRate}
image: gcr.io/............/hptunekatib:v14
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
Worker:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python
- /opt/trainer/task.py
- --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
- --batch_size=${trialParameters.batchSize}
- --learning_rate=${trialParameters.learningRate}
image: gcr.io/........./hptunekatib:v14
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources:
limits:
nvidia.com/gpu: 1
tolerations:
- effect: NoSchedule
key: example-key
operator: Exists
status:
conditions:
- lastTransitionTime: "2022-07-15T09:57:05Z"
lastUpdateTime: "2022-07-15T09:57:05Z"
message: Experiment is created
reason: ExperimentCreated
status: "True"
type: Created
- lastTransitionTime: "2022-07-15T09:57:26Z"
lastUpdateTime: "2022-07-15T09:57:26Z"
message: Experiment is running
reason: ExperimentRunning
status: "True"
type: Running
currentOptimalTrial:
observation: {}
runningTrialList:
- dbpedia-exp-8-g4pvh4fc
startTime: "2022-07-15T09:57:05Z"
trials: 1
trialsRunning: 1
Even though it shows Running, it will time out eventually.
What am I missing here?
This is not specific to Katib. It means that the trial could not find a node which satisfies these resource requirements to start the pod. One thing to note: when you add resource requirements to the trial spec, every trial pod will request the same set of resources when run in parallel. E.g., if the trialSpec has a 1 GPU requirement and the experimentSpec allows 3 parallelTrials, then each trial pod will request 1 GPU (a total of 3 GPUs).
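The arithmetic above can be sketched as a quick capacity check (illustrative values taken from the example, not from your cluster):

```python
# Peak GPU demand is gpus-per-trial times parallelTrialCount: all parallel
# trials must be schedulable at the same time for the experiment to progress.
gpus_per_trial = 1
parallel_trial_count = 3
peak_gpus = gpus_per_trial * parallel_trial_count
print(peak_gpus)  # 3 GPUs must be allocatable across the cluster
```

With a single-GPU node pool like the one described above, parallelTrialCount effectively has to stay at 1 (or trials will queue, pending).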
Here is the gist of my working sample; you can ignore the node-selector stuff, it just helps to schedule the pod on the GPU node I want (dedicated for training in my case):
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "affinity": {
                    "nodeAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": {
                            "nodeSelectorTerms": [
                                {
                                    "matchExpressions": [
                                        {
                                            "key": "k8s.amazonaws.com/accelerator",
                                            "operator": "In",
                                            "values": ["nvidia-tesla-v100"]
                                        },
                                        {
                                            "key": "ai-gpu-2",
                                            "operator": "In",
                                            "values": ["true"]
                                        }
                                    ]
                                }
                            ]
                        }
                    }
                },
                "containers": [
                    {
                        "resources": {
                            "limits": {
                                "nvidia.com/gpu": 1
                            }
                        },
                        "name": training_container_name,
                        "image": "xxxxxxxxxxxxxxxxxxxxx__YOUR_IMAGE_HERE_xxxxxxxxxxxxxx",
                        "imagePullPolicy": "Always",
                        "command": train_params + [
                            "--learning_rate=${trialParameters.learning_rate}",
                            "--optimizer=${trialParameters.optimizer}",
                            "--batch_size=${trialParameters.batch_size}",
                            "--max_epochs=${trialParameters.max_epochs}"
                        ]
                    }
                ],
                "restartPolicy": "Never",
                "serviceAccountName": "default-editor"
            }
        }
    }
}
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Feel free to re-open an issue if you have any followup problems.
/kind bug
What steps did you take and what happened: I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a text classification model in TensorFlow using Katib on GKE clusters. I created a cluster using the below commands
I then created a Kubeflow pipeline:
These are my two containers.
gcr.io/.............../hptunekatibclient:v7
Dockerfile
gcr.io/.............../hptunekatib:v7
Dockerfile
The pipeline runs, but it does not use the GPU, and this piece of code
gives an empty list and an empty string. It is as if the GPU does not exist. I am attaching the logs of the container.
What did you expect to happen:
I expected the pipeline stage to use the GPU and run the text classification on the GPU, but it doesn't.
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
Kubernetes version (kubectl version): 1.22.8-gke.202
OS (uname -a): Linux / COS in containers
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍