kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Error "Objective metric accuracy is not found in training logs, unavailable value is reported. metric:<name:"accuracy" value:"unavailable" #2175

Closed mChowdhury-91 closed 1 year ago

mChowdhury-91 commented 1 year ago

/kind bug

What steps did you take and what happened: I have been trying to create a simple Katib Experiment with the sklearn iris dataset, but every trial fails with the error:

```
Objective metric accuracy is not found in training logs, unavailable value is reported. metric:<name:"accuracy" value:"unavailable"
```

Below is my code:

```python
import argparse
import os
import hypertune
import logging
import pandas as pd

# YOUR IMPORTS HERE
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--neighbors', type=int, default=3, help='value of k')
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--logger", type=str, choices=["standard", "hypertune"],
                        help="Logger", default="standard")
    args = parser.parse_args()

    if args.log_path == "" or args.logger == "hypertune":
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG)
    else:
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename=args.log_path)

    if args.logger == "hypertune" and args.log_path != "":
        os.environ['CLOUD_ML_HP_METRIC_FILE'] = args.log_path

    # For JSON logging
    hpt = hypertune.HyperTune()

    # LOAD DATA HERE
    iris_data = load_iris()
    iris_df = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
    iris_df['Iris type'] = iris_data['target']
    iris_df['Iris name'] = iris_df['Iris type'].apply(
        lambda x: 'setosa' if x == 0 else ('versicolor' if x == 1 else 'virginica'))

    def f(x):
        if x == 0:
            val = 'setosa'
        elif x == 1:
            val = 'versicolor'
        else:
            val = 'virginica'
        return val

    iris_df['test'] = iris_df['Iris type'].apply(f)
    iris_df.drop(['test'], axis=1, inplace=True)

    X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
    y = iris_df['Iris name']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=args.neighbors)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)

    logging.info("{{metricName: accuracy, metricValue: {:.4f}}}\n".format(accuracy))

    if args.logger == "hypertune":
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='accuracy',
            metric_value=accuracy)


if __name__ == '__main__':
    main()
```

Below is my yaml file:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-1
spec:
  parallelTrialCount: 1
  maxTrialCount: 2
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
```

What did you expect to happen: The metrics should have been collected and the trials should have succeeded.


andreyvelich commented 1 year ago

Hi @mChowdhury-91,

Since you use the default metrics collector, you should print your metrics in the following format:

```python
logging.info(f"accuracy={accuracy}")
```

You can find the default format regex for Metrics Collector here: https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector
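For example, a trial script targeting the default StdOut collector just needs to emit one `name=value` pair per line; a minimal sketch (the extra `loss` metric and the helper name are illustrative, not part of the original script):

```python
import logging

# Mirror the logging setup used in the training script above.
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO,
)


def report_metrics(accuracy: float, loss: float) -> None:
    # The default StdOut collector scans the container log for
    # "name=value" pairs, one metric per line; the names must match
    # the metric names declared in the Experiment objective.
    logging.info(f"accuracy={accuracy:.4f}")
    logging.info(f"loss={loss:.4f}")


report_metrics(accuracy=0.9736, loss=0.0821)
```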

If you want to use the Hypertune logging feature, which reports metrics as JSON, you need to use the following Metrics Collector Spec. cc @tenzen-y

tenzen-y commented 1 year ago

@andreyvelich Thanks for the ping. @mChowdhury-91 Also, you can see a sample using hypertune here: https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/examples/v1beta1/trial-images/pytorch-mnist/mnist.py#L85-L93

mChowdhury-91 commented 1 year ago

@andreyvelich Thanks for your reply. I tried the same, still no luck. Here's my updated code:

```python
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--neighbors', type=int, default=3, help='value of k')
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    args = parser.parse_args()

    # LOAD DATA HERE
    iris_data = load_iris()
    iris_df = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
    iris_df['Iris type'] = iris_data['target']
    iris_df['Iris name'] = iris_df['Iris type'].apply(
        lambda x: 'setosa' if x == 0 else ('versicolor' if x == 1 else 'virginica'))

    def f(x):
        if x == 0:
            val = 'setosa'
        elif x == 1:
            val = 'versicolor'
        else:
            val = 'virginica'
        return val

    iris_df['test'] = iris_df['Iris type'].apply(f)
    iris_df.drop(['test'], axis=1, inplace=True)

    X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
    y = iris_df['Iris name']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=args.neighbors)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)

    logging.info(f"accuracy={accuracy}")


if __name__ == '__main__':
    main()
```

The error is still the same: `metric:<name:"accuracy" value:"unavailable" > >`

tenzen-y commented 1 year ago

@mChowdhury-91 Can you share the fixed Experiment manifest?

mChowdhury-91 commented 1 year ago

@tenzen-y This is my yaml file:


```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-log
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
```

tenzen-y commented 1 year ago

@mChowdhury-91 Thanks. You need to modify the metricsCollectorSpec like this:

https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml#L14-L21

mChowdhury-91 commented 1 year ago

@tenzen-y I'm trying to log to the default StdOut metrics collector using `logging.info(f"accuracy={accuracy}")`, as suggested by @andreyvelich, not hypertune (log to JSON).

tenzen-y commented 1 year ago

> @tenzen-y I'm trying to log to the default StdOut metrics collector using `logging.info(f"accuracy={accuracy}")` as suggested by @andreyvelich, not hypertune (log to JSON).

I see. If you use the StdOut collector, you need to write a regexp that matches your log lines. https://regex101.com/ might be helpful.

mChowdhury-91 commented 1 year ago

@tenzen-y Do you have an example? What should I write in the regexp?

tenzen-y commented 1 year ago

> @tenzen-y Do you have an example? What should I write in the regexp?

Katib default filter is here: https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/pkg/metricscollector/v1beta1/common/const.go#L39-L47
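If you want to sanity-check your log lines locally before submitting an Experiment, you can run them through a `name=value` pattern. The regexp below is only an illustrative approximation in the spirit of that default filter, not a copy of the pattern in const.go:

```python
import re

# Illustrative "name=value" pattern; see const.go (linked above)
# for Katib's authoritative default filter.
METRIC_RE = re.compile(r"([\w|-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)")


def extract_metrics(log_line: str) -> dict:
    """Collect every name=value metric pair found in one log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_line)}


# A line shaped like the trial's stdout log.
print(extract_metrics("2023-07-20T17:13:38Z INFO accuracy=0.9736842105263158"))
```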

andreyvelich commented 1 year ago

I think your Experiment should work with the StdOut Metrics Collector, @mChowdhury-91. Can you please share the logs from one of your Trials?

mChowdhury-91 commented 1 year ago

@andreyvelich This is the log from the Trial Pod with `logging.info(f"accuracy={accuracy}")`:

```
I0720 11:20:33.136977 29 main.go:342] Trial Name: iris-log-8xnppxc5
I0720 11:20:36.148952 29 file-metricscollector.go:118] Objective metric accuracy is not found in training logs, unavailable value is reported
I0720 11:20:36.158568 29 main.go:399] Metrics reported. : metric_logs:<time_stamp:"0001-01-01T00:00:00Z" metric:<name:"accuracy" value:"unavailable" > >
```

tenzen-y commented 1 year ago

> @andreyvelich This is the log from the Trial Pod with `logging.info(f"accuracy={accuracy}")`:
>
> ```
> I0720 11:20:33.136977 29 main.go:342] Trial Name: iris-log-8xnppxc5
> I0720 11:20:36.148952 29 file-metricscollector.go:118] Objective metric accuracy is not found in training logs, unavailable value is reported
> I0720 11:20:36.158568 29 main.go:399] Metrics reported. : metric_logs:<time_stamp:"0001-01-01T00:00:00Z" metric:<name:"accuracy" value:"unavailable" > >
> ```

Can you get the logs from the pod with the `--all-containers` option?

andreyvelich commented 1 year ago

Maybe your logs are written to the log_path? Can you try running your code locally with some test parameters to check whether the logs are printed to stdout?

mChowdhury-91 commented 1 year ago

> Maybe your logs are written to the log_path? Can you just try to run your code locally with some test parameters to check if logs are printed to the stdout?

Output when I run `python3 iris.py` locally:

```
2023-07-20T17:13:38Z INFO accuracy=0.9736842105263158
```

andreyvelich commented 1 year ago

@mChowdhury-91 Can you please describe your Trial pod for me?

```shell
kubectl get pod <trial-pod-name> -n kubeflow -o yaml
```

mChowdhury-91 commented 1 year ago

```shell
kubectl get pod -n kubeflow -o yaml
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 11204fb3ae5fd95859bbd9e159480fd52d5da787f6ad4088b994cdd0ea5a8ce7
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
    sidecar.istio.io/inject: "false"
  creationTimestamp: "2023-07-20T11:20:31Z"
  generateName: iris-log-8xnppxc5-
  labels:
    controller-uid: a7f8ff3e-1013-485c-bf23-ef0caa0cb765
    job-name: iris-log-8xnppxc5
  name: iris-log-8xnppxc5-ddbkr
  namespace: mlp-profile
  ownerReferences:
```

andreyvelich commented 1 year ago

I think in your updated code you forgot to set up the logging config: https://github.com/kubeflow/katib/issues/2175#issuecomment-1643374502. That is why you don't see logs inside your K8s container (locally that is not required). Try adding the following to your training script:

```python
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO,
)
```

If it still doesn't work, try using `python3 -u /app/iris_log.py` as your start command.
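The failure mode is easy to reproduce locally: without any logging configuration, Python's root logger sits at WARNING, so `logging.info(...)` produces no output at all and the metrics collector has nothing to parse. A small sketch of that behaviour (the logger names here are illustrative):

```python
import io
import logging


def captured_info_lines(configure: bool) -> list:
    """Log one INFO metric line and return what actually got emitted."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo-{configure}")
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    # Without explicit configuration the effective level is inherited
    # from the root logger (WARNING), so INFO records are dropped
    # before any handler ever sees them.
    if configure:
        logger.setLevel(logging.INFO)
    logger.info("accuracy=0.9736")
    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


print(captured_info_lines(configure=False))  # []
print(captured_info_lines(configure=True))   # ['accuracy=0.9736']
```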

mChowdhury-91 commented 1 year ago

@andreyvelich @tenzen-y Thanks, it works now. All the trial runs completed successfully and the Experiment shows as succeeded in the Katib UI. But in the cluster, the main pod (suggestion container) stays in the Running status and is never killed. I have seen this behaviour in the past for the tf-mnist and pytorch-mnist examples as well. Any comment on that?

These are the logs from the suggestion container:

```
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 1 new Trial
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 1 new Trial
```

andreyvelich commented 1 year ago

That is the correct behaviour, since you use Katib version 0.12. In that version the default is ResumePolicy=LongRunning, which allows you to restart your Experiment at any time by changing the maxTrialCount parameter; to support that, the Suggestion pod is kept running indefinitely.

In recent releases, we use ResumePolicy=Never as the default resume policy, which doesn't allow you to restart an Experiment but does clean up the Suggestion pod.

You can learn more about it in this doc: https://www.kubeflow.org/docs/components/katib/resume-experiment/#resume-succeeded-experiment
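If you stay on 0.12, the policy can also be set explicitly in the Experiment spec; a sketch showing only the relevant field, with the rest of your spec unchanged:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-log
spec:
  # Never = don't keep the Suggestion pod around after the Experiment
  # finishes, at the cost of not being able to resume it later.
  resumePolicy: Never
```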

mChowdhury-91 commented 1 year ago

@andreyvelich So we only need to update the katib-controller, right? Which is the latest version? Also, in Katib 0.12, can we explicitly set ResumePolicy=Never so that we don't need to update?

andreyvelich commented 1 year ago

> Also, in Katib 0.12, can we explicitly set ResumePolicy=Never so that we don't need to update?

Yeah, that also should work.

> So we only need to update the katib-controller, right? Which is the latest version?

The latest version is Katib 0.15: https://github.com/kubeflow/katib/releases/tag/v0.15.0. I would suggest updating all components together, since there might be changes in other components too.

If you use Katib without Kubeflow installation, you can update it as follows:

```shell
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.15.0"
```

andreyvelich commented 1 year ago

@mChowdhury-91 Feel free to re-open this issue if you have any followup questions.