kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Error "Objective metric accuracy is not found in training logs, unavailable value is reported. metric:<name:"accuracy" value:"unavailable" #2175

Closed mChowdhury-91 closed 1 year ago

mChowdhury-91 commented 1 year ago

/kind bug

What steps did you take and what happened: I have been trying to create a simple Katib Experiment with the sklearn iris dataset, but every trial fails with the error:

```
Objective metric accuracy is not found in training logs, unavailable value is reported. metric:<name:"accuracy" value:"unavailable"
```

Below is my code:

```python
import argparse
import os
import hypertune
import logging
import pandas as pd

# YOUR IMPORTS HERE
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--neighbors', type=int, default=3, help='value of k')
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--logger", type=str, choices=["standard", "hypertune"],
                        help="Logger", default="standard")
    args = parser.parse_args()

    if args.log_path == "" or args.logger == "hypertune":
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG)
    else:
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename=args.log_path)

    if args.logger == "hypertune" and args.log_path != "":
        os.environ['CLOUD_ML_HP_METRIC_FILE'] = args.log_path

    # For JSON logging
    hpt = hypertune.HyperTune()

    # LOAD DATA HERE
    iris_data = load_iris()
    iris_df = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
    iris_df['Iris type'] = iris_data['target']
    iris_df['Iris name'] = iris_df['Iris type'].apply(
        lambda x: 'setosa' if x == 0 else ('versicolor' if x == 1 else 'virginica'))

    def f(x):
        if x == 0:
            val = 'setosa'
        elif x == 1:
            val = 'versicolor'
        else:
            val = 'virginica'
        return val

    iris_df['test'] = iris_df['Iris type'].apply(f)
    iris_df.drop(['test'], axis=1, inplace=True)

    X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
    y = iris_df['Iris name']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=args.neighbors)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)

    logging.info("{{metricName: accuracy, metricValue: {:.4f}}}\n".format(accuracy))

    if args.logger == "hypertune":
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='accuracy',
            metric_value=accuracy)


if __name__ == '__main__':
    main()
```

Below is my yaml file:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-1
spec:
  parallelTrialCount: 1
  maxTrialCount: 2
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
```

What did you expect to happen: The metrics should have been collected and the trials should have succeeded.


andreyvelich commented 1 year ago

Hi @mChowdhury-91,

Since you use the default metrics collector, you should print your metrics in the following format:

```python
logging.info(f"accuracy={accuracy}")
```

You can find the default format regex for Metrics Collector here: https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector
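For example, a trial script targeting the default StdOut collector just needs to emit one `name=value` pair per line; a minimal sketch (the extra `loss` metric and the helper name are illustrative, not part of the original script):

```python
import logging

# Mirror the logging setup used in the training script above.
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO,
)


def report_metrics(accuracy: float, loss: float) -> None:
    # The default StdOut collector scans the container log for
    # "name=value" pairs, one metric per line; the names must match
    # the metric names declared in the Experiment objective.
    logging.info(f"accuracy={accuracy:.4f}")
    logging.info(f"loss={loss:.4f}")


report_metrics(accuracy=0.9736, loss=0.0821)
```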

If you want to use the Hypertune logging feature, which reports metrics as JSON, you need to use the following Metrics Collector Spec. cc @tenzen-y

tenzen-y commented 1 year ago

@andreyvelich Thanks for the ping. @mChowdhury-91 Also, you can see a sample using hypertune here: https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/examples/v1beta1/trial-images/pytorch-mnist/mnist.py#L85-L93

mChowdhury-91 commented 1 year ago

@andreyvelich Thanks for your reply. I tried the same, still no luck. Here's my updated code:

```python
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--neighbors', type=int, default=3, help='value of k')
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    args = parser.parse_args()

    # LOAD DATA HERE
    iris_data = load_iris()
    iris_df = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
    iris_df['Iris type'] = iris_data['target']
    iris_df['Iris name'] = iris_df['Iris type'].apply(
        lambda x: 'setosa' if x == 0 else ('versicolor' if x == 1 else 'virginica'))

    def f(x):
        if x == 0:
            val = 'setosa'
        elif x == 1:
            val = 'versicolor'
        else:
            val = 'virginica'
        return val

    iris_df['test'] = iris_df['Iris type'].apply(f)
    iris_df.drop(['test'], axis=1, inplace=True)

    X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
    y = iris_df['Iris name']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=args.neighbors)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)

    logging.info(f"accuracy={accuracy}")


if __name__ == '__main__':
    main()
```

The error is still the same: `metric:<name:"accuracy" value:"unavailable" > >`

tenzen-y commented 1 year ago

@mChowdhury-91 Can you share the fixed Experiment manifest?

mChowdhury-91 commented 1 year ago

@tenzen-y This is my yaml file:


```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-log
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
```

tenzen-y commented 1 year ago

@mChowdhury-91 Thanks. You need to modify the metricsCollectorSpec like this:

https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml#L14-L21

mChowdhury-91 commented 1 year ago

@tenzen-y I'm trying to log to the default StdOut metrics collector using `logging.info(f"accuracy={accuracy}")`, as suggested by @andreyvelich, not hypertune (log to JSON).

tenzen-y commented 1 year ago

> @tenzen-y I'm trying to log to the default StdOut metrics collector using `logging.info(f"accuracy={accuracy}")` as suggested by @andreyvelich, not hypertune (log to JSON).

I see. If you use the StdOut collector, you need to write a regexp that matches your log lines. https://regex101.com/ might be helpful.

mChowdhury-91 commented 1 year ago

@tenzen-y Do you have an example? What should I write in the regexp?

tenzen-y commented 1 year ago

> @tenzen-y Do you have an example? What should I write in the regexp?

Katib default filter is here: https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/pkg/metricscollector/v1beta1/common/const.go#L39-L47
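If you want to sanity-check your log lines locally before submitting an Experiment, you can run them through a `name=value` pattern. The regexp below is only an illustrative approximation in the spirit of that default filter, not a copy of the pattern in const.go:

```python
import re

# Illustrative "name=value" pattern; see const.go (linked above)
# for Katib's authoritative default filter.
METRIC_RE = re.compile(r"([\w|-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)")


def extract_metrics(log_line: str) -> dict:
    """Collect every name=value metric pair found in one log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_line)}


# A line shaped like the trial's stdout log.
print(extract_metrics("2023-07-20T17:13:38Z INFO accuracy=0.9736842105263158"))
```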

andreyvelich commented 1 year ago

I think your Experiment should work with the StdOut Metrics Collector, @mChowdhury-91. Can you please share the logs from one of your Trials?

mChowdhury-91 commented 1 year ago

@andreyvelich This is the log from the Trial Pod with `logging.info(f"accuracy={accuracy}")`:

```
I0720 11:20:33.136977 29 main.go:342] Trial Name: iris-log-8xnppxc5
I0720 11:20:36.148952 29 file-metricscollector.go:118] Objective metric accuracy is not found in training logs, unavailable value is reported
I0720 11:20:36.158568 29 main.go:399] Metrics reported. : metric_logs:<time_stamp:"0001-01-01T00:00:00Z" metric:<name:"accuracy" value:"unavailable" > >
```

tenzen-y commented 1 year ago

> @andreyvelich This is the log from the Trial Pod with `logging.info(f"accuracy={accuracy}")`:
>
> ```
> I0720 11:20:33.136977 29 main.go:342] Trial Name: iris-log-8xnppxc5
> I0720 11:20:36.148952 29 file-metricscollector.go:118] Objective metric accuracy is not found in training logs, unavailable value is reported
> I0720 11:20:36.158568 29 main.go:399] Metrics reported. : metric_logs:<time_stamp:"0001-01-01T00:00:00Z" metric:<name:"accuracy" value:"unavailable" > >
> ```

Can you get the logs from the pod with the `--all-containers` option?

andreyvelich commented 1 year ago

Maybe your logs are written to the log_path? Can you try running your code locally with some test parameters to check whether the logs are printed to stdout?

mChowdhury-91 commented 1 year ago

> Maybe your logs are written to the log_path? Can you just try to run your code locally with some test parameters to check if logs are printed to the stdout?

Output when I run `python3 iris.py` locally:

```
2023-07-20T17:13:38Z INFO accuracy=0.9736842105263158
```

andreyvelich commented 1 year ago

@mChowdhury-91 Can you please describe your Trial pod for me?

```shell
kubectl get pod <trial-pod-name> -n kubeflow -o yaml
```

mChowdhury-91 commented 1 year ago

```shell
kubectl get pod -n kubeflow -o yaml
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 11204fb3ae5fd95859bbd9e159480fd52d5da787f6ad4088b994cdd0ea5a8ce7
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
    sidecar.istio.io/inject: "false"
  creationTimestamp: "2023-07-20T11:20:31Z"
  generateName: iris-log-8xnppxc5-
  labels:
    controller-uid: a7f8ff3e-1013-485c-bf23-ef0caa0cb765
    job-name: iris-log-8xnppxc5
  name: iris-log-8xnppxc5-ddbkr
  namespace: mlp-profile
  ownerReferences:
```

andreyvelich commented 1 year ago

I think in your updated code you forgot to set up the logging config: https://github.com/kubeflow/katib/issues/2175#issuecomment-1643374502. That is why you don't see logs inside your K8s container (locally that is not required). Try adding the following to your training script:

```python
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO,
)
```

If it still doesn't work, try using `python3 -u /app/iris_log.py` as your start command.
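The failure mode is easy to reproduce locally: without any logging configuration, Python's root logger sits at WARNING, so `logging.info(...)` produces no output at all and the metrics collector has nothing to parse. A small sketch of that behaviour (the logger names here are illustrative):

```python
import io
import logging


def captured_info_lines(configure: bool) -> list:
    """Log one INFO metric line and return what actually got emitted."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo-{configure}")
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    # Without explicit configuration the effective level is inherited
    # from the root logger (WARNING), so INFO records are dropped
    # before any handler ever sees them.
    if configure:
        logger.setLevel(logging.INFO)
    logger.info("accuracy=0.9736")
    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


print(captured_info_lines(configure=False))  # []
print(captured_info_lines(configure=True))   # ['accuracy=0.9736']
```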

mChowdhury-91 commented 1 year ago

@andreyvelich @tenzen-y Thanks, it works now. All the trial runs completed successfully and the Experiment shows as succeeded in the Katib UI. But in the cluster, the main pod (suggestion container) stays in the Running status and is never killed. I have seen this behaviour in the past for the tf-mnist and pytorch-mnist examples as well. Any comment on that?

These are the logs from the suggestion container:

```
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 1 new Trial
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 1 new Trial
```

andreyvelich commented 1 year ago

That is the correct behaviour, since you use Katib version 0.12. In that version the default is ResumePolicy=LongRunning, which allows you to restart your Experiment at any time by changing the maxTrialCount parameter; to support that, the Suggestion pod is kept running indefinitely.

In recent releases, we use ResumePolicy=Never as the default resume policy, which doesn't allow you to restart an Experiment but does clean up the Suggestion pod.

You can learn more about it in this doc: https://www.kubeflow.org/docs/components/katib/resume-experiment/#resume-succeeded-experiment
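If you stay on 0.12, the policy can also be set explicitly in the Experiment spec; a sketch showing only the relevant field, with the rest of your spec unchanged:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-log
spec:
  # Never = don't keep the Suggestion pod around after the Experiment
  # finishes, at the cost of not being able to resume it later.
  resumePolicy: Never
```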

mChowdhury-91 commented 1 year ago

@andreyvelich So we only need to update the katib-controller, right? Which is the latest version? Also, in Katib 0.12, can we explicitly set ResumePolicy=Never so that we don't need to update?

andreyvelich commented 1 year ago

> Also, in Katib 0.12, can we explicitly set ResumePolicy=Never so that we don't need to update?

Yeah, that also should work.

> So we only need to update the katib-controller, right? Which is the latest version?

The latest version is Katib 0.15: https://github.com/kubeflow/katib/releases/tag/v0.15.0. I would suggest updating all components together, since there might be changes in other components too.

If you use Katib without Kubeflow installation, you can update it as follows:

```shell
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.15.0"
```

andreyvelich commented 1 year ago

@mChowdhury-91 Feel free to re-open this issue if you have any followup questions.