Closed mChowdhury-91 closed 1 year ago
Hi @mChowdhury-91,
Since you use the default metrics collector, you should print your metrics in the following format:

```python
logging.info(f"accuracy={accuracy}")
```
You can find the default format regex for Metrics Collector here: https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector
If you want to use the Hypertune logging feature, which reports metrics as JSON, you need to use the following Metrics Collector Spec. cc @tenzen-y
@andreyvelich Thanks for the ping. @mChowdhury-91 Also, you can see a sample using hypertune here: https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/examples/v1beta1/trial-images/pytorch-mnist/mnist.py#L85-L93
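For intuition, the JSON route amounts to emitting one JSON object per metric observation. Below is a stdlib-only sketch of that idea; the key names (`timestamp`, `global_step`) are illustrative placeholders, not necessarily the exact schema Katib's JSON collector parses, so follow the linked mnist.py example for real usage:

```python
import json
import time

def metric_json_line(name, value, step=0):
    # Serialize one metric observation as a single JSON log line.
    # Key names are placeholders -- check the linked Katib example
    # for the schema the JSON metrics collector actually expects.
    record = {
        "timestamp": time.time(),
        "global_step": str(step),
        name: str(value),
    }
    return json.dumps(record)

print(metric_json_line("accuracy", 0.97))
```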
@andreyvelich Thanks for your reply. I tried the same, still no luck. Here's my updated code:
```python
import argparse
import logging

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--neighbors', type=int, default=3, help='value of k')
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    args = parser.parse_args()

    # LOAD DATA HERE
    iris_data = load_iris()
    iris_df = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
    iris_df['Iris type'] = iris_data['target']
    iris_df['Iris name'] = iris_df['Iris type'].apply(
        lambda x: 'setosa' if x == 0 else ('versicolor' if x == 1 else 'virginica'))

    def f(x):
        if x == 0:
            val = 'setosa'
        elif x == 1:
            val = 'versicolor'
        else:
            val = 'virginica'
        return val

    iris_df['test'] = iris_df['Iris type'].apply(f)
    iris_df.drop(['test'], axis=1, inplace=True)

    X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
    y = iris_df['Iris name']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=args.neighbors)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    logging.info(f"accuracy={accuracy}")


if __name__ == '__main__':
    main()
```
The error is still the same: `metric:<name:"accuracy" value:"unavailable" > >`
@mChowdhury-91 Can you share the fixed Experiment manifest?
@tenzen-y This is my yaml file:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-log
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
```
@mChowdhury-91 Thanks. You need to modify the `metricsCollectorSpec` like this:
@tenzen-y I'm trying to log to the default StdOut metrics collector using `logging.info(f"accuracy={accuracy}")`, as suggested by @andreyvelich, not hypertune (log to JSON).
I see. If you use the StdOut collector, you need to create a regexp to express the logs. https://regex101.com/ might be helpful.
@tenzen-y Do you have any example? What should I write in the regexp?
Katib default filter is here: https://github.com/kubeflow/katib/blob/89bd21f710fb4cd153a33c94e8892ec079cf63c8/pkg/metricscollector/v1beta1/common/const.go#L39-L47
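For a quick local sanity check, you can run a log line through a regex of the same shape as that default filter. The pattern below mirrors Katib's default (a word-like metric name, `=`, then a possibly signed or scientific-notation number), but verify the exact pattern against the const.go link above:

```python
import re

# A metrics filter in the same shape as Katib's default StdOut filter;
# confirm the exact pattern against const.go before relying on it.
METRIC_RE = re.compile(r"([\w|-]+)\s*=\s*([+-]?\d*(?:\.\d+)?(?:[Ee][+-]?\d+)?)")

def parse_metrics(line):
    # Return {metric_name: float_value} for every match in one log line.
    return {m.group(1): float(m.group(2)) for m in METRIC_RE.finditer(line)}

print(parse_metrics("2023-07-20T17:13:38Z INFO accuracy=0.9736842105263158"))
```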
I think your Experiment should work with the StdOut Metrics Collector, @mChowdhury-91. Can you please share the logs from one of your Trials?
@andreyvelich This is the log from the Trial Pod with `logging.info(f"accuracy={accuracy}")`:

```
I0720 11:20:33.136977 29 main.go:342] Trial Name: iris-log-8xnppxc5
I0720 11:20:36.148952 29 file-metricscollector.go:118] Objective metric accuracy is not found in training logs, unavailable value is reported
I0720 11:20:36.158568 29 main.go:399] Metrics reported. : metric_logs:<time_stamp:"0001-01-01T00:00:00Z" metric:<name:"accuracy" value:"unavailable" > >
```
Can you get logs from the pods with the `--all-containers` option?
Maybe your logs are written to the log_path ? Can you just try to run your code locally with some test parameters to check if logs are printed to the stdout ?
Output when I run this:

```
$ python3 iris.py
2023-07-20T17:13:38Z INFO accuracy=0.9736842105263158
```
@mChowdhury-91 Can you please describe your Trial pod for me?

```shell
kubectl get pod <trial-pod-name> -n kubeflow -o yaml
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 11204fb3ae5fd95859bbd9e159480fd52d5da787f6ad4088b994cdd0ea5a8ce7
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
    sidecar.istio.io/inject: "false"
  creationTimestamp: "2023-07-20T11:20:31Z"
  generateName: iris-log-8xnppxc5-
  labels:
    controller-uid: a7f8ff3e-1013-485c-bf23-ef0caa0cb765
    job-name: iris-log-8xnppxc5
  name: iris-log-8xnppxc5-ddbkr
  namespace: mlp-profile
  ownerReferences:
```
I think in your updated code you forgot to set up the logging config: https://github.com/kubeflow/katib/issues/2175#issuecomment-1643374502. That is why you don't see logs inside your K8s container (locally that is not required). Try adding the following to your training script:

```python
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO,
)
```

If it still doesn't work, try adding `python3 -u /app/iris_log.py` to your start command.
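Putting the two fixes together, here is a minimal sketch of the reporting side. The explicit `stream=sys.stdout` is my addition for clarity; Python's `logging` defaults to stderr, and the container log that the StdOut collector reads normally contains both streams, so the plain `basicConfig` above should also work:

```python
import logging
import sys

def report_metric(name, value):
    # The StdOut collector looks for "name=value" pairs in the container log.
    line = f"{name}={value}"
    logging.info(line)
    return line

if __name__ == "__main__":
    logging.basicConfig(
        format="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        level=logging.INFO,
        stream=sys.stdout,  # assumption: route logs to stdout; logging defaults to stderr
    )
    report_metric("accuracy", 0.9736842105263158)
```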
@andreyvelich @tenzen-y Thanks, now it works. All the Trial runs completed successfully and the Experiment status on the Katib UI shows success. But in the cluster, the main pod (suggestion container) remains in the Running status and is never terminated. I have seen this behaviour in the past as well for the tf-mnist and pytorch-mnist examples. Any comment on that?
These are the logs from the suggestion container:

```
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 1 new Trial
```
That is correct behaviour, since you use Katib version 0.12. In that version the default is `ResumePolicy=LongRunning`, which allows you to restart your Experiment at any time by changing the `maxTrialCount` parameter. In that case the Suggestion pod is always running.
In the recent releases we use `ResumePolicy=Never` as the default resume policy, which does not allow you to restart an Experiment and cleans up the Suggestion pod.
You can learn more about it in this doc: https://www.kubeflow.org/docs/components/katib/resume-experiment/#resume-succeeded-experiment
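For reference, on Katib 0.12 the policy can be set explicitly as a top-level field in the Experiment spec. A trimmed sketch of where it goes (other fields abbreviated from the Experiment above; check the resume-experiment doc for details):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-log
spec:
  resumePolicy: Never  # clean up the Suggestion pod when the Experiment finishes
  objective:
    type: maximize
    objectiveMetricName: accuracy
```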
@andreyvelich So we need to update only the katib-controller, right? Which is the latest version? Also, in Katib 0.12, can we explicitly set `ResumePolicy=Never` so that we don't need an update?
> Also in Katib 0.12 version, can we explicitly add ResumePolicy=Never so that we don't need an update.
Yeah, that also should work.
> So we need to update only the katib-controller right? Which is the latest version?
The latest version is Katib 0.15: https://github.com/kubeflow/katib/releases/tag/v0.15.0. I would suggest updating all components together, since there might be changes in other components too.
If you use Katib without Kubeflow installation, you can update it as follows:
```shell
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.15.0"
```
@mChowdhury-91 Feel free to re-open this issue if you have any follow-up questions.
/kind bug
What steps did you take and what happened: I have been trying to create a simple Katib Experiment with the sklearn iris dataset, but I am facing the error "Objective metric accuracy is not found in training logs, unavailable value is reported": `metric:<name:"accuracy" value:"unavailable" >`
Below is my code:

```python
import argparse
import os
import hypertune
import logging
import pandas as pd

# YOUR IMPORTS HERE
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--neighbors', type=int, default=3, help='value of k')
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--logger", type=str, choices=["standard", "hypertune"],
                        help="Logger", default="standard")
    args = parser.parse_args()


if __name__ == '__main__':
    main()
```
Below is my yaml file:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: iris-1
spec:
  parallelTrialCount: 1
  maxTrialCount: 2
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
```
What did you expect to happen: The metrics should have been collected and the Trials should have succeeded.
Environment:

- Kubernetes version (`kubectl version`):
- OS (`uname -a`):

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍