kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Experiment stuck due to hitting `Suggestion` custom resource size limits #1847

Open nielsmeima opened 2 years ago

nielsmeima commented 2 years ago

/kind bug

What steps did you take and what happened: Submitting a large experiment (i.e. one resulting in a large number of trials, in this case ~14500, from 4 hyperparameters with 10/11 values each) causes the Suggestion custom resource to hit the size limit Kubernetes imposes on custom resources, because all suggestions are stored in this one resource. The Katib controller then outputs the error Request entity too large when trying to update the Suggestion custom resource, and the experiment cannot progress. This issue seems to describe the exact problem.

Argo Workflows seems to have encountered the same problem, described here and solved it by allowing for 1) compression of the data stored in the status field of the custom resource and 2) storage of information under the status field in a relational database as described here.

What did you expect to happen: I expected Katib to be able to handle search spaces of arbitrary size.

Anything else you would like to add: A workaround is to manually split the experiment into smaller sub-experiments to stay under the size limit of custom resources. Ideally, this would be solved by following an approach similar to the one Argo uses for their Workflow custom resources.
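To illustrate the splitting workaround, here is a minimal sketch (the helper and naming scheme are hypothetical, not Katib functionality): fix one hyperparameter per sub-experiment so that each Suggestion custom resource only has to hold a fraction of the grid.

# Hypothetical helper: fix one hyperparameter per sub-experiment so that each
# sub-experiment (and therefore each Suggestion CR) only holds ~1/11 of the grid.
b_values = ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000",
            "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]

def sub_experiments(base_name, fixed_param, fixed_values):
    """Yield (experiment_name, fixed_assignment) pairs, one per sub-experiment."""
    for i, value in enumerate(fixed_values):
        yield f"{base_name}-{fixed_param}-{i}", {fixed_param: value}

for name, fixed in sub_experiments("debug", "b", b_values):
    # Each pair would be rendered into its own Experiment manifest, with the
    # fixed parameter removed from spec.parameters and hard-coded in the trialSpec.
    print(name, fixed)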


Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

johnugeorge commented 2 years ago

Thanks for creating this issue. Can you provide more info about the experiment yaml and other relevant information for reproducibility?

MaxTrials - 14500
Parameter1, Parameter2, Parameter3, Parameter4 - 11 values each
ParallelTrials - ?

nielsmeima commented 2 years ago

Yes, please see below for the experiment yaml as well as the other dependencies needed to reproduce the experiment. I create a configmap (k create configmap script --from-file=run.py=run.py) from a run.py file that mocks my actual implementation (which runs into the same issue). The configmap then gets mounted into the main container of the experiment, which takes in the parameters and produces a value for the objective of interest. I have also attached the run.py file below.

Experiment

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: debug
spec:
  objective:
    type: maximize
    goal: 500
    objectiveMetricName: cost
  algorithm:
    algorithmName: grid
  parallelTrialCount: 20
  maxTrialCount: 14641
  maxFailedTrialCount: 2000
  parameters:
    - name: a
      parameterType: double
      feasibleSpace:
        min: "1.30"
        max: "1.41"
        step: "0.01"
    - name: b
      parameterType: categorical
      feasibleSpace:
        list: ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000", "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
    - name: c
      parameterType: double
      feasibleSpace:
        min: "1.30"
        max: "1.41"
        step: "0.01"
    - name: d
      parameterType: categorical
      feasibleSpace:
        list: ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000", "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
  trialTemplate:
    retain: false
    primaryContainerName: training-container
    trialParameters:
      - name: a
        reference: a
        description: ""
      - name: b
        reference: b
        description: ""
      - name: c 
        reference: c
        description: ""
      - name: d 
        reference: d
        description: ""
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/python:alpine3.15
                volumeMounts:
                - name: script
                  mountPath: /app/run.py
                  subPath: run.py
                command:
                  - "python3"
                  - "/app/run.py"
                  - "${trialParameters.a}"
                  - "${trialParameters.b}"
                  - "${trialParameters.c}"
                  - "${trialParameters.d}"
            restartPolicy: Never
            volumes:
              - name: script
                configMap:
                  name: script

The mock implementation run.py

import sys
import time

# Simulate some work so trials do not finish instantly.
time.sleep(4)
# Mock objective: the "cost" is just the sum of the hyperparameter values.
cost = sum([float(x) for x in sys.argv[1:]])
# Katib's stdout metrics collector parses this name=value line.
print(f"cost={cost}")

EDIT: to give you an idea of the error messages coming from the katib-controller (this is a slightly different experiment, but the errors are identical):

{"level":"info","ts":1649925772.8911624,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"simulation/simulation-nr-fb","Suggestion Requests":8165,"Suggestion Count":8143}
{"level":"info","ts":1649925774.5568578,"logger":"suggestion-client","msg":"Getting suggestions","Suggestion":"simulation/simulation-nr-fb","endpoint":"simulation-nr-fb-grid.simulation:6789","Number of current request parameters":22,"Number of response parameters":22}
{"level":"info","ts":1649925775.6414711,"logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":"simulation/simulation-nr-fb","err":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2100613 vs. 2097152)"}
robertzsun-dev commented 1 year ago

I have the same problem running Sobol suggestions. You basically just need enough trials and it will collapse on you. The first failure is the etcd request size being too large, which you can "fix" by increasing the etcd max request size, but then you run into the issue in this thread, which is that we are hitting CRD size limits. I don't think you can work around this.

{"level":"info","ts":1684216616.817069,"logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":"argo/katib-hyperparam-opt-feature-update","err":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2103483 vs. 2097152)"} 
andreyvelich commented 1 year ago

Hi @robertzsun-dev, could it be a limitation of the Goptuna algorithm that we use for Sobol? @c-bata Do we have any limit on the maximum number of Trials in Goptuna?

Also, related issue: https://github.com/kubeflow/katib/issues/1058.

robertzsun-dev commented 1 year ago

@andreyvelich No, I don't think it's a limitation of Goptuna. I can see the suggestion controller offering valid new suggestions in the suggestion loop.

I think the issue is due to Katib's architecture: https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go#L191

We can see that new suggestions are simply appended to the suggestionsv1beta1.Suggestion resource's Status field (under Status.Suggestions).

I don't think it's K8s-kosher to just keep adding data to the Status field of a K8s resource. This YAML list gets extremely long when you have 1000s of Suggestions, and especially when you have lots of Parameter Assignments, it actually grows to megabytes in size, eventually exceeding the limit that K8s places on resource size in etcd. That is the error I posted above.
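To put rough numbers on it, a back-of-envelope estimate (the per-entry byte sizes are assumptions, not measurements):

# Illustrative estimate only: approximate size of Status.Suggestions for the
# grid experiment above (4 parameters, 11 values each => 14641 trials).
per_assignment_bytes = 60      # assumed serialized size of one name/value pair
per_suggestion_overhead = 80   # assumed trial name, labels, and list structure
num_parameters = 4
num_trials = 14641

status_bytes = num_trials * (per_suggestion_overhead + num_parameters * per_assignment_bytes)
print(f"~{status_bytes / 2**20:.1f} MiB")  # ~4.5 MiB, far above etcd's ~1.5 MiB default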

robertzsun-dev commented 1 year ago

Regarding the related issue https://github.com/kubeflow/katib/issues/1058:

I don't think it's directly related, but it is indirectly related in that Katib is not really built for running mass-scale experiments. For ML, the number of hyperparameters may not be high, and you may not need to run 1000's to 10000's of experiments. But for other things (like what I am doing), we do want to run tons of experiments.

Both the fact that Suggestions/Trials are stored as etcd-backed resources, and the other linked issue, which prevents long-running suggestion calls (necessary because with 1000s of trials it takes a while to compute the next suggestion) from working, prevent users from achieving scale with Katib.

nielsmeima commented 1 year ago

I think @robertzsun-dev is correct in his assessment. At work we moved to a custom system (non-Katib) for performing large scale (> 20k) experiments which does not rely on the Kubernetes custom resource model.

andreyvelich commented 1 year ago

Thanks for the information @robertzsun-dev. Yes, you are right: the etcd default size limit is 1.5 MiB, which makes it impossible to store a large chunk of data in the Custom Resource.

I understand that ordinary HP Tuning Experiments might not require 10000 Trials, but for some cases it might be useful. Since Katib allows you to use optimisation algorithms for any type of task (as long as a Trial is set), we can find a workaround for it.

@robertzsun-dev @nielsmeima Can you please describe your use case where you need to run Experiments with more than 10000 Trials?

As a solution, we can store such information in Katib DB instead of Suggestion CR or Experiment CR.

cc @johnugeorge @tenzen-y @gaocegege

tenzen-y commented 1 year ago

I would propose we set a limit on the number of AlgorithmSettings and Suggestions. Then, if the limit is exceeded, the katib-controller creates configMaps to write the results of experiments and cleans up the AlgorithmSettings and Suggestions.

https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/pkg/apis/controller/suggestions/v1beta1/suggestion_types.go#L46-L49

https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/pkg/apis/controller/suggestions/v1beta1/suggestion_types.go#L54-L55

wdyt?

andreyvelich commented 1 year ago

Then, if the limit is exceeded, the katib-controller creates configMaps to write the results of experiments and cleans up the AlgorithmSettings and Suggestions.

I think, storing such information in the ConfigMap might also be a problem, since the limit is 1 Mb: https://kubernetes.io/docs/concepts/configuration/configmap/#motivation

That is why I suggested storing that info in the Katib DB, if that is possible.

tenzen-y commented 1 year ago

I think, storing such information in the ConfigMap might also be a problem, since the limit is 1 Mb: https://kubernetes.io/docs/concepts/configuration/configmap/#motivation

Yes, that's right. I was thinking of creating multiple configMaps.

andreyvelich commented 1 year ago

I see. @tenzen-y any objections you see to store that info in the MySQL/Postgres DB ?

tenzen-y commented 1 year ago

I see. @tenzen-y any objections you see to store that info in the MySQL/Postgres DB ?

Speaking from the perspective of an on-prem cluster administrator:

Currently, even if the katib-db crashes, it is easy to check the results of experiments, since the CRs keep the experiment results in etcd.

But storing the results of experiments in the katib-db increases the importance of the katib-db. I would rather not increase the number of high-importance storage systems.

robertzsun-dev commented 1 year ago

Thanks for the information @robertzsun-dev. Yes, you are right: the etcd default size limit is 1.5 MiB, which makes it impossible to store a large chunk of data in the Custom Resource.

I understand that ordinary HP Tuning Experiments might not require 10000 Trials, but for some cases it might be useful. Since Katib allows you to use optimisation algorithms for any type of task (as long as a Trial is set), we can find a workaround for it.

@robertzsun-dev @nielsmeima Can you please describe your use case where you need to run Experiments with more than 10000 Trials?

As a solution, we can store such information in Katib DB instead of Suggestion CR or Experiment CR.

cc @johnugeorge @tenzen-y @gaocegege

My use case is heuristic-driven learning and other types of probabilistic models. Training or learning complex behaviors from scratch (for me, in robotics) is not really feasible, or takes a long time. We introduce a set of heuristics to inform the algorithms. We can do this in multi-step processes, with heuristics-only approaches and then training with heuristics, and so on. Generally, evaluating the heuristics takes a very short amount of time, maybe 5 - 20 minutes, and we want to hyperparameter search over this heuristic space.

One such case: say we will do something if the distance to an object is < X meters. What should X be? There might be many such heuristics, each with their own < Y, > Z, etc. Together, the combination of heuristics and hyperparameters chosen can greatly affect the quality of the algorithm. We can even search over how to weigh these heuristics against each other.

So we use Katib and mass-scale hyperparameter search to close in on a good set of hyperparameters. We can even do a sort of distributed coordinate descent by first searching the X, Y, Z's above, then searching the weights, then going back to X, Y, Z, and so on. Once we arrive at a good set of heuristics and weights, we can do some more learning on top. The possibilities are endless.

This is just one of many use cases, but highlights the value of mass-scale hyperparameter search.

robertzsun-dev commented 1 year ago

I see. @tenzen-y any objections you see to store that info in the MySQL/Postgres DB ?

Speaking from the perspective of an on-prem cluster administrator:

Currently, even if the katib-db crashes, it is easy to check the results of experiments, since the CRs keep the experiment results in etcd.

But storing the results of experiments in the katib-db increases the importance of the katib-db. I would rather not increase the number of high-importance storage systems.

I tend to think a traditional DB is really the only way to do this properly. What makes etcd more robust or better for this application? The fact that it is HA?

Etcd was chosen by K8S for its great consistency properties and ability to do leader election very well. Is this necessarily a need for Katib?

If we are worried about DB crashes or loss of data with a centralized DB, we could use Redis HA with AOF. But I am honestly fine with a single-point-of-failure Postgres. You could also leave it to the user how they want to implement the DB backend, as long as it supports some API. So have a built-in Postgres (only enabled if people want to use it, otherwise fall back to the etcd way) and then let users roll their own DB if they want HA, auto-backup, or whatever pleases them. Katib is not so important (as it is not a production system) that I need HA that badly (at least for me).

andreyvelich commented 1 year ago

Thanks for the explanation @robertzsun-dev. I've added this item for discussion in one of the upcoming Katib Community Meetings: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.vqsljon7kcug Let's find a solution for running these massive-scale HP tuning Experiments with Katib.

Also, we should consider adding this problem to one of our ROADMAP items for 2023: https://github.com/kubeflow/katib/pull/2153.

This sounds like a problem for using Katib at scale.

robertzsun-dev commented 1 year ago

@andreyvelich

I just realized the issue I posted here isn't just related to the max etcd request size. Once you increase that limit within etcd, you get a gRPC max message size error:

https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/pkg/client/controller/clientset/versioned/typed/suggestions/v1beta1/suggestion.go#LL62C20-L62C20

The katib controller error is: "rpc error: code = ResourceExhausted desc = trying to send message larger than max (2151219 vs. 2097152)"

So even if the etcd size limit is raised, the gRPC message size limit is still too small.

robertzsun-dev commented 1 year ago

@andreyvelich

Is there a way to give different configs for the rest client here via the katib configmap?

https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/pkg/client/controller/clientset/versioned/typed/suggestions/v1beta1/suggestions_client.go#LL46C12-L46C12

Maybe we can pass in a larger gRPC max message size.

andreyvelich commented 1 year ago

Yes, we also noticed during our discussion that we have some limitations on the gRPC side.

I think we can specify the max message size for gRPC here: https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go#L94

Note that the Suggestion servers should also be able to receive such a big message (we need to decide how to pass such settings to the Suggestions, e.g. via env vars and the Katib Config): https://github.com/kubeflow/katib/blob/b9dc63efb55bec19bdd80ca6f2930a65795f7c41/cmd/suggestion/hyperopt/v1beta1/main.py#L27
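For reference, a minimal sketch of what that could look like on the Python Suggestion server side; the KATIB_GRPC_MAX_MESSAGE_SIZE env var here is hypothetical, not an existing Katib setting:

import os
from concurrent import futures
import grpc

# Hypothetical env var; the real mechanism (env var vs Katib Config) is still to be decided.
max_msg = int(os.getenv("KATIB_GRPC_MAX_MESSAGE_SIZE", str(4 * 1024 * 1024)))
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=[
        ("grpc.max_receive_message_length", max_msg),
        ("grpc.max_send_message_length", max_msg),
    ],
)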

Maybe we can add such a feature to the Katib Config once we redesign the UX for it: https://github.com/kubeflow/katib/issues/2150 cc @tenzen-y

But we should also decide whether increasing the etcd and gRPC limits is the correct approach for large-scale Experiments.

tenzen-y commented 1 year ago

Uhm... Whichever way we select (DB vs CRs with configMaps), it would be better to summarize the assumptions in a document. I'm concerned that we're missing something else.

johnugeorge commented 1 year ago

@tenzen-y CRD resources can be freed by moving items to the DB (as a backup), but having DBs in the active control path is not a good idea. It is difficult to keep the data consistent between etcd and the DB.

robertzsun-dev commented 1 year ago

@andreyvelich , @tenzen-y

Increasing the etcd and gRPC limits is definitely not the right solution here. The limits are there for a reason: etcd gets slow and loses its performance guarantees with large key-value pair sizes, and gRPC is not really designed to pass such large amounts of data. At what point do you set the max? There will always be another max.

They are short-term fixes that I hoped were already implemented so I could bypass the issue, but if you have to write extra code to get this through, I'd recommend against it.

Do you think I can help in any way? I can start by helping generate this document, or we can meet to "pair program" it.

I can also help in the roadmap or architecture meetings by giving some advice/opinions/reviews. It might even be good if I could help with development, but I am not really a Go developer. Willing to learn, though. I have a bad feeling that doing this properly involves a bit of re-architecture of Katib. We have to get rid of instances where the CRD resource is used for "data storage". The suggestion service <-> katib controller interaction probably needs something smarter. I am almost in favor of monolithing them together. If there is not a "real need" for a separate service, or truly stateless computation, there is no need for an external service.

I can ask the architects at my company what kinds of architectures might be good for the suggestion service problem.

tenzen-y commented 1 year ago

CRD resources can be freed by moving items to the DB (as a backup), but having DBs in the active control path is not a good idea. It is difficult to keep the data consistent between etcd and the DB.

@johnugeorge Yes, that's right. IIRC, we considered the issue during the older alpha API days. So, I proposed using CRDs + configMaps in the above comment.

tenzen-y commented 1 year ago

etcd gets slow and loses its performance guarantees with large key-value pair sizes

Yes, that's right. However, I think we can calculate the worst case and limit the amount the controller can write to CRs and configMaps.

In k/k, we do that when designing APIs.

https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job#scalability

robertzsun-dev commented 1 year ago

I asked around and it seems like the best architecture would be backing suggestions with a database instead of in the CRD resource.

The suggestion service still has to be a separately deployed service, since you might have "unlimited" experiments running at the same time and you don't want the katib controller to use up infinite resources. It is still good to separate the two.

However, which suggestions have been used and which suggestions map to which trials should be backed by a database. That way the controller and suggestion service still hold no state and are robust to restarts. We can do pagination-style messaging if we want the controller and suggestion service to iterate in lockstep, or just point to the indexes or some other column identifier in the request.
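As a purely hypothetical sketch of such a DB-backed, paginated store (none of these names exist in Katib today, and SQLite just stands in for MySQL/Postgres):

import json
import sqlite3
from dataclasses import dataclass

@dataclass
class Suggestion:
    trial_name: str
    assignments: dict  # parameter name -> value

class SuggestionStore:
    """Cursor-paginated storage for suggestions, keyed by experiment."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS suggestions ("
            "id INTEGER PRIMARY KEY, experiment TEXT, trial_name TEXT, assignments TEXT)"
        )

    def append(self, experiment, suggestion):
        self.db.execute(
            "INSERT INTO suggestions (experiment, trial_name, assignments) VALUES (?, ?, ?)",
            (experiment, suggestion.trial_name, json.dumps(suggestion.assignments)),
        )
        self.db.commit()

    def page(self, experiment, after_id=0, limit=100):
        """Return at most `limit` suggestions with id > after_id, plus the next cursor."""
        rows = self.db.execute(
            "SELECT id, trial_name, assignments FROM suggestions "
            "WHERE experiment = ? AND id > ? ORDER BY id LIMIT ?",
            (experiment, after_id, limit),
        ).fetchall()
        next_cursor = rows[-1][0] if rows else after_id
        return rows, next_cursor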

tenzen-y commented 1 year ago

I asked around and it seems like the best architecture would be backing suggestions with a database instead of in the CRD resource.

Does "around" mean this issue? Or other places? If that means other places, can you share that? I'm interested in that.

However, which suggestions have been used and which suggestions map to which trials should be backed by a database.

IIUC, currently the katib-controller saves the information to Suggestion resources, and Suggestion resources aren't automatically removed even once the Experiment is completed.

So I believe we can write the information to the Suggestion resource, and then, if the number of entries in the suggestions map reaches the limit, back up the suggestions map to a configMap and flush the Suggestion status.

Is your concern the case where you want to temporarily stop an Experiment (which removes the Experiment)? If so, as you say, all information is lost since the controller removes all CRs.

So I think we should introduce a cancel semantic (currently not supported) instead of persistently saving to a DB.

ref: #934

tenzen-y commented 1 year ago

This means I proposed 2 features:

  1. When the number of entries in the suggestions map in the Suggestion status reaches the limit, the controller backs up the suggestions map to a configMap and flushes the suggestions map in the Suggestion status (a rough sketch is given below the list).
  2. Support a cancel operation. Without deleting Experiments (and without losing the suggestions map), users can stop experiments and release computing resources.
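
A rough, language-agnostic sketch of proposal 1 (written in Python only for brevity; the limits and helper names are illustrative, not an actual design):

MAX_STATUS_SUGGESTIONS = 1000   # assumed limit derived from a worst-case size calculation
CONFIGMAP_CHUNK = 1000          # assumed chunk size, kept well under the ~1 MiB ConfigMap cap

def maybe_flush(suggestion_status, create_configmap, backup_index):
    """If the in-status suggestions map exceeds the limit, back it up to ConfigMaps
    in chunks and return an emptied status list plus the next backup index."""
    if len(suggestion_status) <= MAX_STATUS_SUGGESTIONS:
        return suggestion_status, backup_index
    for start in range(0, len(suggestion_status), CONFIGMAP_CHUNK):
        chunk = suggestion_status[start:start + CONFIGMAP_CHUNK]
        create_configmap(name=f"suggestion-backup-{backup_index}", data=chunk)
        backup_index += 1
    return [], backup_index
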
robertzsun-dev commented 1 year ago

Haha, I asked my coworkers, who have far more experience than I do writing operators and other distributed systems (with or without K8s), and we quickly arrived at the DB solution as the most scalable.

I wasn't really thinking about cancel, but it is a good feature to have. I can "fake" a cancel by just updating the max number of trials to be the current number of trials already run. Or maybe set the number of parallel trials to 0? I've never tried it, but it could work.

I was mainly thinking about the high-level architecture of Katib, but I may be wrong, so please correct me if so:

  1. The split between the Katib Controller and the Suggestion Service. The suggestion service is a separate component for scalability at the "experiment" level. You will want to spin up a new suggestion service per experiment to make sure you are capable of running 100s to 1000s of experiments. This allows the suggestion algorithm calculation to be distributed properly.

  2. The stateless microservice architecture of the Suggestion Service. The suggestion service should have no state, and serve purely as a computational service. This means every previous suggestion and trial result has to be passed to the suggestion service on each call, as well as how many total suggestions we want (incl. new trials). This is a pretty big load... and can get seriously large. A bit of an annoying architecture, but it is "safe" from the perils of stateful services.

I don't foresee how configmaps can be scalable for this problem. A database effectively offloads the state storage of the Katib controller so it's outside of process memory. It makes it easier to pass this storage/state along to other services. It also makes it easy to paginate, which is critical for a scalable number of suggestions.

I'm not incredibly attached to any particular proposal or architecture, but I do think internet-text back-and-forth will not get the points/analysis across effectively.

andreyvelich commented 1 year ago

I wasn't really thinking about cancel, but it is a good feature to have. I can "fake" a cancel by just updating the max number of trials to be the current number of trials already run.

Yes, that should work @robertzsun-dev.

This allows the suggestion algorithm calculation to be distributed properly.

That's correct. You can read more about the Suggestion proposal that was introduced by @gaocegege in 2019 here: https://github.com/kubeflow/katib/blob/master/docs/proposals/suggestion.md

The suggestion service should have no state, and serve purely as a computational service.

I believe it is not always true. Some Suggestion services have state. For example, we store recorded Trials for the SkOpt Optimize Suggestion. That allows us to tell SkOpt about only the newly created Trials: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/skopt/base_service.py#LL110C43-L110C58.
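Roughly, that stateful pattern looks like this (a simplified sketch of the idea, not the actual base_service.py code; the attribute names are illustrative):

class StatefulSuggestionService:
    """Keeps in-memory state so the optimizer is only told about new Trials."""

    def __init__(self, optimizer):
        self.optimizer = optimizer      # e.g. a skopt.Optimizer instance
        self.recorded_trials = set()    # names of Trials already reported to the optimizer

    def get_suggestions(self, completed_trials, request_number):
        # Report only Trials the optimizer has not seen yet.
        for trial in completed_trials:
            if trial.name not in self.recorded_trials:
                self.optimizer.tell(trial.parameter_values, trial.objective_value)
                self.recorded_trials.add(trial.name)
        # Ask for the next batch of assignments; this state is lost if the pod restarts.
        return [self.optimizer.ask() for _ in range(request_number)]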

I think we can start a Google doc to collaborate on whether we should choose the DB approach or ConfigMaps. After that, we can convert it into one of the Katib proposals. WDYT @tenzen-y @robertzsun-dev @johnugeorge @nielsmeima?

Historically, we've been using the Katib DB to store data (e.g. metrics) that we can't store in etcd. Also, usually only the CurrentOptimalTrial is required for the Experiment results. IMO, if we use the ConfigMap approach, do we really need the Katib DB to store metrics?

gaocegege commented 1 year ago

I believe it is not always true. Some Suggestion services have state. For example, we store recorded Trials for the SkOpt Optimize Suggestion. That allows us to tell SkOpt about only the newly created Trials: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/skopt/base_service.py#LL110C43-L110C58.

Yeah, some algorithms we use (in skopt and some others) require state.

Personally, I prefer config maps, but I look forward to the proposal.

robertzsun-dev commented 1 year ago

Sounds good, happy to contribute to the doc.

@gaocegege - what would happen if the suggestion pod crashes or gets preempted and it loses the in-memory state?

tenzen-y commented 1 year ago

@robertzsun-dev Thanks for sharing.

I think we can start a Google doc to collaborate on whether we should choose the DB approach or ConfigMaps. After that, we can convert it into one of the Katib proposals. WDYT @tenzen-y @robertzsun-dev @johnugeorge @nielsmeima?

@andreyvelich Agree.

I'm happy to participate in the discussion on Google Docs, although my bandwidth for Katib is limited since I'm focusing on distributed training and job scheduling this quarter.

IMO, if we use the ConfigMap approach, do we really need the Katib DB to store metrics?

Good point. We may need to consider a cleaner architecture for the stable Katib version (v1).

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 1 year ago

/lifecycle frozen
/help

google-oss-prow[bot] commented 1 year ago

@andreyvelich: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubeflow/katib/issues/1847):

> /lifecycle frozen
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.