kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

client-go ListWatch may only get partial items when user provides a limit option #126770

Closed imliuda closed 3 weeks ago

imliuda commented 3 weeks ago

What happened?

There are situations where a user wants to always list resources from etcd. For example, with no limit option the server may return the full list from the apiserver cache; if there are lots of resources, the request may hit the apiserver's 60s timeout, so the user can provide a limit=500 option to force the request to go through etcd with pagination.
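
For context, here is a minimal sketch (not from the issue; the kubeconfig loading and the 500-per-page size are placeholders) of what such a user-provided limit looks like with a plain client-go list: paging through pods with Limit/Continue so each request is a bounded read served through etcd rather than one large response from the apiserver cache.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	var pods []corev1.Pod
	// An empty ResourceVersion plus a Limit yields a paginated, consistent
	// read that goes through etcd rather than one huge cached response.
	opts := metav1.ListOptions{Limit: 500}
	for {
		page, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		pods = append(pods, page.Items...)
		if page.Continue == "" {
			break // last page
		}
		opts.Continue = page.Continue
	}
	fmt.Printf("listed %d pods in pages of at most 500\n", len(pods))
}
```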

But the following code (an excerpt from client-go's ListPager) has a problem: when a ResourceExpired error occurs, it falls back to a full list, but the user's code re-applies the limit in TweakListOptionsFunc, so the fallback returns only 500 items. In other words, client-go currently does not support a user-provided limit option.

// Excerpt from k8s.io/client-go/tools/pager/pager.go:
func (p *ListPager) List(ctx context.Context, options metav1.ListOptions) (runtime.Object, bool, error) {
    if options.Limit == 0 {
        options.Limit = p.PageSize
    }
    // ...
    for {
        select {
        case <-ctx.Done():
            return nil, paginatedResult, ctx.Err()
        default:
        }
        obj, err := p.PageFn(ctx, options)
        if err != nil {
            if !errors.IsResourceExpired(err) || !p.FullListIfExpired || options.Continue == "" {
                return nil, paginatedResult, err
            }
            // Fallback path: clear Limit and Continue and retry as one full
            // list. If the user's TweakListOptionsFunc re-applies Limit=500
            // to this request, only a single page comes back.
            options.Limit = 0
            options.Continue = ""
            options.ResourceVersion = requestedResourceVersion
            options.ResourceVersionMatch = requestedResourceVersionMatch
            result, err := p.PageFn(ctx, options)
            return result, paginatedResult, err
        }
        // ... (accumulate items and continue with the next page)
    }
}

What did you expect to happen?

Support a user-supplied limit option, so there is a way for users to force requests to go through etcd and avoid the apiserver's 60s timeout. This is very important for long-running controllers, which usually perform only a small number of list operations. To some extent, this would also help reduce apiserver memory usage.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create 50000 pods.
  2. Start an informer with a TweakListOptionsFunc that sets option.Limit to 500 and sets option.ResourceVersion to "" if it is "0" (because the apiserver doesn't honor limit when resourceVersion is "0"); see the sketch after this list.
  3. While the list is in progress (we can see lots of pagination requests), execute an etcd compaction command: etcdctl compaction $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision' | tr -d '\n')
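
A minimal sketch of step 2, assuming a shared informer factory; the kubeconfig loading, resync period, and page size are placeholders rather than the reporter's actual code.

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Informer factory whose list requests are tweaked to paginate through etcd.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// Force pagination so each list request is a bounded read.
			opts.Limit = 500
			// The apiserver ignores Limit when resourceVersion is "0"
			// (serve-from-cache), so clear it to get a paginated list.
			if opts.ResourceVersion == "0" {
				opts.ResourceVersion = ""
			}
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	_ = podInformer // event handlers would be registered here in a real controller
	select {}       // keep the process running while reproducing
}
```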

Anything else we need to know?

No response

Kubernetes version

I think it is for all versions

Cloud provider

I think all clouds

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

k8s-ci-robot commented 3 weeks ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

HirazawaUi commented 3 weeks ago

/sig api-machinery

xuzhenglun commented 3 weeks ago

IMO, the client side should not be aware of the cache, because the cache should always preserve the same semantics as etcd. For the resourceVersion=0 and limit != 0 case, the behavior is indeed different, but this is the lesser of two evils. Bypassing the cache would lead to lots of problems; you can find more discussion here.

As for the long-term plan, WatchList has achieved good results now. In that design, the client side fetches objects using watch instead of list, and all objects are returned from the watch cache. It's an Alpha feature; you can enable it with a feature gate for the apiserver and an environment variable for client-go. You can find more detail here: KEP-3157.
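
As a rough sketch of enabling it (the gate name WatchList and the variable ENABLE_CLIENT_GO_WATCH_LIST_ALPHA reflect my reading of the Alpha state in KEP-3157; verify against the docs for your Kubernetes and client-go versions):

```console
# kube-apiserver: enable the WatchList feature gate
kube-apiserver --feature-gates=WatchList=true ...

# controller process using client-go: opt the reflector into streaming lists
export ENABLE_CLIENT_GO_WATCH_LIST_ALPHA=true
```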

imliuda commented 3 weeks ago

Actually, there are two problems:

  1. The apiserver doesn't honor limit when resourceVersion=0; this is already widely known (see the sketch below).
  2. With a TweakListOptionsFunc that sets limit=500 and sets resourceVersion to "" if it is "0" (to avoid problem 1), ListAndWatch may behave abnormally.
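
A minimal sketch demonstrating problem 1, assuming a standard clientset built from the local kubeconfig; with resourceVersion="0" the list is served from the watch cache, which does not paginate, so Limit is effectively ignored.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask for at most 1 pod, but from the cache (resourceVersion="0").
	list, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{Limit: 1, ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}
	// In a cluster with many pods this typically prints far more than 1 item.
	fmt.Printf("requested limit=1, got %d items (continue=%q)\n",
		len(list.Items), list.Continue)
}
```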

We are facing a problem where there are lots of jobs running in the cluster, but the controller cannot list all of them within 60s.

xuzhenglun commented 3 weeks ago

I see, you tweaked the underlying ListerWatcher inside the Reflector. When the Reflector decides to list the full contents, your TweakListOptionsFunc unconditionally sets a limit on the request and makes the Reflector behave abnormally.

Well, I have to say what you have done is somewhat tricky. IMO, some list options, e.g. limit and resourceVersion, are internal implementation details of the apiserver and the Reflector, and other code should not be aware of or change them. Additionally, the client side should not be aware of whether the cache is used, so something like rv="" going through to etcd is not guaranteed, as previously mentioned. The options only guarantee the semantics, not the implementation details.

To solve your timeout problem, maybe you can try increasing the timeout used for the list, or just try the WatchList feature.

imliuda commented 3 weeks ago

@xuzhenglun Thanks, I will try to solve my problem in some other ways.

xuzhenglun commented 3 weeks ago

I'm closing this issue now; feel free to reopen it or open a new issue if you have something else to discuss.

/closing

xuzhenglun commented 3 weeks ago

/close

k8s-ci-robot commented 3 weeks ago

@xuzhenglun: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/126770#issuecomment-2295760565):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.