vmware-archive / octant

Highly extensible platform for developers to better understand the complexity of Kubernetes clusters.
https://octant.dev
Apache License 2.0

Inconsistent resource viewer node collection performance #2342

Open mklanjsek opened 3 years ago

mklanjsek commented 3 years ago

While testing Resource Viewer performance, I noticed large timing inconsistencies in node collection for complex graphs. For example, here is the output for the default-kne-trigger custom resource, which contains 62 nodes:


```
--------------Done  62 in 523.447425ms
--------------Visit  default-kne-trigger
--------------Done  62 in 558.421153ms
--------------Visit  default-kne-trigger
--------------Visit  default-kne-trigger
--------------Done  62 in 525.344295ms
--------------Visit  default-kne-trigger
--------------Done  62 in 533.564831ms
--------------Visit  default-kne-trigger
W0419 09:09:17.256195    9364 reflector.go:424] /Users/mklanjsek/workspace/octant/internal/objectstore/dynamic_cache.go:389: watch of *unstructured.Unstructured ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode to metav1.Event") has prevented the request from succeeding
--------------Done  62 in 4.373597887s
--------------Visit  default-kne-trigger
--------------Done  62 in 595.31299ms
--------------Visit  default-kne-trigger
--------------Visit  default-kne-trigger
--------------Done  62 in 3.690218244s
--------------Visit  default-kne-trigger
--------------Done  62 in 539.459178ms
--------------Visit  default-kne-trigger
--------------Done  62 in 646.501354ms
--------------Visit  default-kne-trigger
--------------Done  62 in 5.333976468s
--------------Visit  default-kne-trigger
--------------Done  62 in 563.79564ms
--------------Visit  default-kne-trigger
--------------Done  62 in 556.619341ms
--------------Visit  default-kne-trigger
--------------Visit  default-kne-trigger
--------------Done  62 in 6.541030963s
--------------Visit  default-kne-trigger
--------------Visit  default-kne-trigger
W0419 09:10:13.989941    9364 reflector.go:424] /Users/mklanjsek/workspace/octant/internal/objectstore/dynamic_cache.go:389: watch of *unstructured.Unstructured ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode to metav1.Event") has prevented the request from succeeding
--------------Visit  default-kne-trigger
--------------Visit  default-kne-trigger
--------------Done  62 in 537.349612ms
--------------Visit  default-kne-trigger
--------------Done  62 in 555.102703ms
--------------Visit  default-kne-trigger
--------------Done  62 in 528.610102ms
--------------Visit  default-kne-trigger
W0419 09:10:50.185268    9364 reflector.go:424] /Users/mklanjsek/workspace/octant/internal/objectstore/dynamic_cache.go:389: watch of *unstructured.Unstructured ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode to metav1.Event") has prevented the request from succeeding
--------------Done  62 in 15.03836556s
--------------Visit  default-kne-trigger
--------------Done  62 in 23.622277884s
--------------Visit  default-kne-trigger
W0419 09:11:32.651555    9364 reflector.go:424] /Users/mklanjsek/workspace/octant/internal/objectstore/dynamic_cache.go:389: watch of *unstructured.Unstructured ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode to metav1.Event") has prevented the request from succeeding
--------------Done  62 in 13.051502819s
--------------Visit  default-kne-trigger
--------------Done  62 in 558.249046ms
--------------Visit  default-kne-trigger
```

Node collection for this resource normally takes around 550ms, but it fluctuates and can take as long as 23 seconds. Several of the slow runs directly follow the reflector watch errors logged from dynamic_cache.go. Can we improve error handling here to provide more consistent behavior?
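
To put numbers on the spread, the "Done" lines in output like the above can be parsed and summarized. The following is only a minimal sketch and is not part of Octant: `parseDoneDurations`, `summarize`, `summary`, and `sampleLog` are hypothetical names, and the sketch assumes every timing line has the shape `--------------Done  <count> in <duration>`, with the duration printed by Go's `time.Duration` formatting.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
	"time"
)

// sampleLog holds a few lines copied from the output in this issue.
const sampleLog = `--------------Done  62 in 523.447425ms
--------------Visit  default-kne-trigger
--------------Done  62 in 4.373597887s
--------------Visit  default-kne-trigger
--------------Done  62 in 23.622277884s
--------------Visit  default-kne-trigger
--------------Done  62 in 558.249046ms`

// summary holds the smallest, middle, and largest duration of a run set.
type summary struct {
	Min, Median, Max time.Duration
}

// parseDoneDurations extracts the duration from every
// "Done <count> in <duration>" line and skips all other lines.
func parseDoneDurations(log string) ([]time.Duration, error) {
	var durations []time.Duration
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Assumed shape: ["--------------Done", "62", "in", "523.447425ms"].
		if len(fields) != 4 || !strings.HasSuffix(fields[0], "Done") || fields[2] != "in" {
			continue
		}
		d, err := time.ParseDuration(fields[3])
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", fields[3], err)
		}
		durations = append(durations, d)
	}
	return durations, scanner.Err()
}

// summarize returns min, median, and max. For an even number of samples
// it reports the upper of the two middle values. The bool is false when
// there are no samples.
func summarize(durations []time.Duration) (summary, bool) {
	if len(durations) == 0 {
		return summary{}, false
	}
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return summary{
		Min:    sorted[0],
		Median: sorted[len(sorted)/2],
		Max:    sorted[len(sorted)-1],
	}, true
}

func main() {
	durations, err := parseDoneDurations(sampleLog)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if s, ok := summarize(durations); ok {
		fmt.Printf("runs=%d min=%s median=%s max=%s\n", len(durations), s.Min, s.Median, s.Max)
	}
}
```

Feeding the full log through the same functions would show how far the slow runs sit from the roughly 550ms baseline.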

mklanjsek commented 3 years ago

Here is how I measured timing in resourceviewer.go:

    // Start the timer before visiting any object.
    start := time.Now()
    for _, object := range objects {
        if object == nil {
            continue
        }
        fmt.Println("--------------Visit ", object.GetName())
        if err := rv.Visit(ctx, object, handler); err != nil {
            return nil, fmt.Errorf("unable to visit %s %s: %w",
                object.GroupVersionKind(),
                object.GetName(),
                err)
        }
    }
    // Report the number of collected nodes and the total elapsed time.
    fmt.Println("--------------Done ", len(handler.nodes), "in", time.Since(start))
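
The measurement above reports one total per collection pass. A variation that times each visit separately could help show which call stalls. This is a minimal, self-contained sketch under stated assumptions, not Octant code: `visitFunc` is a hypothetical stand-in for `rv.Visit`, and `timeVisits`, `visitTiming`, `fakeVisit`, and `runDemo` are illustrative names, with `fakeVisit` sleeping to simulate a slow object.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// visitFunc is a hypothetical stand-in for rv.Visit in the snippet above.
type visitFunc func(ctx context.Context, name string) error

// visitTiming records how long a single visit took.
type visitTiming struct {
	Name    string
	Elapsed time.Duration
	Slow    bool // true when Elapsed is at or above the threshold
}

// timeVisits calls visit once per name, measures each call on its own,
// and marks calls that take at least threshold as slow. It stops at the
// first error and returns the timings gathered so far.
func timeVisits(ctx context.Context, names []string, visit visitFunc, threshold time.Duration) ([]visitTiming, error) {
	timings := make([]visitTiming, 0, len(names))
	for _, name := range names {
		start := time.Now()
		err := visit(ctx, name)
		elapsed := time.Since(start)
		if err != nil {
			return timings, fmt.Errorf("unable to visit %s after %s: %w", name, elapsed, err)
		}
		timings = append(timings, visitTiming{Name: name, Elapsed: elapsed, Slow: elapsed >= threshold})
	}
	return timings, nil
}

// fakeVisit simulates a visit; "slow-object" sleeps to mimic a stall.
func fakeVisit(ctx context.Context, name string) error {
	if name == "slow-object" {
		time.Sleep(30 * time.Millisecond)
	}
	return nil
}

// runDemo times two simulated visits with a 15ms slow threshold.
func runDemo() ([]visitTiming, error) {
	return timeVisits(context.Background(), []string{"fast-object", "slow-object"}, fakeVisit, 15*time.Millisecond)
}

func main() {
	timings, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range timings {
		fmt.Printf("%s took %s (slow=%t)\n", t.Name, t.Elapsed, t.Slow)
	}
}
```

The threshold is arbitrary here. For the resource in this issue, a value somewhat above the usual 550ms would separate the normal runs from the multi-second ones.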