openconfig / kne

Apache License 2.0

Defer cEOS-lab pod check, update operator version #534

Closed frasieroh closed 4 months ago

frasieroh commented 4 months ago

A customer is expressing performance concerns at high scale (~100 instances across ~10 nodes). One of their findings is that cEOS-lab instances appear to start consecutively instead of in parallel.

Because the pod check was baked into (n *Node) Config instead of (n *Node) Status, KNE didn't create the next cEOS-lab custom resource object until the previous pod had started. With this change, they're created all at once.

The new operator version increases the number of reconciliation workers from 1 to runtime.NumCPU to cope with this change. It turns out the operator spends most of its time generating self-signed RSA certs, so depending on what the runtime does with the worker goroutines, there may be performance gains there.
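As a rough illustration of why more workers can help with CPU-bound key generation, the sketch below fans RSA key generation out over a fixed pool of goroutines using only the standard library. It is an assumption-laden example, not the operator's implementation: `generateKeys`, the node names, and the 2048-bit key size are hypothetical, and it generates bare keys rather than full self-signed certificates.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"runtime"
	"sync"
)

// generateKeys creates one RSA key per name using a fixed pool of workers.
// Hypothetical helper for illustration; not taken from the cEOS-lab operator.
func generateKeys(names []string, workers, bits int) (map[string]*rsa.PrivateKey, error) {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	var (
		mu       sync.Mutex // guards keys and firstErr
		wg       sync.WaitGroup
		keys     = make(map[string]*rsa.PrivateKey, len(names))
		firstErr error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				// Key generation is the CPU-heavy step, done outside the lock.
				key, err := rsa.GenerateKey(rand.Reader, bits)
				mu.Lock()
				if err != nil {
					if firstErr == nil {
						firstErr = err
					}
				} else {
					keys[name] = key
				}
				mu.Unlock()
			}
		}()
	}
	for _, name := range names {
		jobs <- name
	}
	close(jobs)
	wg.Wait()
	return keys, firstErr
}

func main() {
	names := []string{"ceos1", "ceos2", "ceos3", "ceos4"}
	keys, err := generateKeys(names, runtime.NumCPU(), 2048)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("generated %d keys with %d workers\n", len(keys), runtime.NumCPU())
}
```

With a single worker the keys are generated one after another; with one worker per CPU the Go scheduler can spread the work across cores, which is the kind of gain the comment above is speculating about.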

Thanks!

coveralls commented 4 months ago

Pull Request Test Coverage Report for Build 8963970828

Details


Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
topo/node/arista/arista.go | 19 | 20 | 95.0%
Totals Coverage Status
Change from base Build 8840821359: 0.06%
Covered Lines: 4634
Relevant Lines: 7110

💛 - Coveralls
chrisy commented 4 months ago

As the concerned customer, thanks for this. FWIW, I do have a hack that lets KNE avoid the local wait by launching all pods of a topology in parallel, with obvious concerns about simply shifting the problem to other APIs -- in this case, that's what made it apparent that the cEOS operator was itself somewhat serializing the work.

alexmasi commented 4 months ago

/gcbrun