Defer cEOS-lab pod check, update operator version

frasieroh commented 4 months ago

A customer is expressing performance concerns at high scale (~100 instances across ~10 nodes). One of their findings is that cEOS-lab instances appear to start consecutively instead of in parallel.

Because the pod check was baked into (n *Node) Config instead of (n *Node) Status, we didn't create the next cEOS-lab custom resource object until the previous pod had started. With this change the check is deferred, so the resources are created all at once.

The new operator version increases the number of reconciliation workers from 1 to runtime.NumCPU to cope with this change. It turns out the operator spends most of its time generating self-signed RSA certs, so depending on how the runtime schedules the worker goroutines, there may be performance gains there.

Thanks!

coveralls commented 4 months ago

Pull Request Test Coverage Report for Build 8963970828

Details

19 of 20 (95.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.06%) to 65.176%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
topo/node/arista/arista.go	19	20	95.0%
Total:	19	20	95.0%

Totals
Change from base Build 8840821359:	0.06%
Covered Lines:	4634
Relevant Lines:	7110

💛 - Coveralls

chrisy commented 4 months ago

As the concerned customer, thanks for this. FWIW, I do have a hack that lets KNE avoid the local wait by launching all pods of a topology in parallel, with obvious concerns about simply shifting the problem to other APIs -- in this case, that's what made it apparent that the cEOS operator was itself somewhat serializing the work.

alexmasi commented 4 months ago

/gcbrun

openconfig / kne