tumblr / collins

groovy kind of love
tumblr.github.com/collins
Apache License 2.0

CPU frequency is inconsistently collected and persisted #550

Open jyundt opened 7 years ago

jyundt commented 7 years ago

During new asset inductions, only CPU information for the first socket is persisted. CPU information for sockets 2 - N is discarded.

As an example, given the following CPU information from lshw, only CPU id 0 will be saved to the database. Note the difference in speed between the two sockets:

| Id | Cores | Threads | Speed | Description |
|----|-------|---------|-------|-------------|
| 0 | 8 | 16 | 1.560070 | Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz Intel Corp. |
| 1 | 8 | 16 | 1.480089 | Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz Intel Corp. |

As a result of this behavior, collins will drop information from the second socket:

| Id | Cores | Threads | Speed | Description |
|----|-------|---------|-------|-------------|
| 0 | 8 | 16 | 1.560070 | Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz Intel Corp. |
| 1 | 8 | 16 | 1.560070 | Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz Intel Corp. |

This problem originally manifested itself while troubleshooting failing lshw XML tests: https://github.com/tumblr/collins/pull/537#discussion_r117804273. We noticed that our servers were reporting different CPU speeds on different sockets as a result of dynamic frequency scaling. This caused tests in LshwHelperSpec to consistently fail.
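To make the mismatch concrete, here is a rough sketch of reading per-socket speeds out of lshw XML and flagging when they disagree. The XML fragment is hand-written to resemble `lshw -xml` output (processor nodes with a `<size units="Hz">` child) and is an assumption, not a real dump; `socket_speeds_ghz` and `speeds_differ` are hypothetical helpers and not part of collins or LshwHelperSpec.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like `lshw -xml` output (an assumption, not a
# real dump): each socket is a <node class="processor"> with <size> in Hz.
SAMPLE = """
<node id="core" class="bus">
  <node id="cpu:0" class="processor">
    <product>Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz</product>
    <size units="Hz">1560070000</size>
  </node>
  <node id="cpu:1" class="processor">
    <product>Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz</product>
    <size units="Hz">1480089000</size>
  </node>
</node>
"""


def socket_speeds_ghz(xml_text):
    """Return {node id: current speed in GHz} for every socket-level CPU node."""
    root = ET.fromstring(xml_text)
    speeds = {}
    for node in root.iter("node"):
        # Assumption: socket nodes have class "processor" and an id that
        # starts with "cpu" (e.g. "cpu:0"); anything else is skipped.
        if node.get("class") != "processor":
            continue
        if not node.get("id", "").startswith("cpu"):
            continue
        size = node.find("size")
        if size is not None and size.text:
            speeds[node.get("id")] = int(size.text) / 1e9
    return speeds


def speeds_differ(speeds, tolerance_ghz=0.001):
    """True when the sockets report speeds further apart than the tolerance."""
    values = list(speeds.values())
    return bool(values) and max(values) - min(values) > tolerance_ghz


speeds = socket_speeds_ghz(SAMPLE)
print(speeds)
print(speeds_differ(speeds))
```

A check like this could run against a test fixture before it is committed, so that a dump captured mid frequency scaling is caught early.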

As pointed out during the discussion in #537, a more appropriate fix would probably involve disabling dynamic frequency scaling in genesis to avoid different CPU speeds on different sockets.
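One possible shape for such a step, sketched in Python rather than as an actual Genesis task: write a fixed governor to each core's `scaling_governor` file under sysfs before lshw runs. The `set_governor` helper and its `sysfs_root` parameter are hypothetical. Pinning the governor reduces speed drift between sockets, but it may not remove it entirely on hardware that manages its own frequency states.

```python
import glob
import os


def set_governor(governor="performance", sysfs_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor file found.

    Returns the list of files written. On a real system this needs root.
    `sysfs_root` is a parameter only so the function can be tried against
    a scratch directory.
    """
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write(governor)
        written.append(path)
    return written
```

Calling `set_governor()` as root immediately before lshw collection would be the intended use; cores without a `cpufreq` directory are simply skipped.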

@byxorna @michaeljs1990

byxorna commented 7 years ago

A linked issue against Genesis should be created, to add/update a task to disable speed stepping before lshw collection.

So, collins only stores the CPU speed for the first socket, and only in one dimension? (i.e. CPU_SPEED_GHZ[0]). I wonder if there would be benefit of using dimensionality of tags to represent these values.

jyundt commented 7 years ago

> A linked issue against Genesis should be created, to add/update a task to disable speed stepping before lshw collection.

Will do, I can probably get a PR submitted for this as well.

> So, collins only stores the CPU speed for the first socket, and only in one dimension? (i.e. CPU_SPEED_GHZ[0]). I wonder if there would be benefit of using dimensionality of tags to represent these values.

Ugh, I think I have this flipped, it's CPU[N] (the last CPU) that will be stored, not CPU[0]. Sorry for the confusion. I just verified by modifying an lshw XML dump with different product/vendor/speed information and inserting the node into collins.

I don't really have a strong preference on this dimensionality either way. Ideally all CPUs should be identical, but this speed stepping tripped us up.
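For illustration only, here is a toy sketch of the two behaviors under discussion. It is not collins' persistence code: `flat_tags` shows one plausible way a single-valued tag ends up holding the last CPU's speed, and `dimensioned_tags` shows what keying on a (tag, dimension) pair would keep. Both function names and the dictionary layout are hypothetical.

```python
def flat_tags(cpus):
    """One value per tag: each CPU overwrites the previous one, so the last wins."""
    tags = {}
    for cpu in cpus:
        tags["CPU_SPEED_GHZ"] = cpu["speed_ghz"]
    return tags


def dimensioned_tags(cpus):
    """One value per (tag, dimension): every socket keeps its own entry."""
    return {
        ("CPU_SPEED_GHZ", index): cpu["speed_ghz"]
        for index, cpu in enumerate(cpus)
    }


# Speeds taken from the two sockets in the example above.
cpus = [{"speed_ghz": 1.560070}, {"speed_ghz": 1.480089}]

print(flat_tags(cpus))
print(dimensioned_tags(cpus))
```

With the flat layout only the second socket's 1.480089 survives, matching the "last CPU wins" result I saw; with dimensions both values are retained and a mismatch stays visible.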