MaxSagebaum closed this issue 4 years ago.
The scheduler decides the speed rating of each node based on how it actually performs, and it also distributes compile jobs based on those ratings. So when you build, your jobs are also distributed to other nodes, and it is not just your node that gets speed adjustments because of it; the other nodes get the same treatment. Is this only a theoretical concern, or do you have an actual real case where the performance is poor, and if yes, can you describe it in more detail?
Although I have no data to back it up, I expect that manually setting the speed would generally just degrade performance.
Yes, that is correct, and usually it should work. This morning we have just 3 icecc nodes running. The initial speed values are: my laptop (i7-7500U, Speed 250), Bahamut (i5-4670, Speed 600) and Titan (AMD Ryzen 7 1800X, Speed 230). After I compiled the CoDiPack test suite (lots of very small jobs, with debugging enabled) multiple times, the values changed to the following: my laptop 250, Bahamut 540, Titan 325. One would expect that my laptop and Bahamut would decrease in speed and that Titan should be the fastest, but for some unknown reason Bahamut stays the fastest. The scheduler runs on Bahamut.
This specific case does not really prove much. Although the speeds do not match the presumed actual speeds of those computers, that was already the case before you ran your test, and after the test they in fact moved towards what they should be. The scheduler tries to prioritize the fastest hosts, so if your build cannot saturate the cluster enough, nodes ranked as slower may not get enough work to improve their standing. The real question is how and why the nodes became ranked incorrectly in the first place.

You can get a better overview of what is happening if you run your test after a scheduler restart; that way you can watch, e.g. in Icemon, how the tasks are distributed and what speeds the nodes get. Also note that node load contributes to the perceived node speed, so a node that has local CPU load or is relatively low on memory will be penalized in its speed computation.
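To make the "load contributes to perceived speed" point more concrete, here is a minimal sketch of how such a rating could be computed. This is not the actual icecream code; the job fields, the 0.7/0.3 blending weights, and the load penalty are all assumptions for illustration only.

```cpp
// Hedged sketch (not the real icecream implementation): one plausible way a
// scheduler could derive a node's "speed" from finished jobs and penalize
// nodes that are busy or low on memory.
#include <algorithm>
#include <cstdint>

struct JobStat {
    uint64_t out_bytes;        // size of the compiler output produced
    uint32_t compile_time_ms;  // wall time the compile took on the node
};

// Raw throughput of a single job: output bytes per millisecond.
static double job_speed(const JobStat& j) {
    return j.compile_time_ms ? double(j.out_bytes) / j.compile_time_ms : 0.0;
}

// Blend the latest sample into the previous estimate, then scale it down by
// the node's current load. The weights and the scaling factor are invented.
static double node_speed(double previous_speed, const JobStat& latest,
                         unsigned load_permille /* 0..1000 */) {
    double blended = 0.7 * previous_speed + 0.3 * job_speed(latest);
    double load_factor = 1.0 - std::min(load_permille, 1000u) / 1000.0 * 0.5;
    return blended * load_factor;
}
```

With a scheme like this, a node that keeps receiving tiny debug jobs, or that carries local load while compiling, ends up with a rating that says little about its raw hardware.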
I still see no convincing argument, and no further info, closing.
I was wondering if this is final? The speed rating seems to be a bit off on my setup, e.g. my Intel i7-6500U-based subnotebook has a speed rating of 256, while the rather beefy i9-9900 only has a speed rating of 160.
The speed rating may change over time, but it really depends on how the compile jobs are scheduled. Furthermore, I have often come across situations where large translation units don't finish on remote machines in time and are then rescheduled locally (icecream-sundae lists them as %). In a way I suspect that this leads to the following: when the "fast" machine does the build and the "slow" machine times out, the long compile time of the large translation unit is never used for inferring the speed of the slow worker, but it does make the fast machine appear slower than it is.
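To spell that suspicion out with numbers, here is a toy calculation (all values are invented, and the averaging is deliberately naive) of how a timed-out remote job that gets rebuilt locally could drag down the fast node's rating while leaving the slow node's rating untouched:

```cpp
// Hedged illustration of the suspected feedback loop; nothing here reflects
// how icecream actually aggregates timings.
#include <cstdio>

int main() {
    // Invented throughput samples, in "output bytes per ms".
    double fast_small_tu = 300.0; // fast machine, ordinary TU
    double fast_huge_tu  =  60.0; // fast machine, huge TU rebuilt after the remote timeout
    double slow_small_tu = 150.0; // slow machine, ordinary TU (the huge TU never finished there)

    // Naive averages over the samples each node actually reports.
    double fast_rating = (fast_small_tu + fast_huge_tu) / 2.0; // 180
    double slow_rating = slow_small_tu;                        // 150, unaffected by the timeout

    std::printf("fast node rating: %.0f\nslow node rating: %.0f\n",
                fast_rating, slow_rating);
    // The fast node's rating is dragged down by the very job the slow node
    // failed to finish, narrowing (or even inverting) the gap between them.
    return 0;
}
```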
If it is not very difficult to implement, it would be great to have this accessible as a command-line argument for the workers.
I would also still be interested in such a solution.
Since a lot of different people use the system with totally different jobs, it is nearly impossible to track down the culprit in a reasonable amount of time.
In our network we have a very heterogeneous PC landscape and an equally heterogeneous mix of compile jobs. On my laptop I am mostly building debug binaries, which are quite large due to template programming, so icecream assigns a very high speed value to my laptop. Other people are compiling optimized binaries, which are much smaller, so their nodes get a lower speed value.
It would be nice if the scheduler had an argument that fixes the speed of a compile node. With this option we could "manually" set the priority order for our nodes.
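For what it's worth, here is a rough sketch of the shape such an override could take on the scheduler side. Everything in it is hypothetical: the NodeInfo struct, effective_speed(), and the "--fixed-speed" flag mentioned in the comment do not exist in icecream today.

```cpp
// Hypothetical sketch only; no such option exists in icecream. The idea: if a
// node was registered with a pinned speed, the scheduler ignores the dynamic
// estimate for that node when ranking hosts.
#include <optional>

struct NodeInfo {
    double measured_speed = 0.0;        // speed inferred from finished jobs
    std::optional<double> fixed_speed;  // set when the admin pins the value
};

// Speed the scheduler would actually use for ranking this node.
static double effective_speed(const NodeInfo& n) {
    return n.fixed_speed ? *n.fixed_speed : n.measured_speed;
}

// E.g. a daemon started with a hypothetical "--fixed-speed 600" argument would
// register itself with fixed_speed = 600 and keep that rank regardless of
// which jobs it happens to receive.
```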