silnrsi / smith

font development, testing and release

parallelizing causing big resource issues: smith spawns too many xetex instances and the build gets killed #43

Closed n7s closed 6 years ago

n7s commented 6 years ago

We have noticed that smith seriously strains the CI container, to the point that the OOM killer is invoked to kill the process. For example, too many instances of xetex are launched in parallel during the tests stage. It looks like the underlying waf maxes out the available cores (over 20 instances, even when the number is reduced in the container configuration), and the build then gets cancelled because it runs out of memory.

It would be useful to have a way to serialize big jobs (e.g. reduce the number of xetex instances being launched together).

n7s commented 6 years ago

Maybe the CI steps could include -j4 as the concurrent jobs parameter, so that it does not try to max out all the available cores?

bobh0303 commented 6 years ago

This sounds like a bug in the CI config, not a smith bug.

n7s commented 6 years ago

With the CI config set to -j1, I still see multiple xetex instances running in parallel, and then the build gets cancelled.

n7s commented 6 years ago

more diagnosis and config tweaks needed

n7s commented 6 years ago

This might be completely unrelated, but I notice that the configure target detects/lists xetex multiple times. Maybe it's the way ret is built in find_program() in waflib/Configure.py?

tim-eves commented 6 years ago

I've set the JOBS environment variable to 4 in the template, so it is automatically applied to all Font CI projects on TC. This is honoured by waf/smith and doesn't depend on setting the right flag in the command scripts for multiple steps. This is confirmed to fix the issue, which was that waf discovers the host's physical number of cores rather than the number of cores the kernel assigns to the guest (on most systems those two are the same unless changed by taskset).

The continued problem nico was seeing after setting -j1 was that he'd missed a custom step, which existed only in the Padauk project rather than in the template, that ran smith pdfs. I've removed that step since smith alltests runs that anyway.
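For anyone wanting to check what a container actually exposes, here is a minimal Python sketch of the distinction described above. It is not waf's or smith's code: the helper name `effective_jobs` is hypothetical, and it only assumes that a positive integer in JOBS should win over any detected core count. Note that `os.sched_getaffinity` is only available on some platforms (Linux among them) and reflects CPU affinity, so a container limited purely by a CPU quota may still report more cores than it can really use.

```python
import os


def effective_jobs(environ=None):
    """Return a job count for a parallel build (illustrative helper, not part of waf).

    Preference order:
      1. a positive integer in the JOBS environment variable,
      2. the number of CPUs this process is allowed to run on,
      3. the total CPU count reported by the OS (at least 1).
    """
    env = os.environ if environ is None else environ

    # An explicit JOBS setting overrides any detection.
    try:
        jobs = int(env.get("JOBS", 0))
    except ValueError:
        jobs = 0
    if jobs >= 1:
        return jobs

    # CPUs assigned to this process; can be smaller than the host's total.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))

    # Fallback: every CPU the OS reports, which may be the whole host.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("CPUs reported by the OS:", os.cpu_count())
    if hasattr(os, "sched_getaffinity"):
        print("CPUs assigned to this process:", len(os.sched_getaffinity(0)))
    print("job count this sketch would use:", effective_jobs())
```

Running this inside the CI container with and without JOBS set should show whether the reported core count matches what the container was configured with.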