Improve widget compilation performance

sghpjuikit commented 5 years ago

As of now, widget compilation time (CT) can be up to 5 seconds for a single widget and up to 1 minute for all widgets!

1 experimental-kotlin-compiler (ekotlinc)

ekotlinc can be obtained the same way as kotlinc, from the Kotlin releases on GitHub. Its CT is consistently about 2.5 times shorter than that of kotlinc. For reference, it is still about 5 times slower than javac, which is 10-15 times faster than kotlinc.

Average CT for a mid-sized widget:

javac: 300ms
ekotlinc: 2s
kotlinc: 5s

Pros

widget compiling is something a dev ends up doing a lot

Cons

widget compiling is not something a user does every day
each OS needs its own instance, which complicates cross-platform use
kotlinc weighs up to 50MB, experimental kotlinc up to 200MB

2 Parallel

If widgets were compiled in parallel, we could cut the total CT on a modern system to a fraction, say 1/4-1/8. Implementing this would be fairly trivial, but besides total CT, single-widget CT is also an issue.
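A minimal sketch of what compiling widgets in parallel could look like, assuming each widget's compilation can be wrapped in a `Callable`. The class and method names here are hypothetical and not taken from the application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs independent compile jobs on a fixed-size pool.
final class ParallelCompile {

    /** Runs every job on at most `threads` workers and returns the results in input order. */
    static <T> List<T> compileAll(List<Callable<T>> jobs, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all jobs have finished (or failed)
            for (Future<T> done : pool.invokeAll(jobs)) {
                results.add(done.get()); // rethrows a job's failure as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size is the only tuning knob in this sketch, which is why the number of threads matters so much for the overall result.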

3 Bundling kotlinc

I think a release should not bundle kotlinc. Instead, the application itself should version-check and download it lazily, when widget compilation is first required. This will make it easier to update the application as well as make the release significantly lighter. On the other hand, a user with no internet connection may end up unable to use the application at all, which raises the question of whether compiled widgets should be bundled in the release.

This would also make it possible to remove the kotlin task from gradle and delegate kotlinc management entirely to the application. Is this a good idea? Probably, because it is better to manage kotlinc in one place instead of two.
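A rough sketch of the version-check-then-download idea, under stated assumptions: `CompilerDownloader` is a hypothetical hook that fetches and unpacks a compiler release, and the `version.txt` marker file is an invented convention, not something kotlinc or the application provides.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical hook: fetches and unpacks the given compiler version into targetDir.
interface CompilerDownloader {
    void download(String version, Path targetDir) throws IOException;
}

// Hypothetical manager that obtains the compiler only when it is first needed.
final class OnDemandCompiler {
    private final Path dir;
    private final String requiredVersion;
    private final CompilerDownloader downloader;

    OnDemandCompiler(Path dir, String requiredVersion, CompilerDownloader downloader) {
        this.dir = dir;
        this.requiredVersion = requiredVersion;
        this.downloader = downloader;
    }

    /** Returns the compiler directory, downloading only if it is missing or has a different version. */
    synchronized Path resolve() throws IOException {
        Path marker = dir.resolve("version.txt"); // assumed marker file written by this class
        boolean upToDate = Files.isRegularFile(marker)
                && Files.readString(marker).trim().equals(requiredVersion);
        if (!upToDate) {
            Files.createDirectories(dir);
            downloader.download(requiredVersion, dir);
            Files.writeString(marker, requiredVersion);
        }
        return dir;
    }
}
```

Nothing is downloaded until `resolve()` is called, so a user who only runs bundled, precompiled widgets would never trigger the download in this design.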

sghpjuikit commented 5 years ago

I did try to implement parallel compilation, but surprisingly, the total CT did not go down as expected. The results were very consistent across multiple runs:

threads: 1 time: 48 s cpu: 0-50%
threads: 2 time: 39 s cpu: 70-90%
threads: 4 time: 36 s cpu: 90-100%
threads: 6 time: 33 s cpu: 95-100%
threads: 8 time: 30 s cpu: 99-100%
threads: 16 time: 30 s cpu: 100%
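To put the measurements above into relative terms, here is a small sketch of the arithmetic. The helper names are made up for illustration; the inputs are the times from the list.

```java
// Hypothetical helper for comparing a parallel run against the 1-thread baseline.
final class Speedup {

    /** Fraction of total compilation time saved, e.g. 0.375 means 37.5% less time. */
    static double reduction(double baselineSeconds, double parallelSeconds) {
        return 1.0 - parallelSeconds / baselineSeconds;
    }

    /** How many times faster the parallel run is than the baseline. */
    static double factor(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }
}
```

For the 8-thread row, `Speedup.reduction(48, 30)` gives 0.375 and `Speedup.factor(48, 30)` gives 1.6, so eight threads bought a 1.6x speedup rather than anything close to 8x.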

Notes

stats: Ryzen 1700 8cores+hyperthreading, NVMe M2 SSD
I disabled widget directory monitoring, just in case it would interfere due to many events
disk (M2 NVMe) usage was negligible (even with 8 cores), <0.5MB/s, so I/O is not a bottleneck

Conclusions

compilation is CPU intensive, and if pushed to limit, CPU is the bottleneck
javac/kotlinc is capable of making use of multiple threads (as many as 16) and thus further parallelization shows unimpressive performance gain
total CT savings are determined not by the number of parallel compilation instances, but by absolute CPU compute power. It is possible that a low/mid-tier CPU (2-4 cores) is sufficiently saturated by a single compilation and shows only small benefits. On the other hand, powerful CPUs can still gain too much for us to ignore the potential.
  1 -> 2 cores showed the most benefit, more than 2 -> 8 in fact
  2 -> 8 cores scaled linearly
  8 -> 16 showed no benefit, as the CPU was already stressed enough at 8
we will definitely use experimental kotlinc

Strategy

We can saturate CPU usage by running multiple compilation instances at once and achieve a 40% decrease in total CT on a modern CPU. However, this comes at the cost of possibly bringing the system (CPU) to its knees. But compiling all widgets is rare, and the user may be expecting (and wishing for) it, so it should not be a problem if we tweak the thread count a bit, ideally keeping CPU utilization in the 90% range.

Thus determining thread count is what this comes down to.

Goal: Achieve almost complete CPU saturation, avoid stressing the CPU more than necessary, and support both current low-tier and future high-tier CPUs (that is, a wide range of core counts).

Considerations: Using Runtime.getRuntime().availableProcessors() as thread count will achieve complete saturation, but this may be overkill. Something like coresAvailable-1 will not work, as a single compilation already stresses all cores. coresAvailable*0.9 may be severely off for low or high core count values.

We can think of core count as a computational resource and estimate total computation needed for compilation (because even if compilation is inherently parallel, the total work done is roughly the same, it merely gets distributed across all cores). Thus we can estimate number of parallel compilations needed by coresAvailable/coresUsed, where coresUsed is sum of all cores' utilizations.
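The coresAvailable/coresUsed estimate could be sketched as follows. The per-core utilization figures would have to be sampled during a single compilation by some external means, which is not shown. Rounding down and clamping to at least 1 are assumptions made for this sketch, since the text does not say how to round.

```java
// Hypothetical estimator for the number of compilations to run at once.
final class ParallelismEstimate {

    /**
     * @param perCoreUtilization utilization of each logical core (0.0 to 1.0)
     *                           observed while ONE compilation is running
     * @return estimated number of compilations that together saturate the CPU
     */
    static int parallelCompilations(double[] perCoreUtilization) {
        int coresAvailable = perCoreUtilization.length;
        double coresUsed = 0;
        for (double u : perCoreUtilization) {
            coresUsed += u; // sum of utilizations = number of "whole cores" kept busy
        }
        if (coresUsed <= 0) {
            return 1; // no usable measurement, stay conservative (assumption)
        }
        // rounding down and clamping to 1 are choices of this sketch
        return Math.max(1, (int) Math.floor(coresAvailable / coresUsed));
    }
}
```

For example, 16 logical cores each at 25% during a single compilation amount to 4 cores' worth of work, which yields an estimate of 4 parallel compilations.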

I propose using ceil(coresAvailable/4.0). Explanation: coresAvailable/2 seems to achieve saturation, so toning it down a bit would be a nice middle ground. ceil() always rounds up, which is important so that for small core counts we do not get extremely low values, like 6 cores -> 1 thread (we get 2 instead). And 4 plays nicely with standard core counts (4, 8, 12, 16), so it is predictable.
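As a sketch, the proposed formula translates to Java like this. The class and method names are hypothetical; only the formula and `Runtime.getRuntime().availableProcessors()` come from the discussion.

```java
// Hypothetical holder for the proposed thread-count rule.
final class CompileThreads {

    /** ceil(coresAvailable / 4.0): number of compilations to run at once. */
    static int forCores(int coresAvailable) {
        return (int) Math.ceil(coresAvailable / 4.0);
    }

    /** Applies the rule to the logical core count reported by the JVM. */
    static int forThisMachine() {
        return forCores(Runtime.getRuntime().availableProcessors());
    }
}
```

With this rule, 4 cores give 1 thread, 6 and 8 cores give 2, 12 cores give 3, and 16 cores give 4.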

I will roll this out. Further tuning may need to be done.

sghpjuikit commented 5 years ago

Note: In retrospect it makes sense that kotlinc is CPU-bound, given how slow it is compared to javac. The IO involved is not so much greater.

xeruf commented 5 years ago

I do not agree with dividing by 4. The huge majority of processors right now have 4 cores, which means that it would only use a single thread. What about dividing by 3?

As a sidenote, it is not that surprising that kotlinc is slower than javac, considering how much magic it is doing under the hood (smart-casting, type-inference, parsing infix notation etc.)

> downloaded by the application itself lazily when widget compilation is required

Lazily? Why should it be downloaded lazily, considering it is certainly needed to use the application?

sghpjuikit commented 5 years ago

A 4-core system is probably utilized enough by a single compilation. I used a 4-core AMD Phenom X4 3.8GHz at work not long ago, and just compiling a project in Idea was enough to get 100% usage, causing lags in playback of this application (when JavaFX was used for playback).

On a 4-core system, using more than 1 thread means using 2 threads, which is equivalent to dividing by 2, which already achieves 100% CPU usage, which is overkill. My system is unexpectedly fine at 100%, but the application UI did start to show significant lags. I do not want to cause system unresponsiveness. And certainly not for 3-6 seconds of background work.

4 is definitely a safe option for core counts 4 and 6, and there is no real need to overwhelm 16-core systems with too many compilations at once. It is good to have spare computation power left, and there is also the fact that logical cores (hyperthreading) are inferior to physical ones (that may explain no gain between 8->16 threads on my system) and Runtime.getRuntime().availableProcessors() reports logical ones. This may leave users with Intel CPUs with disabled hyperthreading underutilizing their system, but that is fine.

One more thing: an unfortunate effect of multiple compilations is unreliable compilation time. If sequential CT was 5 seconds, then in parallel, even if the total CT is better, single-widget CT is longer and the value is no longer representative of the work put into it, which is a real shame. CT scales up with the parallelization, so now it can be 10 or even 20 seconds instead of 5. It is now impossible to compare individual widget CTs between systems. This is another reason I do not want to run many compilations at once: durations get weird, because multiple compilations fight for CPU power, and while this means we get to use as much of it as possible, it also means each individual compilation gets less than it could.
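To make the measurement issue concrete, here is a sketch of how a per-widget CT is typically taken, assuming wall-clock timing around a single compile job. The `Timed` class is hypothetical and not the application's actual measurement code.

```java
import java.util.concurrent.Callable;

// Hypothetical wrapper pairing a job's result with its wall-clock duration.
final class Timed<T> {
    final T result;
    final long millis;

    private Timed(T result, long millis) {
        this.result = result;
        this.millis = millis;
    }

    /** Runs the job and records the elapsed wall-clock time in milliseconds. */
    static <T> Timed<T> measure(Callable<T> job) throws Exception {
        long start = System.nanoTime();
        T result = job.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(result, elapsedMillis);
    }
}
```

A wall-clock figure like this includes any time the job spent competing with other compilations for the CPU, which is why the same widget can report a much longer CT when compiled alongside others.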

sghpjuikit commented 5 years ago

I agree about kotlinc; just look at scalac, which is known for being slow. However, I felt that JetBrains/people advertise Kotlin compilation as very fast (kotlin-vs-java-compilation). And in practice this may be the case, but ours is a difficult case for kotlinc. We do not use gradle or daemons, we use the unoptimized version (2.5x slower), and it cannot warm up. If we used javac the way we use kotlinc, compilation would be slower too. We use the jdk/jre javac via ToolProvider.getSystemJavaCompiler() - it may be heavily optimized and warmed up.
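For context, a minimal sketch of invoking javac in-process through `ToolProvider.getSystemJavaCompiler()`. The wrapper class and its `compile` method are hypothetical; the `javax.tools` calls are standard JDK API. Because the compiler runs inside the already-running JVM, it avoids the process startup that a separately launched compiler pays on every invocation.

```java
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical wrapper around the JDK's in-process Java compiler.
final class InProcessJavac {

    /** Compiles one source file in the current JVM; true when javac exits with code 0. */
    static boolean compile(Path sourceFile, Path outputDir) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            // getSystemJavaCompiler() returns null when no compiler is available
            throw new IllegalStateException("No system Java compiler available");
        }
        // null streams mean System.in, System.out and System.err are used
        int exitCode = javac.run(null, null, null,
                "-d", outputDir.toString(), sourceFile.toString());
        return exitCode == 0;
    }
}
```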

In the link:

> However, no matter what language you use, the Gradle daemon will reduce build times by over 40%.

It may be worth looking at how IntelliJ IDEA uses kotlinc and into the specifics of SystemJavaCompiler.

> downloaded by the application itself lazily when widget compilation is required

I meant on demand. The compiler may not be needed at all, provided we bundle compiled widgets, which I think we should. The compiler may still be needed for an application update.

> I did try to implement parallel compilation

I was going to add ekotlinc to the application, but the runtime knows nothing about Kotlin or its version, so it will be necessary to somehow pass this information down to the application in the gradle build task. I think I will just add a gradle.properties flag to use ekotlinc instead and deal with this at build time. For now.

sghpjuikit commented 5 years ago

Update: I have added an ekotlinc option to the build script. Only one compiler is installed. The kotlinc directory is updated when the developer changes the property.

Test: compiling all widgets, cores: 16

kotlinc 4 threads: 35s, 80-100% CPU
ekotlinc 1 thread 1st run: 16s, 10-30% CPU
ekotlinc 4 threads 1st run: 10s, 40-90% CPU
ekotlinc 4 threads 2nd run: 8s, 20-60% CPU
ekotlinc 6 threads 1st run: 9s, 40-80% CPU
ekotlinc 6 threads 2nd run: 6s, 60-90% CPU
ekotlinc 8 threads 1st run: 8s, 99% CPU
ekotlinc 8 threads 2nd run: 6s, 99% CPU

Observations:

Once again, at 99% CPU the application was laggy.
using 1 thread, ekotlinc is roughly 3 times faster (and some widgets still use fast javac!), as expected
using 8 threads, ekotlinc is 5 times faster
ekotlinc on 4 threads makes individual CT only 50% longer and helps with the problem of unreliable time measurement

Conclusion: Not only does ekotlinc take less time to compile, it also achieves this with less total CPU load, which means a greater parallelization benefit: 30% for kotlinc and 50% for ekotlinc.

What now: 50% parallelization benefit is huge so we will keep using that, with ceil(coresAvailable/4.0) strategy, which works regardless of the compiler used. Ekotlinc turns out to be of tremendous value for us and it will stay. In the future I'd like for the application to manage the compiler, but that should be handled by another task on the way to release.

sghpjuikit commented 5 years ago

Implemented in 56d11ce

> majority of processors right now have 4 cores, which means that it would only use a single thread. What about dividing by 3

I want to point out that compilation will internally use all the threads available, it is just that with many threads it will not be able to use them fully, so multiple compilations still improve utilization. Using 1 (application) thread for compilation on a 4 core system simply means that if multiple widgets are to be compiled, they will do so one at a time. On a 4 core system this will achieve satisfactory cpu usage.

sghpjuikit / player

Improve widget compilation performance #136