sghpjuikit / player

Audio player and management application.
Improve widget compilation performance #136

Closed sghpjuikit closed 5 years ago

sghpjuikit commented 5 years ago

As of now, widget compilation time (CT) can be up to 5 seconds and for all widgets up to 1 minute!

1 experimental-kotlin-compiler (ekotlinc)

ekotlinc can be obtained the same way as kotlinc, from the Kotlin releases on GitHub. Its CT is consistently about 2.5 times smaller than kotlinc's. For reference, it is still about 5 times slower than javac, which is 10-15 times faster than kotlinc.

Average CT for a mid-sized widget:
javac: 300ms
ekotlinc: 2s
kotlinc: 5s

Pros

2 Parallel

If widgets were compiled in parallel, we could cut the total CT on a modern system to a fraction, say 1/4-1/8. Implementing this would be fairly trivial, but besides total CT, single-widget CT is also an issue.
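A minimal sketch of what compiling widgets in parallel could look like, assuming a fixed-size thread pool. The class name ParallelCompilation, the compileAll helper and the compileOne callback are hypothetical and only stand in for whatever the application actually uses to compile a single widget.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelCompilation {

    /**
     * Runs compileOne for every widget on a pool of `threads` workers and
     * returns the results in the same order as the input list.
     * compileOne is a hypothetical stand-in for the real widget compiler call.
     */
    static <W, R> List<R> compileAll(List<W> widgets, int threads, Function<W, R> compileOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (W widget : widgets) {
                Callable<R> task = () -> compileOne.apply(widget);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until this widget is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while compiling", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Widget compilation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With threads set to 1 this degrades to compiling one widget at a time, so the thread count is the only knob that needs tuning.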

3 Bundling kotlinc

I think a release should not bundle kotlinc. It should instead be version-checked and downloaded by the application itself lazily when widget compilation is required. This will make it easier to update the application as well as make the release significantly lighter. On the other hand, a user with no internet connection may end up being unable to even use the application, which raises the question of whether compiled widgets should be bundled in the release.

This would also make it possible to remove the kotlin task from gradle and delegate kotlinc management entirely to the application. Is this a good idea? Probably, because it's better to manage kotlinc in one place instead of two.

sghpjuikit commented 5 years ago

I did try to implement parallel compilation, but surprisingly, the total CT did not go down as expected. The results were very consistent across multiple runs:

Notes

Conclusions

Strategy

Thus determining thread count is what this comes down to.

Goal: Achieve almost complete CPU saturation, avoid stressing the CPU more than necessary, and support current low-tier and future high-tier CPUs (that is, a wide range of core counts).

Considerations: Using Runtime.getRuntime().availableProcessors() as the thread count will achieve complete saturation, but this may be overkill. Something like coresAvailable-1 will not work, as a single compilation already stresses all cores. coresAvailable*0.9 may be severely off for low or high core count values.

We can think of core count as a computational resource and estimate total computation needed for compilation (because even if compilation is inherently parallel, the total work done is roughly the same, it merely gets distributed across all cores). Thus we can estimate number of parallel compilations needed by coresAvailable/coresUsed, where coresUsed is sum of all cores' utilizations.

I propose using ceil(coresAvailable/4.0). Explanation: coresAvailable/2 seems to achieve saturation, so toning it down a bit would be a nice middle ground. Ceil() will always round up, which is important so that for small core counts we do not get extremely low values, like 6 cores -> 1 thread (we get 2 instead). And 4 plays nicely with standard core counts (4, 8, 12, 16), so it is predictable.
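For illustration, the proposed formula written out in Java. The class and method names (CompilerThreads, forCores, forThisMachine) are made up for this sketch; only the ceil(coresAvailable/4.0) expression and Runtime.getRuntime().availableProcessors() come from the discussion.

```java
public class CompilerThreads {

    /** ceil(coresAvailable / 4.0): number of compilations to run at once. */
    static int forCores(int coresAvailable) {
        return (int) Math.ceil(coresAvailable / 4.0);
    }

    /** Same strategy applied to the logical core count the JVM reports. */
    static int forThisMachine() {
        return forCores(Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        // Prints the thread count for some common core counts.
        for (int cores : new int[] {2, 4, 6, 8, 12, 16}) {
            System.out.println(cores + " cores -> " + forCores(cores) + " thread(s)");
        }
    }
}
```

For 4, 8, 12 and 16 cores this gives 1, 2, 3 and 4 threads, and 6 cores round up to 2 rather than down to 1.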

I will roll this out. Further tuning may need to be done.

sghpjuikit commented 5 years ago

Note: In retrospect it makes sense that kotlinc is CPU-bound, given how slow it is compared to javac. The IO involved is not so much greater.

xeruf commented 5 years ago

I do not agree with dividing by 4. The huge majority of processors right now have 4 cores, which means that it would only use a single thread. What about dividing by 3?

As a sidenote, it is not that surprising that kotlinc is slower than javac, considering how much magic it is doing under the hood (smart-casting, type-inference, parsing infix notation etc.)

"downloaded by the application itself lazily when widget compilation is required"

lazily? Why should it be downloaded lazily considering it is certainly needed to use the application?

sghpjuikit commented 5 years ago

A 4-core system is probably utilized enough by a single compilation. I used a 4-core AMD Phenom X4 3.8GHz at work not long ago, and just compiling a project in Idea was enough to get 100% usage, causing lags in playback of this application (when JavaFX was used for playback).

On a 4-core system, using more than 1 thread means using 2 threads, which means dividing by 2, which already achieves 100% CPU usage, which is overkill. My system is unexpectedly fine at 100%, but the application UI did start to show significant lags. I do not want to cause system unresponsiveness, and certainly not for 3-6 seconds of background work.

4 is definitely a safe option for core counts of 4 and 6, and there is no real need to overwhelm 16-core systems with too many compilations at once. It is good to have spare computation power left. There is also the fact that logical cores (hyperthreading) are inferior to physical ones (that may explain no gain between 8->16 threads on my system), and Runtime.getRuntime().availableProcessors() reports logical ones. This may leave users of Intel CPUs with hyperthreading disabled underutilizing their system, but that is fine.

One more thing: an unfortunate effect of multiple compilations is an unreliable compilation time. If sequential CT was 5 seconds, then in parallel, even if the total CT is better, single-widget CT is longer and the value is no longer representative of the work put into it, which is a real shame. CT scales up with the parallelization, so now it can be 10 or even 20 seconds instead of 5. It is now impossible to compare individual widget CTs between systems. This is another reason I do not want to push many compilations at once: the time duration gets weird, because multiple compilations fight for CPU power, and while this means we get to use as much of it as possible, it also means each individual compilation gets less than it could.

sghpjuikit commented 5 years ago

I agree about kotlinc; just look at scalac, which is known for being slow. However, I felt that JetBrains and others advertise Kotlin compilation as very fast (see kotlin-vs-java-compilation). In practice this may be the case, but ours is a difficult case for kotlinc. We do not use gradle or daemons, we use the unoptimized version (2.5x slower), and it cannot warm up. If we used javac the way we use kotlinc, compilation would be slower too. We use the jdk/jre javac via ToolProvider.getSystemJavaCompiler(), which may be heavily optimized and warmed up.
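To make the javac side concrete, here is a rough sketch of compiling in-process through ToolProvider.getSystemJavaCompiler(), so the compiler runs inside the already warmed-up JVM instead of a fresh process. The class name InProcessJavac and the temp-directory handling are assumptions for this example, not the application's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class InProcessJavac {

    /**
     * Writes `source` to a temporary directory and compiles it inside the
     * current JVM. Returns true if javac exits with 0 and the class file exists.
     */
    static boolean compile(String className, String source) {
        // Returns null when no compiler is available (e.g. running on a plain JRE).
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) return false;
        try {
            Path dir = Files.createTempDirectory("widget-src");
            Path file = dir.resolve(className + ".java");
            Files.writeString(file, source);
            // null streams mean System.in / System.out / System.err are used.
            int exitCode = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            return exitCode == 0 && Files.exists(dir.resolve(className + ".class"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the call stays in the same JVM, repeated compilations benefit from JIT warm-up, which a freshly started kotlinc process does not get.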

In the link:

However, no matter what language you use, the Gradle daemon will reduce build times by over 40%.

It may be worth looking at how Intellij Idea uses kotlinc and into specifics of SystemJavaCompiler.

"downloaded by the application itself lazily when widget compilation is required"

I meant on demand. The compiler may not be needed at all, provided we bundle compiled widgets, which I think we should. The compiler may be needed for an application update.

"I did try to implement parallel compilation"

I was going to add ekotlinc to the application, but the runtime has no idea about Kotlin or its version, so it will be necessary to somehow pass this information down to the application in the gradle build task. For now, I think I will just add a gradle.properties flag to use ekotlinc instead and deal with this at build time.

sghpjuikit commented 5 years ago

Update: I have added an ekotlinc option to the build script. Only one compiler is installed. The kotlinc directory is updated when the developer changes the property.

Test: compiling all widgets, cores: 16
kotlinc, 4 threads: 35s, 80-100% CPU
ekotlinc, 1 thread, 1st run: 16s, 10-30% CPU
ekotlinc, 4 threads, 1st run: 10s, 40-90% CPU
ekotlinc, 4 threads, 2nd run: 8s, 20-60% CPU
ekotlinc, 6 threads, 1st run: 9s, 40-80% CPU
ekotlinc, 6 threads, 2nd run: 6s, 60-90% CPU
ekotlinc, 8 threads, 1st run: 8s, 99% CPU
ekotlinc, 8 threads, 2nd run: 6s, 99% CPU

Observations:

Conclusion: Not only does ekotlinc take less time to compile, it also achieves this with less total CPU load, which means a greater parallelization benefit: 30% for kotlinc and 50% for ekotlinc.
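As a rough check of the 50% figure, assuming "parallelization benefit" means the relative reduction in total CT compared to the single-thread run. The helper name benefit is hypothetical. Only the ekotlinc numbers from the test above are used, since no single-thread kotlinc time is listed.

```java
public class ParallelBenefit {

    /** Relative reduction in total CT: 1 - parallel / sequential. */
    static double benefit(double sequentialSeconds, double parallelSeconds) {
        return 1.0 - parallelSeconds / sequentialSeconds;
    }

    public static void main(String[] args) {
        // ekotlinc, 1 thread: 16s; 4 threads, 2nd run: 8s
        System.out.println(benefit(16, 8));  // 0.5
        // ekotlinc, 1 thread: 16s; 4 threads, 1st run: 10s
        System.out.println(benefit(16, 10)); // 0.375
    }
}
```

Under this reading the 50% matches the 16s to 8s drop, while the first 4-thread run alone would give 37.5%.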

What now: 50% parallelization benefit is huge so we will keep using that, with ceil(coresAvailable/4.0) strategy, which works regardless of the compiler used. Ekotlinc turns out to be of tremendous value for us and it will stay. In the future I'd like for the application to manage the compiler, but that should be handled by another task on the way to release.

sghpjuikit commented 5 years ago

Implemented in 56d11ce

"majority of processors right now have 4 cores, which means that it would only use a single thread. What about dividing by 3"

I want to point out that a compilation will internally use all the threads available; it is just that with many threads it will not be able to use them fully, so multiple compilations still improve utilization. Using 1 (application) thread for compilation on a 4-core system simply means that if multiple widgets are to be compiled, they will be compiled one at a time. On a 4-core system this will achieve satisfactory CPU usage.