kelemen / netbeans-gradle-project

This project is a NetBeans plugin able to open Gradle based Java projects. The implementation is based on Geertjan Wielenga's plugin.

Huge size of .nb-gradle\private\cache\ folder #258

Open xtracoder opened 8 years ago

xtracoder commented 8 years ago

Size of cache folder for my project is 200MB. This may be OK in general, but some things are looking strange:

  1. The project contains 350 Gradle modules, while the cache contains 316 files. Not a 1:1 match, but very close.
  2. The size of each file in the cache folder is in the range 630 KB..695 KB. 99 files have the identical size of 634731 bytes, yet they are not binary equal. I have no idea what is inside, but it seems very repetitive information is stored across them.
  3. The source files are 61 MB in total, so the cache is about 3.5 times the size of the sources. It seems to me something is wrong there, most probably causing performance issues (though I have no obvious complaints about performance).
kelemen commented 8 years ago

If the size itself is not a problem for you, then I wouldn't worry.

Each file stores the model of the associated subproject, so that if you restart NB and open that project, the plugin can quickly load them without parsing the build script (it will still schedule a model update in the background).

  1. To be honest, I'm not sure why there are only 316 instead of 350.
  2. They are the complete parsed model (as the plugin views it) serialized to a file. I'm guessing that it is possible to make them smaller (even the in-memory size). I think a large part of the size comes from the dependency lists, because currently the plugin stores two separate sets for each source set (compile / runtime). Since these sets usually have a large common part, I assume I could maintain a common set of dependencies for each project.
  3. The size of the source files does not correlate (or only very weakly) with the size of the models, because those files only store the model parsed from the Gradle script, regardless of how many or how large your files are. Also, the size of the cache does not directly affect performance, since the files are simply loaded once per project.
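The per-project model persistence described above can be sketched roughly as follows. This is an illustrative stand-in, not the plugin's actual code: `ProjectModel`, its fields, and the file naming are all simplified assumptions.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Simplified stand-in for the plugin's per-project model cache: each
// project's parsed model is serialized to its own file (as under
// .nb-gradle/private/cache) so a restarted IDE can restore it quickly
// without re-running the Gradle model build.
public class ModelCacheSketch {
    // Hypothetical, minimal project model; the real model is far larger.
    static class ProjectModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final List<String> compileDeps;
        final List<String> runtimeDeps;
        ProjectModel(String name, List<String> compileDeps, List<String> runtimeDeps) {
            this.name = name;
            this.compileDeps = compileDeps;
            this.runtimeDeps = runtimeDeps;
        }
    }

    // Serialize one project's model to its own cache file.
    static void save(Path cacheDir, ProjectModel model) throws IOException {
        Files.createDirectories(cacheDir);
        try (ObjectOutputStream out = new ObjectOutputStream(
                Files.newOutputStream(cacheDir.resolve(model.name + ".model")))) {
            out.writeObject(model);
        }
    }

    // Load a previously cached model without touching Gradle at all.
    static ProjectModel load(Path cacheDir, String name)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(cacheDir.resolve(name + ".model")))) {
            return (ProjectModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("nb-gradle-cache");
        save(dir, new ProjectModel("moduleA",
                List.of("guava.jar"), List.of("guava.jar", "slf4j.jar")));
        ProjectModel loaded = load(dir, "moduleA");
        System.out.println(loaded.name + " " + loaded.compileDeps);
    }
}
```

Because each file independently serializes its full dependency lists, repeated dependencies across 300+ modules are written out again and again, which matches the observation that the cache files are large and highly repetitive.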
JustGregory-zz commented 8 years ago

I usually have no problem deleting the dotted folders such as .nb-gradle every few days or weeks, depending on the project. Since they are caches that can be rebuilt, removing them completely loses nothing.

That said, could the NetBeans Gradle caches live under the user's temp directory instead? Since they shouldn't be committed to any repository and are only relevant to the active model state, shouldn't they be more freely deletable?

xtracoder commented 8 years ago

Size on disk is not a problem, but an excessive amount of data being generated and processed can be a hidden one. I may be completely wrong about 'excessive'; I'm just flagging it.

By points:

  1. Some projects are 'parent aggregates' and do not apply the 'java' plugin; probably that is why they have no cache file.
  2. "large part of the size comes from the dependency lists" - I can't verify this, but it probably is the case. Beyond that, I can say that most of my projects share a significant set of the same dependencies, so it may be that 90% of the data is simply duplicated 316 times. Reorganizing the cache from per-project to per-root-project (including all dependencies) may make a significant difference in both disk and memory consumption.
  3. The point about 'size of sources' vs. 'size of cache' is of course not very relevant; it was just to show the difference, which may be no cause for concern.
kelemen commented 8 years ago

Projects are rather independent in NB, so I cannot easily share data among them. I could do that through some global cache, but I don't think reducing the memory footprint is worth the adverse effects (many lookups would be slower). Still, sharing within a single project is reasonable, and that would reduce the memory footprint considerably. That is, for many projects there are four sets:

For the majority of projects, main.compile contains a very large percentage of all dependencies, so this would roughly reduce the memory footprint to a quarter (not exactly, because the plugin maintains some other lists for ClassPath objects). I will look into this (the memory footprint; I'm less concerned with the cache file size).
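The sharing idea above can be sketched as follows: store one common set per project (the intersection of all source-set dependency sets) plus a small delta per source set. This is a hypothetical illustration, not the plugin's implementation; all names here are made up.

```java
import java.util.*;

// Sketch of the dedup idea: instead of a full dependency set per source
// set, keep one shared base set per project plus a small per-source-set
// delta. With main.compile dominating, this roughly quarters the footprint.
public class SharedDepsSketch {
    final Set<String> common;                                 // shared across source sets
    final Map<String, Set<String>> extras = new HashMap<>();  // per-source-set delta

    SharedDepsSketch(Map<String, Set<String>> perSourceSet) {
        // The common part is the intersection of all source-set dependency sets.
        Iterator<Set<String>> it = perSourceSet.values().iterator();
        Set<String> acc = new HashSet<>(it.next());
        while (it.hasNext()) acc.retainAll(it.next());
        this.common = Collections.unmodifiableSet(acc);
        // Each source set only stores what it adds beyond the common part.
        for (Map.Entry<String, Set<String>> e : perSourceSet.entrySet()) {
            Set<String> extra = new HashSet<>(e.getValue());
            extra.removeAll(common);
            extras.put(e.getKey(), extra);
        }
    }

    // Reconstruct the full dependency set for one source set on demand.
    Set<String> depsOf(String sourceSet) {
        Set<String> all = new HashSet<>(common);
        all.addAll(extras.getOrDefault(sourceSet, Set.of()));
        return all;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> sets = Map.of(
            "main.compile", Set.of("guava.jar", "slf4j.jar"),
            "main.runtime", Set.of("guava.jar", "slf4j.jar", "logback.jar"));
        SharedDepsSketch s = new SharedDepsSketch(sets);
        System.out.println(s.common.size() + " shared, runtime extras: "
            + s.extras.get("main.runtime"));
        // → 2 shared, runtime extras: [logback.jar]
    }
}
```

The trade-off is exactly the one mentioned: reconstructing the full set on each classpath lookup costs a copy, so a real implementation would likely cache the merged view.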

xtracoder commented 7 years ago

I've come back to the project with that huge number of Gradle sub-modules, and I have a reason to raise this issue once again. I opened the root module plus around 10 sub-modules, yet the Gradle plugin started loading all the other (let's say hidden) modules anyway. The full cycle finished in around 2 hours :(

It seems that serializing/deserializing the project model is a rather memory-intensive operation: even though 'average' memory usage is far from 100%, GC takes a very long time. It looks like 80% of loading an individual module is GC, so might it make sense to use another serialization format? As far as I understand, to make a project workable in NB the only data needed is the 'source sets' (a few folder paths) and the dependencies (at most 200 paths, which is at most 50 KB of memory even for very long paths). From the screenshot below I see that loading a single project consumes around 250 MB (5000 times more than 50 KB); huge pressure on the GC is no surprise under such conditions.

(screenshot: GC self-profiler)

xtracoder commented 7 years ago

Hint: a 'clean rebuild' of all these sources takes around 4 minutes (with org.gradle.parallel=true). Loading the model of the entire subtree should therefore take around 10-15 seconds, and that is the time I would consider normal for loading the project in NetBeans.

And one more screenshot: a heap dump created at some moment while loading projects, then opened in Memory Analyzer. (screenshot: heap dump)

kelemen commented 7 years ago

Thanks for these details.

For overall speed: I can't quite explain your case. In my experience, project loading time is heavily dominated by configuration resolution, which is not done within the NB process. However, as I have read, Gradleware claims to have improved it a lot in 3.2-rc-1, so you might want to give that a shot.

For memory consumption: this is a tough issue; the latest release should already have decreased memory consumption considerably. However, currently each project has its own dependency list for each source set for quick access (even worse, there are separate compile and runtime sets). This is necessary to be able to quickly provide the classpath for source files. There are two things I think it is possible to do:

I will try to check these things tomorrow but can't promise anything because I'm feeling sick today.

kelemen commented 7 years ago

There is another thing to consider and see if it improves your situation considerably:

xtracoder commented 7 years ago

Re: "I will try to check these things tomorrow but can't promise anything because I'm feeling sick today" - it is not that urgent :).

I've checked your comments and need to adjust my view of the problem.

Actually, after looking at this once again, I suspect that the handling of multi-module Gradle projects is somewhat wrong. As far as I can see from the actions performed, the Gradle plugin handles each sub-module individually, i.e., if sub-module A has dependencies on sub-modules B, C, and D, it will load them once again individually, introspect them, resolve transitive dependencies, and so on (handling each sub-module appears to take around 1-2 minutes). But from Gradle's perspective, all sub-modules are actually one huge model starting from the root folder with settings.gradle, i.e., loading the 'root' should be enough to get the object model for all child projects in one place.

PS: on top of all that, Gradle likes to die with the following exception; I couldn't find the reason for it yet:

Caused by: java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at org.gradle.internal.concurrent.StoppableExecutorImpl.execute(StoppableExecutorImpl.java:36)
    at org.gradle.launcher.daemon.bootstrap.DaemonOutputConsumer.start(DaemonOutputConsumer.java:82)
xtracoder commented 7 years ago

PS: making the above-mentioned change in JavaExtensionDef and using -J-XX:+UseG1GC -J-XX:+UseStringDeduplication did not make any noticeable difference for me.

kelemen commented 7 years ago

While it looks like all projects are loaded independently, that is not exactly the case. What actually happens is that after the first subproject of a multi-project build gets loaded, the plugin fills the in-memory cache with all the other subprojects of the same multi-project build, so subsequent project loads should be instantaneous. If this is not the case for you, the project cache is likely too small for you (you can increase it in the global settings: Misc./Gradle/Others).
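The behavior of a bounded, least-recently-used project cache like this can be modeled with a few lines of standard Java. This is only a rough model of the idea, not the plugin's actual cache class; the class name and sizes are illustrative.

```java
import java.util.*;

// Rough model of an LRU project-model cache with a configurable maximum
// size (like the limit exposed in the plugin's global settings).
public class LruProjectCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public LruProjectCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used model once the limit is exceeded.
        // With ~400 modules and a limit of 100, models are evicted and
        // re-resolved constantly, which is the 2-hour pathology reported here.
        return size() > maxSize;
    }

    public static void main(String[] args) {
        LruProjectCache<String, String> cache = new LruProjectCache<>(2);
        cache.put("moduleA", "modelA");
        cache.put("moduleB", "modelB");
        cache.get("moduleA");           // touch A, so B becomes the eldest entry
        cache.put("moduleC", "modelC"); // exceeds the limit: evicts B
        System.out.println(cache.keySet()); // → [moduleA, moduleC]
    }
}
```

The key property: evictions are driven purely by access recency and the size limit, not by whether a project is still open, which explains why a limit below the module count causes constant reloading.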

To be honest, I would expect SerializationUtils2.serializeToFile to be IO bound. That said, as it is now, it blocks project loading anyway. So, it might be reasonable to persist project models asynchronously on a separate thread; that would improve the perceived performance. If you are feeling adventurous, you can try updating ProjectModelPersister to move model serialization to a separate thread (e.g., using a new static executor created by NbTaskExecutors.newDefaultFifoExecutor(), or a whole new single-threaded one). Delaying the actual saving of the model should not have any adverse effects.

In the meantime, I will try to experiment with File object pooling.
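The asynchronous-persist suggestion can be sketched with a plain single-threaded executor. This is a standalone illustration under stated assumptions: it uses a JDK executor rather than the plugin's NbTaskExecutors utilities, and the class and method names are invented.

```java
import java.io.*;
import java.nio.file.*;
import java.util.concurrent.*;

// Sketch: hand model serialization to a dedicated single-threaded executor
// so the project-loading thread is never blocked on disk IO. A FIFO
// executor keeps saves ordered, so a later save of the same model wins.
public class AsyncPersistSketch {
    private static final ExecutorService PERSISTER =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "project-model-persister");
            t.setDaemon(true); // pending saves must not keep the IDE alive
            return t;
        });

    // Queue the serialization; the caller returns immediately.
    static Future<?> persistAsync(Path target, Serializable model) {
        return PERSISTER.submit(() -> {
            try (ObjectOutputStream out = new ObjectOutputStream(
                    Files.newOutputStream(target))) {
                out.writeObject(model);
            }
            return null;
        });
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("model", ".bin");
        // Project loading can continue immediately; the save happens later.
        Future<?> save = persistAsync(file, "pretend this is a project model");
        save.get(); // only the demo waits; the plugin would not block on this
        System.out.println("persisted " + Files.size(file) + " bytes");
        PERSISTER.shutdown();
    }
}
```

The one subtlety is shutdown: a real implementation should either use a daemon thread (as here) and accept losing a save on exit, or flush the queue when the IDE closes.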

kelemen commented 7 years ago

I have created a _file_referencecache branch where I try to share equivalent File references. For me, it reduced the memory consumption considerably (and even decreased the size of the persistent project model cache). Can you try what effect it has on your large project?
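The core of such a reference-sharing cache is an intern pool: equal File objects for the same dependency path appear in hundreds of project models, so resolving each to one canonical instance saves memory. This is a minimal sketch of the technique, not the code on the _file_referencecache branch.

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Minimal File intern pool: every equal File resolves to one shared
// instance, so hundreds of project models referencing the same jar path
// hold a single object instead of hundreds of copies.
public class FileInternSketch {
    private static final ConcurrentMap<File, File> CACHE = new ConcurrentHashMap<>();

    static File intern(File file) {
        // putIfAbsent is atomic: the first caller's instance becomes canonical.
        File prev = CACHE.putIfAbsent(file, file);
        return prev != null ? prev : file;
    }

    public static void main(String[] args) {
        File a = new File("/repo/libs/guava.jar");
        File b = new File("/repo/libs/guava.jar");
        System.out.println(a == b);                 // false: distinct objects
        System.out.println(intern(a) == intern(b)); // true: one shared instance
    }
}
```

A production version would need to bound the map or use weak references so the pool itself cannot leak; the unbounded ConcurrentHashMap here is only for illustration.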

kelemen commented 7 years ago

I have added delayed project model persisting to both master and _file_referencecache. This should improve perceived performance.

xtracoder commented 7 years ago

OK, it seems the root cause of the problem for me was the 'relatively small' project cache you mentioned above. With the default limit of 100, the projects load in 1 hour 14 minutes (using code from file_reference_cache; see details below). After setting the cache size to 500, the load time decreased to 1 min 50 sec, and NetBeans shows memory usage of 300 MB, which is absolutely fine for me from both the 'time' and 'memory usage' perspectives. 36 modules are actually opened simultaneously, out of ~400 modules from 3 different Gradle roots.

So it seems the default cache size of 100 is 'not good', and probably there should be no limit at all: memory usage is acceptable, and anyone opening a very large multi-module project should think about allocating more memory to the process rather than waiting 2 hours while non-cached data goes back and forth. Anyway, now that I know about the cache size, I will go and increase it; had I not known the cache was too small, I would have just sat waiting while all this stuff loaded.

Now - about performance with cache size of 100.

I've added logging to a file (NB's log is limited in size) of the operations displayed via progress.progress(...), and also of serializeToFile (I forgot to add the same for deserializeFile, so there are no metrics for it). With a cache size of 100 I got:

Same for cache size = 500:

So it seems that on 'cache misses' a huge amount of 'rework' is performed. I'm not sure: shouldn't the model just be reloaded from the file cache?

kelemen commented 7 years ago

Removing the maximum size of the cache would cause a memory leak, because the cache could grow endlessly (the only limit being how many projects you have on disk). So, I would not do that.

Using the file cache would be reasonable, but in the current implementation there is no way to know how up to date the file cache is. Its current purpose is to let you use a project immediately after starting NB, without having to wait for the project load.

However, I think there is a reasonable solution (workaround) for your problem. Actually, the plugin knows that you will have a problem; it just ignores it and lets you suffer :). That is, in the introduceModel method of DefaultGradleModelLoader, I could increase the cache size if it is too small and even report the issue in a balloon.
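The workaround amounts to a small sizing rule applied when a multi-project build is introduced into the cache. The sketch below is purely illustrative: the method name, the headroom factor, and the wiring into introduceModel are assumptions, not the plugin's actual logic.

```java
// Sketch of the auto-grow workaround: when a multi-project build is
// introduced, grow the cache limit to fit its module count (plus some
// headroom) and let the UI report the adjustment to the user.
public class AutoGrowSketch {
    // Returns the cache limit to use after seeing a build with the given
    // number of projects; never shrinks an already larger limit.
    static int adjustedCacheSize(int currentLimit, int projectsInBuild) {
        int needed = projectsInBuild + projectsInBuild / 10; // 10% headroom (arbitrary)
        return Math.max(currentLimit, needed);
    }

    public static void main(String[] args) {
        // A 400-module build against the default limit of 100 grows the limit.
        System.out.println(adjustedCacheSize(100, 400)); // → 440
        // A small build leaves a manually raised limit untouched.
        System.out.println(adjustedCacheSize(500, 50));  // → 500
    }
}
```

As noted later in the thread, a per-build rule like this still cannot account for several multi-project builds being open at once; that needs the sum over all loaded builds.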

xtracoder commented 7 years ago

I'm not sure why you say 'grow endlessly': if I open 3 roots with 500 modules, those 500 are the natural limit, and new projects will not appear from nowhere. With manual configuration, I would just always go and set the limit to whatever corresponds to the actual maximum number of modules; I see no point in doing that by hand.

I would expect that after opening another project group with another set of projects, the old modules would be unloaded from the in-memory cache automatically.

kelemen commented 7 years ago

They will not be removed just because you closed a project. The least recently used models are removed from the cache when it reaches its limit. NetBeans might ask the plugin to load a project for many reasons, and I can't tell whether a project will no longer be needed; that is, if it was requested, you might need it again soon.

Regardless, I have added some code to automatically increase the cache size if you load a huge multi-project build, and to report this to you. That said, this won't solve the issue when you have multiple multi-project builds loaded; detecting such an issue would need more complicated logic.

kelemen commented 7 years ago

The plugin should now be able to increase the cache size based on the currently opened projects (and it also reports when it has to increase the size). This should fix the problem for people unaware of the cache and its effects.

xtracoder commented 7 years ago

I've got time to check this: "should now be able to increase cache size based on the currently opened projects".

Behavior with manually configured size = 500:

  1. Within one minute, the number of 'pending' projects grows to 43.
  2. At this number (43), projects are analyzed for quite a while without any change in the number of pending projects.
  3. After one more minute (or so), the pending projects disappear altogether (i.e., everything gets loaded). Total time: less than 2 minutes.

Behavior with default size = 100: steps 1-2 are the same.

  3. After one more minute (or so), the number of pending projects quickly grows to 200+, and after 5 minutes of waiting I stopped the experiment, because it looked like the behavior had not changed and the process would finish in around 2 hours.

So ... it seems it does not work.

kelemen commented 7 years ago

How many pending projects you have is not relevant. What I would be curious about is whether, had you waited for completion, it would have increased the project cache size to something acceptable.

That said, I don't expect this solution to be exactly as efficient as setting the cache size to a good value manually. My expectation is that in the case of your 3 large multi-project builds there will be 5 complete project loads (maybe a few more if you are unlucky). After that, however, the cache size should have been increased to a size where it can load your projects without a problem. That said, I still recommend adjusting the cache size manually.

This patch has two purposes: