opalj / opal

https://www.opal-project.de

Performance of eager computations in Project initialization #221

Open johannesduesing opened 1 week ago

johannesduesing commented 1 week ago

Problem Statement

As discussed in our recent OPAL meeting, we want to understand which operations are performed eagerly when initializing a Project instance, and how each of them affects overall performance. I had a first look and identified the following relevant operations:

O1 runs concurrently with O2 and O3 and is awaited after O3 completes. O4 and O5 run concurrently while the main thread performs some array manipulations; both are awaited when the actual Project instance is created, which is also when O7 is triggered. O6 runs after the instantiation has completed; then the Project instance is returned.
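To make the schedule described above easier to follow, here is a minimal, language-neutral sketch in Python. It is not OPAL code: `init_project`, `timed`, and the `ops` dictionary are hypothetical names, the operations are placeholder callables, and running O2 before O3 on the main thread is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(name, fn, timings):
    """Run fn() and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    result = fn()
    timings[name] = (time.perf_counter() - start) * 1000.0
    return result


def init_project(ops, timings):
    """ops maps 'O1'..'O7' to zero-argument placeholder callables (hypothetical)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # O1 runs in the background while O2 and O3 run on the main thread.
        o1 = pool.submit(timed, "O1", ops["O1"], timings)
        timed("O2", ops["O2"], timings)  # assumption: O2 precedes O3
        timed("O3", ops["O3"], timings)
        o1.result()  # O1 is awaited after O3 completes

        # O4 and O5 run concurrently; the main thread would do its
        # array manipulations between submit and result.
        o4 = pool.submit(timed, "O4", ops["O4"], timings)
        o5 = pool.submit(timed, "O5", ops["O5"], timings)
        o4.result()
        o5.result()

        # Creating the instance triggers O7.
        timed("O7", ops["O7"], timings)

    # O6 runs after instantiation, just before the instance is returned.
    timed("O6", ops["O6"], timings)
    return timings
```

One consequence of this shape is that the wall-clock cost of a stage is bounded by its slowest concurrent operation, not by the sum of the individual times.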

Empirical Evaluation

I implemented a small patch to OPAL that extracts the runtime of the operations mentioned above. Based on that, I wrote an analysis that iterates over Maven Central and does the following:

  1. Locate project JAR based on GAV and open a stream for download
  2. Download project JAR and parse it to OPAL ClassFile representation
  3. Download all transitive dependency JARs and parse them to OPAL ClassFile representation (interfaces only)
  4. Initialize a Project instance based on those project and library class files
  5. Extract performance values for the operations mentioned above
  6. Write the following values into a CSV file: GAV, #ProjectClasses, #Libraries, #LibraryClasses, StreamTime, LoadAndParseProjectCFsTime, LoadAndParseLibraryCFsTime, TotalProjectInitTime, O4Time, O1Time, O5Time, O7Time, O2Time, O3Time, O6Time
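The measuring and CSV-writing steps above could look roughly like the following sketch. It is only an illustration, not the actual analysis: `stopwatch` and `write_rows` are hypothetical helpers, and the only thing taken from the list is the column order.

```python
import csv
import io
import time
from contextlib import contextmanager

# Column order as listed in step 6.
COLUMNS = [
    "GAV", "#ProjectClasses", "#Libraries", "#LibraryClasses",
    "StreamTime", "LoadAndParseProjectCFsTime", "LoadAndParseLibraryCFsTime",
    "TotalProjectInitTime", "O4Time", "O1Time", "O5Time", "O7Time",
    "O2Time", "O3Time", "O6Time",
]


@contextmanager
def stopwatch(row, column):
    """Store the elapsed time of the with-block in row[column], in whole ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        row[column] = round((time.perf_counter() - start) * 1000)


def write_rows(rows, out):
    """Write one CSV line per analyzed GAV; missing columns stay empty."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

A caller would wrap each step, for example `with stopwatch(row, "StreamTime"): ...`, and pass the collected rows to `write_rows` together with an open file.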

A first very basic run on ~1000 GAVs produced the following results: stats.csv. Note that all times are in milliseconds and that the LoadAndParse[Project|Library]CFsTime values depend on my local internet connection at home.

Let me know if you have any ideas or additional input for me. I'll then run the analysis on our servers and post the evaluation results under this issue.

johannesduesing commented 1 week ago

Today I ran the analysis on one of our servers (4 cores, 30 GB heap space). Unfortunately, it crashed after ~5000 GAVs. I just restarted it with a different configuration and hope to obtain some more results. Nevertheless, I did a preliminary evaluation of the results for those 5000 GAVs. Here's an overview:

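| Operation | AVG Time [ms] | MEDIAN Time [ms] | 75% Quantile [ms] |
| --- | --- | --- | --- |
| Project Classes Download & Init | 64 | 11 | 30 |
| Library Classes Download & Init | 1685 | 594 | 2092 |
| Project Instance Init | 446 | 37 | 145 |
| - O1 | 131 | 11 | 41 |
| - O2 | ~0 | 0 | 0 |
| - O3 | 17 | 3 | 13 |
| - O4 | 301 | 18 | 80 |
| - O5 | 84 | 14 | 63 |
| - O6 | 1 | 0 | 1 |
| - O7 | 3 | 0 | 4 |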
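For anyone who wants to reproduce such a summary from the CSV, the following sketch computes the three statistics for one column. It assumes the file has a header row with the column names listed earlier; the quantile interpolation method used for the table is not stated, so `method="inclusive"` is an assumption, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(rows, column):
    """Return mean, median and 75% quantile of one timing column.

    rows is any iterable of dicts, e.g. a csv.DictReader over stats.csv.
    Needs at least two rows for the quantile computation.
    """
    values = [float(row[column]) for row in rows]
    return {
        "avg": statistics.mean(values),
        "median": statistics.median(values),
        # Assumption: inclusive interpolation; other methods differ slightly.
        "q75": statistics.quantiles(values, n=4, method="inclusive")[2],
    }
```

Typical usage would be `summarize(csv.DictReader(open("stats.csv", newline="")), "O4Time")`. Averages that are far above the medians, as in the table, point to a small number of very expensive projects dominating the mean.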

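As you can see, the most relevant operations seem to be O4 (computing instance methods) and O5 (computing overriding methods), which have the highest median and 75% quantile times. O1 has a higher average than O5, but a lower median.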

errt commented 6 days ago

Thank you for looking into this. I had a glance at the CSV, but haven't gained deeper insights yet. I think the steps that we expected to be the most expensive also ended up dominating the project creation time, with some differences between projects. Are there any insights you gained that would suggest a course of action besides a general "let's try not to compute everything all the time but just when needed"? Keep in mind that laziness would probably increase latency: right now some of the steps can be started right away and run in parallel, whereas with lazy computation neither would be possible.

johannesduesing commented 6 days ago

I do think it's rather tricky to optimize. Instance and overriding methods are the last things to be computed before the project is created and could therefore perhaps be made lazy, but that would impact project validation, which could then only be performed in a reduced fashion or not at all. Maybe we want to come back to the LazyProject / UnsafeProject idea, with a separate class for use cases where you only need, e.g., the class hierarchy. Before we come to any final conclusions, I'd like to a) gather some more data and b) try the same experiments with your additions from #215, just to see the performance impact.
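To illustrate the lazy variant being discussed, here is a small Python sketch of the general pattern. It is not a proposal for OPAL's actual API: `LazyProjectSketch` and the two injected compute functions are hypothetical stand-ins for the instance-method and overriding-method computations.

```python
from functools import cached_property


class LazyProjectSketch:
    """Defers the two expensive computations until they are first accessed."""

    def __init__(self, class_files, compute_instance_methods,
                 compute_overriding_methods):
        # Construction only stores inputs; nothing expensive runs here.
        self.class_files = class_files
        self._compute_instance_methods = compute_instance_methods
        self._compute_overriding_methods = compute_overriding_methods

    @cached_property
    def instance_methods(self):
        # Computed on first access, then cached on the instance.
        return self._compute_instance_methods(self.class_files)

    @cached_property
    def overriding_methods(self):
        return self._compute_overriding_methods(self.class_files)
```

Clients that only need the class hierarchy would never pay for either computation. The first client that does access them pays the full cost at that point, without the parallel head start the eager version gets, and any validation that depends on these results can no longer run at construction time.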