jbellis / jamm

Java Agent for Memory Measurements
Apache License 2.0

Please create CachedMemoryMeterStrategy that caches shallow sizes of classes #65

Open daniel-rusu opened 11 months ago

daniel-rusu commented 11 months ago

I worked on a JVM memory measurement library that used reflection to analyze deep object graphs and found a very significant performance improvement by simply caching shallow sizes. I don't have references but I believe it was on the order of 10 times faster for some common scenarios.

The implementation would be quite trivial: a new CachedMemoryMeterStrategy class that wraps the selected strategy, holds a map from each class to its cached shallow size, and consults that cache in measure before delegating to the backing strategy.

Scenarios where this is extremely beneficial:

  1. Collections and arrays usually store objects of the same type (or a small number of subtypes)
  2. Collections themselves can wrap every item, ending up with many wrappers of the same type (such as Node in HashMap)
  3. Tree structures usually contain a very small number of types (e.g. subtree and leaf nodes)

Even if an object graph contains no repeated classes, this allows us to re-use the same MemoryMeter instance to measure a large number of similar objects so that the cache from the first measurement speeds up the measurements of subsequent objects.

daniel-rusu commented 11 months ago

The MemoryMeter.Builder could have a flag that controls caching with a default of false. When enabled, it would select the MemoryMeterStrategy as usual and wrap it in the fairly trivial CachedMemoryMeterStrategy.

The new CachedMemoryMeterStrategy only needs to override the measure method and delegate everything else to the backing strategy.

blerer commented 11 months ago

As Jamm supports multithreading, you cannot use a simple HashMap and have to use a ConcurrentHashMap or some other concurrent map. I did some experiments with caching in the past. When running Jamm with Java 17 and the Instrumentation strategy, the JMH benchmarks were showing that the caching approach was actually slowing things down. For the other strategies it was effectively beneficial, but you then ended up with the issue that you cannot control how much memory Jamm itself is using. Considering that, if you are serious about performance and the quality of the measurements, you would go for Java 17 and the Instrumentation strategy anyway, I decided not to implement that functionality. Of course, I could have missed something and I am willing to reconsider that choice if you have numbers that invalidate my findings. :-)

In the meantime, we designed Jamm so that anybody should be able to plug in their own custom strategy by creating the MemoryMeter using the MemoryMeter(MemoryMeterStrategy, FieldAndClassFilter, FieldFilter, boolean, MemoryMeterListener.Factory) constructor.

daniel-rusu commented 11 months ago

Thanks, I tried it out in my project (using Kotlin):

import java.util.concurrent.ConcurrentHashMap
import org.github.jamm.MemoryMeterStrategy

// Decorator that caches the shallow size per class; everything except measure
// is delegated to the wrapped strategy via Kotlin interface delegation.
class CachedMemoryMeterStrategy(
    val backingStrategy: MemoryMeterStrategy
) : MemoryMeterStrategy by backingStrategy {
    // ConcurrentHashMap because a MemoryMeter instance may be shared across threads.
    private val cache = ConcurrentHashMap<Class<Any>, Long>()

    override fun measure(entity: Any): Long {
        return cache.getOrPut(entity.javaClass) {
            backingStrategy.measure(entity)
        }
    }
}

Then I created the MemoryMeter with this function:

// Selects the backing strategy, optionally wraps it in the caching decorator,
// and builds the MemoryMeter via its public constructor.
private fun createMemoryMeter(strategyType: Guess, cacheShallowSizes: Boolean): MemoryMeter {
    val backingStrategy = MemoryMeterStrategies.getInstance().getStrategy(listOf(strategyType))
    val strategy = when (cacheShallowSizes) {
        true -> CachedMemoryMeterStrategy(backingStrategy)
        else -> backingStrategy
    }
    return MemoryMeter(
        strategy,
        Filters.getClassFilters(/* ignoreKnownSingletons = */ true),
        Filters.getFieldFilters(
            /* ignoreKnownSingletons = */ true,
            /* ignoreOuterClassReference = */ false,
            /* ignoreNonStrongReferences = */ true,
        ),
        NoopMemoryMeterListener.FACTORY,
    )
}
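
For reference, a call site for this helper might look like the sketch below; the Guess constant name and the measured map are assumptions for illustration, not part of the original comment:

// Illustrative only: Guess.INSTRUMENTATION is an assumed constant name, and the
// HashMap is just an arbitrary object graph to measure.
val meter = createMemoryMeter(Guess.INSTRUMENTATION, cacheShallowSizes = true)
val payload = HashMap<String, List<Int>>().apply {
    repeat(1_000) { put("key-$it", List(10) { i -> i }) }
}
println("deep size = ${meter.measureDeep(payload)} bytes")

With caching enabled, every HashMap node after the first one should resolve its shallow size from the cache rather than from the backing strategy.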

Here are some benchmarks on JDK 17 using Jamm 0.4.0 (ran each test 3 times and reported the duration in seconds):

So the Instrumentation strategy shows a tiny improvement that's close to the margin of error, but the other strategies show a healthy 20% improvement. This suggests that some other overhead is dominating the runtime. I suspect performance would improve much more if each array element weren't enqueued but instead processed directly. Either way, a 20% improvement is a good start.
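
As a rough illustration of the "process array elements directly" idea (this is not Jamm's implementation; shallowSizeOf and hasReferenceFields stand in for cached lookups):

import java.util.ArrayDeque

// Sketch only: account for each element's shallow size in place via a cached
// per-class lookup, and enqueue an element for field traversal only when its
// class can actually reference other objects. `visited` is assumed to use
// identity semantics, as a deep-measurement pass would.
fun measureArrayElements(
    array: Array<Any?>,
    shallowSizeOf: (Class<*>) -> Long,          // assumed: cached shallow-size lookup
    hasReferenceFields: (Class<*>) -> Boolean,  // assumed: cached field inspection
    visited: MutableSet<Any>,
    pending: ArrayDeque<Any>
): Long {
    var total = 0L
    for (element in array) {
        if (element == null || !visited.add(element)) continue
        val clazz = element.javaClass
        total += shallowSizeOf(clazz)           // repeated element types hit the cache
        if (hasReferenceFields(clazz)) {
            pending.push(element)               // only these need further traversal
        }
    }
    return total
}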

daniel-rusu commented 11 months ago

Created a separate topic for measuring array elements much more efficiently: https://github.com/jbellis/jamm/issues/66

Once that's implemented, I expect the improvements from caching to become more pronounced as the array overhead is probably the main bottleneck when measuring collections.

The memory saved by #66 should be greater than the cache overhead when dealing with medium-sized or larger collections. The rest can be left to documentation: when caching is enabled, the shallow size of every class encountered is cached for the lifetime of the MemoryMeter instance, so the cache keeps growing as new types are encountered.

daniel-rusu commented 11 months ago

@blerer , I just remembered that the caching I added before also included the relevant fields per class, in addition to the shallow size, to avoid re-traversing the inheritance hierarchy and re-filtering the fields each time. Thinking about it some more, I also realized that the strategy level is not the best place for caching. Setting up a temporary cache each time measureDeep is called makes it easier to cache both the shallow size and the fields without changing the current architecture, and it also addresses the concern about memory usage growing unbounded across multiple usages.
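
For illustration, a per-call cache along those lines might look like the sketch below. It is not the actual patch: measureShallow stands in for the selected strategy, and the field filter is reduced to a placeholder. Because the cache lives only for the duration of a single measureDeep call, a plain HashMap is enough and nothing accumulates across measurements.

import java.lang.reflect.Field
import java.lang.reflect.Modifier

// Sketch of a per-measureDeep cache entry holding both the shallow size and the
// reference fields worth following, so neither is recomputed for repeated classes.
private class ClassMetadata(val shallowSize: Long, val referenceFields: List<Field>)

private fun metadataFor(
    obj: Any,
    measureShallow: (Any) -> Long,               // assumed: the selected strategy's measure
    cache: MutableMap<Class<*>, ClassMetadata>   // created fresh per measureDeep call, then discarded
): ClassMetadata = cache.getOrPut(obj.javaClass) {
    val fields = mutableListOf<Field>()
    var current: Class<*>? = obj.javaClass
    while (current != null) {                    // walk the inheritance hierarchy once per class
        for (field in current.declaredFields) {
            // Placeholder filter: Jamm's real field filters are more involved.
            if (!Modifier.isStatic(field.modifiers) && !field.type.isPrimitive) {
                fields += field
            }
        }
        current = current.superclass
    }
    ClassMetadata(measureShallow(obj), fields)
}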

Since I didn't have an easy way to try this, as a hack I temporarily copied the MemoryMeter class and added a local cache inside the measureDeep method, and found that it was now measurably faster than all of the current strategies (about 20% faster than Instrumentation and 35% faster than Unsafe or Spec).

The extra memory from the temporary cache can be offset by the memory savings of handling arrays more efficiently (I could submit that separately). The impact of a temporary cache would become even larger once arrays are handled more efficiently.

Since my hack is in Kotlin, I wanted to get your go-ahead before putting in the effort to create a pull request in Java. Does this sound good to you?

blerer commented 11 months ago

@daniel-rusu Your proposal consists of 3 parts:

  1. a new way to measure array elements
  2. a measurement cache
  3. a field cache

Regarding 1, I answered in the #66 ticket. Regarding 2 and 3, we did experiment with them and chose to stay away from them. For 2, the main reason is that it is not needed on Java 17 (in practice our JMH benchmarks showed worse performance) unless you use the Unsafe or Specification strategy. Do you have a use case where you are forced to use those strategies and would get better performance from caching? For 3, it is pretty hard to evaluate the amount of data that will be loaded into memory. I am fine revisiting those choices if users have a real-life performance problem with the library that needs to be solved.

Our goal was to create a library that is easy to use, without nasty side effects that could hit you by surprise, and to document it as well as possible so that the special cases are clear from the start. Running Jamm with Java 17 and Instrumentation should give you the most correct results and pretty good performance without any kind of caching.

Do you have a use case where Jamm's performance is not good enough for you? If so, could you describe your use case and share numbers?

daniel-rusu commented 11 months ago

@blerer a few clarifications and then I'll include my use-case at the bottom:

My use-case:

I'm using Jamm to analyze the memory efficiency of custom algorithms to see how they scale, as most of these algorithms need to pre-process the data into a custom data structure. To accurately evaluate the memory efficiency of an algorithm, I need to run Monte Carlo simulations with randomized datasets over many variations of the parameters that affect the shape and distribution of the dataset. The more simulations I can run, the lower the variance of the results, so I can present higher-accuracy results with smaller margins of error. Since I have so many variations to consider in a limited amount of time, performance currently limits the number of runs that I can perform, resulting in lower accuracy due to higher variability and wider margins of error in the distributions.
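
To make the workload concrete, the measurement loop is roughly of the following shape (a hypothetical sketch: buildDataStructure stands in for one of the algorithms being evaluated, and it assumes the Jamm agent or a fallback strategy is available to the builder):

import org.github.jamm.MemoryMeter
import kotlin.random.Random

// Hypothetical stand-in for one of the custom data structures under evaluation.
private fun buildDataStructure(random: Random): Any =
    (0 until 10_000).associateWith { random.nextLong() }

fun main() {
    // A single MemoryMeter reused across every simulation run.
    val meter = MemoryMeter.builder().build()

    val deepSizes = LongArray(1_000) { i ->
        meter.measureDeep(buildDataStructure(Random(i.toLong())))
    }
    println("mean deep size = ${deepSizes.average()} bytes over ${deepSizes.size} runs")
}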

blerer commented 11 months ago

We use a single MemoryMeter instance for different measurements inside Apache Cassandra, so the application can run for months and, due to some bug, can sometimes end up measuring a significant number of classes. I am not so much worried about short-lived memory usage; that is fine for us. The issue is more and more memory being used over time and never released. Feel free to propose a patch; it is always easier to measure and discuss things with a concrete proposal.

daniel-rusu commented 10 months ago

Thanks, submitted PR: https://github.com/jbellis/jamm/pull/68

The cache is created temporarily for each invocation of measureDeep, which should address the concern with my initial proposal about the cache growing unbounded.

I also included another benchmark using collections that contain somewhat average-looking data, along with its results.