oracle / coherence

Oracle Coherence Community Edition
https://coherence.community
Universal Permissive License v1.0

COH-23644? #95

Closed javafanboy closed 1 year ago

javafanboy commented 1 year ago

I am working on a project where, due to the limitations of JDK 8 and other components we use, we are "stuck" with Coherence CE 14, and we are having a problem: evicting entries from the local cache (part of a near cache) takes a LONG time.

I realize this is not a supported release, so we are looking into solving the problem ourselves, but we have a specific question that could help us with this...

We are profiling and see that the line that seems to take a lot of the time is "iterEvict.remove()" in OldCache. It is marked with an issue reference, "COH-23644", which we assume introduced the line.

We are not sure whether this refers to an issue in Coherence CE or to an internal Oracle issue for the Enterprise Edition, but either way we have not been able to find any information about it, and we are very curious what the issue was and why the line was introduced...

Looking naively at the code, it seems to remove keys from an ArrayList of elements to be evicted as the list is iterated. It is not clear to us WHY this is needed, and in general removing elements from an ArrayList is expensive (each removal copies the part of the backing array that follows the removed element), so we suspect this is at least one reason we see slow eviction (our caches are big, and so the list of keys to evict can be too)...
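To make the suspected cost concrete, here is a minimal, self-contained sketch. It is not the OldCache code; the class and method names are made up for illustration. It contrasts removing each key through the iterator with iterating first and clearing once, and it counts the element moves for the worst case where every removal happens at the front of an array-backed list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Coherence.
class EvictionDrainSketch {
    /** Visits every key and removes it through the iterator, one at a time. */
    static int drainWithIteratorRemove(List<String> keys) {
        int processed = 0;
        for (Iterator<String> iter = keys.iterator(); iter.hasNext(); ) {
            iter.next();   // the real code would evict the entry for this key
            iter.remove(); // on an ArrayList this shifts all later elements left
            processed++;
        }
        return processed;
    }

    /** Visits every key and empties the list once at the end. */
    static int drainThenClear(List<String> keys) {
        int processed = 0;
        for (String key : keys) {
            processed++;   // the real code would evict the entry for this key
        }
        keys.clear();      // a single pass over the backing array
        return processed;
    }

    /** Element moves when n items are removed one by one from the front: (n-1) + (n-2) + ... + 0. */
    static long shiftsForFrontRemoval(long n) {
        return n * (n - 1) / 2;
    }

    static List<String> sampleKeys(int n) {
        List<String> keys = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            keys.add("key-" + i);
        }
        return keys;
    }
}
```

Both methods leave the list empty; the difference is that the first one pays for an array copy on every removal, which adds up to roughly n²/2 element moves for a list of n keys.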

Our first workaround, which we are testing now, is to raise low-units to about 99% of high-units so that the array list stays small (which minimizes the ~O(n ** 2) effect of the repeated remove operation). Let's see if it works...
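A rough back-of-the-envelope sketch of why this workaround should help, under loud assumptions: one unit equals one entry, each prune evicts exactly high-units minus low-units entries, and every removal happens at the front of the list. The unit values below are invented for illustration and are not our real configuration.

```java
// Hypothetical arithmetic helper; not part of Coherence.
class PruneBatchSketch {
    /** Entries evicted per prune, assuming one unit per entry. */
    static long batchSize(long highUnits, long lowUnits) {
        return highUnits - lowUnits;
    }

    /** Element moves for one prune that removes 'batch' keys from the front of an array list. */
    static long shiftsPerPrune(long batch) {
        return batch * (batch - 1) / 2;
    }

    /** Total element moves to evict 'totalEvictions' entries; assumes it is a multiple of 'batch'. */
    static long shiftsToEvict(long totalEvictions, long batch) {
        long prunes = totalEvictions / batch;
        return prunes * shiftsPerPrune(batch);
    }

    public static void main(String[] args) {
        long high = 1_000_000;
        long wide = batchSize(high, 800_000);   // low-units at 80% of high-units
        long narrow = batchSize(high, 990_000); // low-units at 99% of high-units
        System.out.println("wide batch:   " + shiftsToEvict(200_000, wide) + " moves");
        System.out.println("narrow batch: " + shiftsToEvict(200_000, narrow) + " moves");
    }
}
```

Under these assumptions, evicting the same 200,000 entries costs roughly 20 times fewer element moves with the narrow batch, at the price of 20 times as many prunes.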

As a further workaround, we are also considering using Caffeine (a 2.x version) as the local cache with this Coherence CE release (using the old, lower-level mechanism for providing your own cache implementation), but we are not sure how well it would work. Any experiences with this would be welcome!

I have noticed that the line mentioned above is also present in the latest Coherence CE release, so if it is the cause of the problem we are seeing, the problem may be present in the latest release too...

javafanboy commented 1 year ago

I can report that the workaround of setting low-units very close to high-units (so that only a small number of entries are evicted each time a "prune" is performed) seems to have improved the situation quite a lot. Prune still seems to consume more CPU than expected, though, and we now have very frequent prunes instead, which is not ideal either...

I would appreciate information about WHETHER the line that removes the entries from the list as they are processed is really needed. I would naively assume the whole prune operation is executed inside a "synchronized" block that prevents any other operation on the local cache, and in that case I do not immediately see the need to remove the elements one by one instead of leaving them in place and afterwards either emptying the whole list at once (if it is retained) or just discarding it (if it is temporary). If the line is needed, do you have other suggestions for improving the code that we could add in a "patch" for our own use? Maybe use a doubly linked list, which has O(1) complexity for deleting an entry during iteration, instead of an array list?!
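A minimal sketch of the O(1)-removal idea, assuming the keys to evict can be held in a queue instead of an ArrayList. This is not how OldCache is written; the class name, method name, and the evictor callback are hypothetical. An ArrayDeque is used here rather than a doubly linked list because polling its head is also constant time and it allocates no per-element nodes.

```java
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical illustration only; not Coherence code.
class DequeDrainSketch {
    /**
     * Evicts keys in order. Each poll() removes the head in O(1), so the
     * queue drops its reference to a key as soon as that key is processed.
     * The queue must not contain null elements (ArrayDeque forbids them).
     */
    static int drain(Queue<String> pending, Consumer<String> evictor) {
        int processed = 0;
        String key;
        while ((key = pending.poll()) != null) {
            evictor.accept(key); // stand-in for the real eviction of the entry
            processed++;
        }
        return processed;
    }
}
```

Usage would look like `DequeDrainSketch.drain(new ArrayDeque<>(keysToEvict), key -> evict(key))`, where `keysToEvict` and `evict` are placeholders for whatever the surrounding code provides.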

mgamanho commented 1 year ago

Hi @javafanboy

So far as I can tell, this is a memory use improvement to address instances of OOM that we saw. It's just to release references early, and I believe should be safe for you to just undo that line. Said references will be released eventually.
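For readers who want both early release of references and cheap iteration, one possible middle ground is sketched below. It is an assumption-laden illustration, not the COH-23644 change and not a tested Coherence patch: instead of removing each element, the list slot is overwritten with null through a ListIterator, which is a constant-time operation on an ArrayList, and the list is cleared once at the end. The class, method, and evictor callback are hypothetical.

```java
import java.util.List;
import java.util.ListIterator;
import java.util.function.Consumer;

// Hypothetical illustration only; not Coherence code.
class NullOutDrainSketch {
    /**
     * Evicts each key, then nulls its slot so the list no longer references it.
     * Requires a list that supports set() and null elements, such as ArrayList.
     */
    static int drain(List<String> keysToEvict, Consumer<String> evictor) {
        int processed = 0;
        for (ListIterator<String> iter = keysToEvict.listIterator(); iter.hasNext(); ) {
            String key = iter.next();
            evictor.accept(key); // stand-in for the real eviction of the entry
            iter.set(null);      // O(1): drops the reference without shifting elements
            processed++;
        }
        keysToEvict.clear();     // one pass to empty the list
        return processed;
    }
}
```

Whether this would be enough for the out-of-memory cases that motivated the original line is not something this sketch can answer.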

If memory usage is not a concern for you, I would say just go ahead and "patch".

We'll investigate this on our side. Thanks for finding it.

javafanboy commented 1 year ago

Thanks for the quick reply - will try a "patch" to remove the line!

javafanboy commented 1 year ago

I have never built Coherence before. When I followed the instructions, the build eventually reported completion (including tests, some of which were skipped), but nothing shows up under the /dist directory?!

My latest attempt was with "mvn -am -pl coherence clean install -DskipTests -Dtde.compile.not.required", which reports the build as successful but leaves dist empty (only the original readme file is there)?

A quick search for "coherence*.jar" results in:

/prj/coherence-core-components/target/coherence-core-components-14.1.1-0-13-SNAPSHOT-sources.jar
./prj/coherence-core-components/target/coherence-core-components-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-core/target/coherence-core-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-core/target/coherence-core-14.1.1-0-13-SNAPSHOT-tests.jar
./prj/coherence-core/target/coherence-core-14.1.1-0-13-SNAPSHOT-sources.jar
./prj/coherence-discovery/target/coherence-discovery-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-docker/target/coherence-docker-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-http-netty/target/coherence-http-netty-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-javadoc/target/coherence-javadoc-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-jcache/target/coherence-jcache-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-loadbalancer/target/coherence-loadbalancer-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-login/target/coherence-login-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-management/target/coherence-management-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-metrics/target/coherence-metrics-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-mock/target/coherence-mock-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-rest/target/coherence-rest-14.1.1-0-13-SNAPSHOT-tests.jar
./prj/coherence-rest/target/coherence-rest-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence-testing-support/target/coherence-testing-support-14.1.1-0-13-SNAPSHOT.jar
./prj/coherence/target/coherence-14.1.1-0-13-SNAPSHOT.jar
./prj/test/performance/framework/target/lib/14.1.1-0/coherence.jar
./prj/test/performance/framework/target/coherence-performance-framework-14.1.1-0-13-SNAPSHOT.jar
./prj/test/performance/psr/target/lib/14.1.1-0/coherence.jar
./prj/test/performance/psr/target/coherence-performance-psr-14.1.1-0-13-SNAPSHOT.jar
./prj/test/performance/target/lib/14.1.1-0/coherence.jar
./tde/core-net/3.0/target/coherence.jar
./tde/core/1.3/ext/coherence-core.jar
./tde/core/1.3/ext/coherence-discovery.jar
./tools/tde/lib/coherence.jar

Any ideas what is up?

I did not get the source with a git client but by downloading a ZIP (selecting v14.1.1.0 as the branch), but I assume that should not make any difference.

Also, the v14.1.1 source I downloaded contains a readme.txt (in addition to the README.md) that seems to contain some old / misleading instructions (Oracle-internal, for the Enterprise Edition?). It confused me at first, so I actually executed the config files mentioned in it (the Linux one, which also runs the common one). They seemed quite harmless when I looked at them, but I am mentioning it in case I could have "fucked up" something by running them...

mgamanho commented 1 year ago

When you build like that (mvn ... install), the jar ends up in the maven repository, usually in ~/.m2/repository/com/oracle/...

Otherwise the main jar can also be found at ./prj/coherence/target/coherence-14.1.1-0-13-SNAPSHOT.jar; it's the same file. That is useful for debugging or prototyping purposes.

I wouldn't worry about making any mistakes running these config files; Coherence doesn't depend on any "hidden" magic à la Windows registry :) I'll look into those, at any rate.
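As a small aid for finding an installed jar, the sketch below builds the path that the standard Maven local-repository layout uses for an artifact: the groupId with dots turned into directories, then artifactId, then version, then `artifactId-version.jar`. The coordinates in the usage line are placeholders, not Coherence's actual coordinates, and the class name is made up.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper illustrating the Maven local repository layout.
class LocalRepoPathSketch {
    /** Returns repoRoot/<groupId as dirs>/<artifactId>/<version>/<artifactId>-<version>.jar. */
    static Path jarPath(Path repoRoot, String groupId, String artifactId, String version) {
        return repoRoot
                .resolve(groupId.replace('.', '/'))
                .resolve(artifactId)
                .resolve(version)
                .resolve(artifactId + "-" + version + ".jar");
    }

    public static void main(String[] args) {
        // Default local repository location; it can be overridden in Maven's settings.
        Path repo = Paths.get(System.getProperty("user.home"), ".m2", "repository");
        // Placeholder coordinates, for illustration only.
        System.out.println(jarPath(repo, "com.example", "demo", "1.0-SNAPSHOT"));
    }
}
```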

javafanboy commented 1 year ago

Thanks for the info, and happy to hear there are no similarities between Coherence and Windows :-)

javafanboy commented 1 year ago

As the devs in my project mostly use Windows laptops, I can report that I also tried building 14.1.1.0 under Windows, but right away testApproximateDurationToString (DurationTest.java) failed with the message below (I have not tried to figure out why). I used Maven 3.9.0, OpenJDK 1.8, and Windows 10.

DurationTest.testApproximateDurationToString:110 expected:<1[.]50ms> but was:<1[,]50ms>

thegridman commented 1 year ago

Off the top of my head, I wonder if the DurationTest.java failure is a region thing. I notice your error message says expected:<1[.]50ms> but was:<1[,]50ms>, i.e. the test expects a decimal point (as this is hard coded in the test code), but your region settings use a comma in place of a decimal point (like we do here in Turkey, although my laptop uses English for its region settings). We run all our CI builds on a number of different OSes, including versions of Windows, and they all pass, but they are all on US-based Jenkins slaves. I'm sure we do not run builds with non-English regions; maybe we should try one and see what happens.
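The effect described above is easy to reproduce with plain JDK formatting. The sketch below is not the Coherence Duration code; the class and method names are hypothetical. It only shows that `String.format` picks the decimal separator from the locale it is given, so a test that hard codes "1.50ms" depends on the machine's default locale unless the formatting code pins one.

```java
import java.util.Locale;

// Hypothetical illustration of locale-sensitive number formatting.
class LocaleFormatSketch {
    /** Formats a millisecond value with two decimals using the given locale's separator. */
    static String formatMillis(double millis, Locale locale) {
        return String.format(locale, "%.2fms", millis);
    }

    public static void main(String[] args) {
        System.out.println(formatMillis(1.5, Locale.US));           // decimal point
        System.out.println(formatMillis(1.5, Locale.GERMANY));      // decimal comma
        System.out.println(formatMillis(1.5, Locale.getDefault())); // depends on the machine
    }
}
```

Passing an explicit locale such as `Locale.US` or `Locale.ROOT` wherever machine-readable output is produced would make such a test independent of regional settings.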

javafanboy commented 1 year ago

The "patch" of removing the line worked: we no longer see minute-long delays in eviction during which one core runs at 100%. We observed no negative effects from removing it in our testing.

Do I need to file a "bug report" (this line still exists in the code base), or will you find a solution based on my findings?