Does Coherence optimize request routing to stay within racks where possible?

javafanboy commented 2 years ago

I am looking for general advice on how Coherence routes requests. If a "non-storage enabled" node requests an object from a partitioned cache in the same cluster, will the request be routed to whichever of the primary or backups is closest (for instance, one in the same rack as the requesting node, determined using the rack id), or will it always go to the node that is the "primary owner" of the key even if that results in higher latency? If this optimization exists, would it be possible to configure whether it should also take asynchronously updated backups (if used) into consideration, i.e. whether it is OK to route requests to a possibly slightly stale copy in the same rack rather than to one of the synchronously updated ones in another rack? If only two racks are used (or with, say, three racks and one asynchronously updated backup), it would be quite sweet if this optimization existed, as requests would then always be routed within the same rack (except during rebalancing). In fact, this optimization could even make "stretch clusters" spanning two data centres possible (in low update frequency use-cases) without using cluster replication (which is not available in CE)...

agleyzer10 commented 2 years ago

Is this what you mean:

https://coherence.community/22.06-SNAPSHOT/docs/#/docs/core/09_backup

javafanboy commented 2 years ago

Thanks, that is exactly what I was looking for. I am looking in the documentation for confirmation of whether replicas beyond the first are always asynchronous (as I think was the case back when I first started using Tangosol Coherence; I am now trying to get back to using it and learn what the product is like today), but I fail to find it spelled out clearly...

In particular, if all replicas are synchronous (or can be configured to be), it would be interesting to set backup-count to two and distribute the Coherence cache over VMs in 3 availability zones. With the improvement you linked, the latency (and inter-AZ billing in AWS) could be minimized (assuming one specifies the "rack" of each VM as the AZ it runs in)...
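As a rough illustration of "specifying the rack of each VM as the AZ it runs in", here is a minimal Java sketch that copies a zone name into the coherence.rack system property before the cluster member is started. The environment variable name AVAILABILITY_ZONE, the class name RackIdentity and the fallback value "zone-a" are assumptions made for this example only; how the zone name is actually obtained depends on the deployment.

```java
public class RackIdentity {
    // Hypothetical environment variable holding the zone name (an assumption for this sketch).
    static final String ZONE_ENV = "AVAILABILITY_ZONE";

    // Sets the coherence.rack system property to the given zone and returns the stored value.
    static String applyRack(String zone) {
        if (zone == null || zone.trim().isEmpty()) {
            throw new IllegalArgumentException("zone must not be empty");
        }
        System.setProperty("coherence.rack", zone.trim());
        return System.getProperty("coherence.rack");
    }

    public static void main(String[] args) {
        String zone = System.getenv(ZONE_ENV);
        if (zone == null) {
            zone = "zone-a"; // placeholder so the sketch runs without the variable set
        }
        System.out.println("coherence.rack=" + applyRack(zone));
        // The Coherence member would have to be started after this point (not shown),
        // so that it picks up the property when it joins the cluster.
    }
}
```

The same value could equally be passed on the command line as -Dcoherence.rack=... when launching the JVM; the sketch only shows the idea of deriving it per VM.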

mgamanho commented 2 years ago

@javafanboy just as a follow up for clarification:

If neither scheduled-backups nor async-backup is set, mutations wait for the backup requests to complete before responding, regardless of the number of backups. This can lead to slower response times on put/putAll calls. async-backup still waits for the backup requests to be sent before returning to the client. If your application is read-heavy, you won't be affected too much by this. scheduled-backups does not wait, but can lead to data loss during the window in which the backups have not yet been made.
With "read-locator" there is a possibility of running into "dirty reads", but I think you are already aware of that.

javafanboy commented 2 years ago

Thanks for the clarification!

javafanboy commented 2 years ago

One more question about "backups": if you lose a whole rack, will Coherence still maintain the specified backup count? That is, if I have, say, three racks and backup count set to two, and one rack loses power or network, will I need enough RAM capacity in the remaining two racks to hold the primaries and two backups, or will only one backup be maintained until at least three racks are available again?

thegridman commented 2 years ago

Basically, yes, you need to have enough RAM to allow for the amount of failure you want to support. Assuming you have configured Coherence so it can be site safe (setting the coherence.site property) or rack safe (setting the coherence.rack property), then failure of a rack, say, will mean the data is recovered from backups in the remaining racks, and new backups are then also allocated in the remaining racks. The only time backups will not be allocated is when you only have one JVM in the cluster. If Coherence did not do this, then when you lost a rack, for example, the data in the caches would be endangered until new members came up on a new rack; any other JVM departure during this time would cause data loss. We do have quorum configurations in Coherence that control various behaviours; I'm not sure off the top of my head whether backups is one of them. I'll ask, or someone who knows can answer.
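To put rough numbers on the RAM question, here is a small back-of-the-envelope Java sketch. It assumes that the configured backup count is kept after a rack is lost (as described above), that data is spread evenly across racks, and it ignores JVM headroom, indexes and other overhead. The class and method names are made up for this example.

```java
public class RackCapacity {
    // Total data held in the cluster: one primary copy plus backupCount backup copies.
    static double totalGb(double primaryGb, int backupCount) {
        return primaryGb * (1 + backupCount);
    }

    // Data each rack must hold if all copies are spread evenly over the given number of racks.
    static double perRackGb(double primaryGb, int backupCount, int racks) {
        if (racks < 1) {
            throw new IllegalArgumentException("need at least one rack");
        }
        return totalGb(primaryGb, backupCount) / racks;
    }

    public static void main(String[] args) {
        double primaryGb = 60.0; // example figure, not taken from any real deployment
        int backupCount = 2;
        System.out.printf("3 racks: %.1f GB per rack%n", perRackGb(primaryGb, backupCount, 3));
        System.out.printf("2 racks: %.1f GB per rack%n", perRackGb(primaryGb, backupCount, 2));
    }
}
```

With 60 GB of primary data and a backup count of two, the cluster holds 180 GB in total: 60 GB per rack across three racks, rising to 90 GB per rack if one rack is lost and both backups are re-created in the remaining two.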

thegridman commented 2 years ago

Also, I believe Coherence will not try to totally unbalance a cluster. For example, say you had five cluster members on each of two racks, then you lost four members on one rack, leaving only one member on that rack and five on the other. I'm pretty sure Coherence would not try to allocate all the backups onto the one remaining member on that rack.
