Memcached failover in non-sticky mode

ghost commented 9 years ago

From d3da...@gmail.com on June 24, 2015 15:46:44

A bit about the setup I'm using: 1 haproxy as a load balancer; 2 tomcat6 nodes; 2 memcached nodes; non-sticky mode; Kryo serialization strategy; operation and sessionBackup timeouts are default; locking strategy: auto.

What steps will reproduce the problem?

1. Start all the nodes.
2. Log in to the application.
3. Check that the session backup is saved on the secondary memcached node.
4. Shut down the primary memcached node.
5. Navigate to some other page in the application (sometimes I get dropped to the login screen with a new session identifier; I'm not sure why this happens, but it's possibly caused by timeouts).
6. Restore the memcached node (it takes a while for Tomcat to detect that the node is back up and to store the session backup on it; I'm looking for options to change this timeout). As the session backup process is triggered by user requests, in this step I interact with the application until the session is stored as a backup.
7. Kill the other node (which is now primary).
8. The next interaction with the application takes me to the login screen (session information lost), but if I change the session id to that of the session that should have been restored, I can use the application with that session identifier.

Basically it's quite an interesting situation, and currently I'm not sure what is causing this behaviour, as I can't reliably reproduce the issue. Any suggestions will be appreciated.

Original issue: http://code.google.com/p/memcached-session-manager/issues/detail?id=228

ghost commented 9 years ago

From d3da...@gmail.com on July 01, 2015 08:42:02

I've investigated this issue a bit more. The session is lost when there are concurrent requests and one of them is matched by requestIgnorePattern. As far as I understand, there's a race condition over which request gets served first. If the ignored request is served first, the session is lost because backup retrieval is not triggered. When this parameter is omitted in context.xml, failover works as expected, but in my case we have a lot of heavy JS pages, and each request to such a page would generate 30-50 requests to each memcached node to update the metadata of the session stored there. So disabling it is not an option.
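
To make the trade-off concrete, here is a minimal sketch of how an ignore pattern splits the requests of one page load into those that skip session handling and those that still go through it. The class name, the pattern, and the URIs are assumptions made for illustration; none of this is msm code.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration: class, pattern, and URIs are assumptions, not msm code.
class IgnorePatternDemo {

    // Assumed pattern: treat common static resources as session-independent.
    static final Pattern IGNORE = Pattern.compile(".*\\.(png|ico|gif|jpg|css|js)$");

    // True if a request for this URI would skip session handling.
    static boolean isIgnored(String uri) {
        return IGNORE.matcher(uri).matches();
    }

    // Number of requests that would still go through session handling.
    static long trackedCount(List<String> uris) {
        return uris.stream().filter(u -> !isIgnored(u)).count();
    }

    public static void main(String[] args) {
        List<String> pageLoad = List.of(
                "/app/dashboard", "/static/app.js", "/static/vendor.js",
                "/static/menu.png", "/static/site.css");
        System.out.println("tracked with pattern: " + trackedCount(pageLoad));
        System.out.println("tracked without pattern: " + pageLoad.size());
    }
}
```

With the pattern in place, only the controller request in this example would touch the session; without it, every resource request would.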

ghost commented 9 years ago

From martin.grotzke on July 02, 2015 00:37:19

Great that you investigated this more!

So the session is lost when there are concurrent requests and one of them is matched by requestIgnorePattern.

Are you referring to a request that should not match the requestIgnorePattern? Is the pattern incorrect / too broad then?

As far as I understand, there's a race condition over which request gets served first.

If the browser sends parallel requests (e.g. via ajax), then there's indeed no guarantee which one hits the server first. This would have to be handled on the client side; the server can't do anything about this.

If the ignored request is served first, the session is lost because backup retrieval is not triggered.

But a request after the ignored one should then trigger backup retrieval. Doesn't this happen?

Are the "heavy js pages" somehow related to the session, or are these just "stateless" resources?

ghost commented 9 years ago

From d3da...@gmail.com on July 02, 2015 07:05:50

About the requestIgnorePattern: the pattern matches the png file in my case. Basically I'm trying the following scenario:

Log in; both memcached nodes are up and the session is backed up correctly.
Kill the primary node.
When I select a menu, a png request is sent to the backend (CSS background). Right after that I click the link, which calls the controller.

If the png request is served first, the request tracking host valve does not check the status of the primary node, and the session is not recovered from the backup. After that I get a new session id which is not contained in any of the memcached nodes, and the following request (controller) is served with this new session id, so the application redirects to the login screen. Currently I'm not sure how this happens, but disabling requestIgnorePattern fixes the issue. It may have something to do with Spring Security's session fixation protection or something similar.

If the controller request is served first, failover works as expected.

By heavy JS pages I mean pages that request a lot of JS files while loading. These requests don't change session information in any way.
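
The two orderings can be sketched as a toy model. It only reproduces the reported symptom under an explicit assumption: that something downstream (for example a security filter) creates a fresh session when none is found. All names are hypothetical, and the real cause was still unknown at this point in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Toy model of the reported symptom. All names are hypothetical; this is not msm code.
class FailoverRaceModel {
    static final Pattern IGNORE = Pattern.compile(".*\\.png$");

    // The surviving memcached node; the primary is assumed to be down throughout.
    final Map<String, String> backupNode = new HashMap<>();
    int nextId = 1;

    // Handles one request and returns the session id the client holds afterwards.
    String handle(String uri, String sessionId) {
        boolean ignored = IGNORE.matcher(uri).matches();
        String session = null;
        if (!ignored) {
            // Tracked request: fall back to the backup copy of the session.
            session = backupNode.get(sessionId);
        }
        if (session != null) {
            return sessionId; // session recovered, id unchanged
        }
        // Assumption: with no session found, a fresh one is created downstream,
        // so the client receives a new id that no memcached node knows.
        return "new-" + (nextId++);
    }

    public static void main(String[] args) {
        FailoverRaceModel m = new FailoverRaceModel();
        m.backupNode.put("s1", "logged-in user");

        // Controller request served first: the session is recovered.
        System.out.println(m.handle("/app/controller", "s1"));

        // PNG request served first: the id is replaced, and the controller
        // request that follows carries an id unknown to every node.
        String afterPng = m.handle("/img/menu.png", "s1");
        System.out.println(m.handle("/app/controller", afterPng));
    }
}
```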

ghost commented 9 years ago

From d3da...@gmail.com on July 03, 2015 07:26:31

I've tried to reproduce this issue on the msm sample app that is hosted on GitHub. Failover works as expected there with the same configuration and the same Tomcat instance. As there were no resources like png, ico etc., I added one, but it still worked as expected.

I've also tried to fix this behavior by adding a primary memcached node availability check in RequestTrackingHostValve, where ignorePattern is evaluated. As far as I can tell, this fix works and failover works as expected in my application.
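
The idea behind that change can be sketched roughly as follows. This is not the code from the patch: the class, its constructor, and the availability callback are hypothetical stand-ins for whatever RequestTrackingHostValve actually uses.

```java
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

// Sketch of the idea only; names are hypothetical and do not mirror the actual patch.
class IgnoreDecision {
    private final Pattern ignorePattern;
    private final BooleanSupplier primaryNodeAvailable;

    IgnoreDecision(Pattern ignorePattern, BooleanSupplier primaryNodeAvailable) {
        this.ignorePattern = ignorePattern;
        this.primaryNodeAvailable = primaryNodeAvailable;
    }

    // A request is ignored only if it matches the pattern AND the primary node
    // is reachable. Otherwise it goes through normal session handling, so the
    // backup copy of the session can be retrieved.
    boolean shouldIgnore(String uri) {
        return ignorePattern.matcher(uri).matches()
                && primaryNodeAvailable.getAsBoolean();
    }

    public static void main(String[] args) {
        Pattern png = Pattern.compile(".*\\.png$");
        IgnoreDecision primaryDown = new IgnoreDecision(png, () -> false);
        IgnoreDecision primaryUp = new IgnoreDecision(png, () -> true);
        System.out.println(primaryUp.shouldIgnore("/img/menu.png"));
        System.out.println(primaryDown.shouldIgnore("/img/menu.png"));
    }
}
```

The cost of this approach is that, while the primary node is down, static resource requests again generate session traffic.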

ghost commented 9 years ago

From martin.grotzke on July 03, 2015 14:41:16

Can you submit a pull request with your change?

ghost commented 9 years ago

From d3da...@gmail.com on July 06, 2015 06:22:23

Submitted the pull request with possible fix: https://github.com/magro/memcached-session-manager/pull/44

ghost commented 9 years ago

From d3da...@gmail.com on July 14, 2015 08:59:29

Did you have time to look into it by chance?

ghost commented 9 years ago

From martin.grotzke on July 17, 2015 09:22:46

Sorry for the late response, business work took all the time... I had a look at your PR. AFAICS, when the primary node is unavailable, requests that otherwise should be ignored are then NOT ignored but go through standard session handling.

I tend to think that while this may solve the specific issue, it's still just a workaround and there is a different root cause.

I'd say that requests that should be ignored should completely bypass session handling, so they should not depend on memcached availability at all. If such requests cause issues this is probably not the case. I'd prefer to find and fix this issue.

What do you think?

ghost commented 9 years ago

From d3day.an...@gmail.com on July 19, 2015 06:32:10

This "fix" was made just to show what I mean; it treats the symptom rather than the cause. It's definitely not a solution to the problem. Also, I was not able to reproduce this issue with the test app (wicket). So I guess I'll invest a bit more time in investigating this issue until it's clear what is causing it. I just had a little hope that you'd "magically" find the problem =).

ghost commented 9 years ago

From martin.grotzke on July 21, 2015 13:13:52

Yeah, ok :-) Great that you're investigating this!

magro / memcached-session-manager

Memcached failover in non-sticky mode #267