mbr / flask-kvsession

A drop-in replacement for Flask's session handling using server-side sessions.
http://pythonhosted.org/Flask-KVSession/
MIT License

Lots of session collisions #25

Open etscrivner opened 10 years ago

etscrivner commented 10 years ago

It appears that the ID generation is not sufficiently random, or there is some other issue. I see frequent session ID collisions. These collisions are actually a major security risk, as I'm seeing users getting logged in as other users (their session data is being overwritten by that of another user). This opens us up to a potential session hijacking problem if a malicious user can find a way to reproduce the collision easily.

This issue is incredibly similar to a previous one (#12). Here's a basic stack trace:

IntegrityError: (IntegrityError) duplicate key value violates unique constraint "sessions_pkey"
DETAIL: Key (key)=(e0e17c94f388ba70_53d2ed1f) already exists.
'INSERT INTO sessions (key, value) VALUES (%(key)s, %(value)s)' {'key': 'e0e17c94f388ba70_53d2ed1f', 'value': <psycopg2._psycopg.Binary object at 0x49d2c88>}
File "/usr/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
File "/usr/lib/python2.7/site-packages/flask/app.py", line 1479, in full_dispatch_request
    response = self.process_response(response)
File "/usr/lib/python2.7/site-packages/flask/app.py", line 1693, in process_response
    self.save_session(ctx.session, response)
File "/usr/lib/python2.7/site-packages/flask/app.py", line 837, in save_session
    return self.session_interface.save_session(self, session, response)
File "/usr/lib/python2.7/site-packages/flaskext/kvsession.py", line 189, in save_session
    self.serialization_method.dumps(dict(session)))
File "/usr/lib/python2.7/site-packages/simplekv/__init__.py", line 137, in put
    return self._put(key, data)
File "/usr/lib/python2.7/site-packages/simplekv/db/sql.py", line 56, in _put
    'value': data
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 720, in execute
    return meth(self, multiparams, params)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 317, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 817, in _execute_clauseelement
    compiled_sql, distilled_params
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
    context)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1108, in _handle_dbapi_exception
    exc_info
File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 185, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 940, in _execute_context
    context)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)

Unfortunately there are no minimal steps to reproduce except to keep producing new flask sessions until the error occurs. The most consistent reproduction has been when using JMeter to do load testing.

My recommendation is that the way session IDs are generated be completely changed. You should be using something for which collisions are guaranteed to be incredibly rare - perhaps switching to UUIDs for the ID portion of the key would be better.

mbr commented 10 years ago

The method should be sound - it's using 64 bits of an instance of SystemRandom(), which in turn should be using /dev/urandom (or at least be seeded by it). In addition, the current timestamp is used, so for a collision to occur, you'd have to have the same timestamp returned by datetime.utcnow() and the same random number by SystemRandom.getrandbits().
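
For illustration, here is a rough sketch of the key scheme just described (not the library's exact code): 64 random bits plus the creation timestamp, both rendered as hex, which matches keys like 'e0e17c94f388ba70_53d2ed1f' in the traceback above.

import calendar
from datetime import datetime
from random import SystemRandom

def new_session_key(bits=64):
    # random part from the OS CSPRNG, plus the current UTC timestamp
    rand = SystemRandom().getrandbits(bits)
    created = calendar.timegm(datetime.utcnow().utctimetuple())
    return '%x_%x' % (rand, created)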

A UUID4 will be using pretty much the same sources of randomness. A UUID1 might prevent your issue in theory, but it can only deliver roughly 128 UUIDs per tick per machine.

I'm confident that the basic idea is sound, here's what I think could be the problem:

This runs in parallel on a lot of machines that have little entropy (dedicated servers without hardware RNGs?) and the same startup state. That would be hard to hit, but it's somewhat conceivable - I don't think it's likely.

Most likely, there's an issue with the implementation either inside the library (my bad) or the surrounding app server. What does your application server look like? Does it fork per request? That + a bad system-random source would be the easiest way to generate duplicate session ids.

I'll have a look at SystemRandom() to see if getrandbits does what I want.

By the way, if you really want to use UUIDs, you can put it to the test: write an object supporting the getrandbits() method that derives its bits from a generated UUID.
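
For anyone who wants to run that experiment, here is a minimal sketch of such a drop-in random source (the class name and bit-collection approach are mine, not part of the library; how it gets plugged in depends on your Flask-KVSession version, e.g. via a SESSION_RANDOM_SOURCE-style config key if yours exposes one):

import uuid

class UUIDRandomSource(object):
    """Random source whose getrandbits() is fed by uuid4()."""

    def getrandbits(self, k):
        if k <= 0:
            raise ValueError('number of bits must be greater than zero')
        bits = 0
        collected = 0
        while collected < k:
            # each uuid4().int contributes 128 bits, of which 122 are random
            # (the version/variant bits are fixed) - fine for this experiment
            bits = (bits << 128) | uuid.uuid4().int
            collected += 128
        return bits >> (collected - k)  # trim excess bits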

It'd be nice if you could share some more info on your specific deployment though.

mbr commented 10 years ago

From Python's random.py:

class SystemRandom(Random):
    """Alternate random number generator using sources provided
    by the operating system (such as /dev/urandom on Unix or
    CryptGenRandom on Windows).

     Not available on all systems (see os.urandom() for details).
    """

    def random(self):
        """Get the next random number in the range [0.0, 1.0)."""
        return (long(_hexlify(_urandom(7)), 16) >> 3) * RECIP_BPF

    def getrandbits(self, k):
        """getrandbits(k) -> x.  Generates a long int with k random bits."""
        if k <= 0:
            raise ValueError('number of bits must be greater than zero')
        if k != int(k):
            raise TypeError('number of bits should be an integer')
        bytes = (k + 7) // 8                    # bits / 8 and rounded up
        x = long(_hexlify(_urandom(bytes)), 16)
        return x >> (bytes * 8 - k)             # trim excess bits

    def _stub(self, *args, **kwds):
        "Stub method.  Not used for a system random number generator."
        return None
    seed = jumpahead = _stub

    def _notimplemented(self, *args, **kwds):
        "Method should not be called for a system random number generator."
        raise NotImplementedError('System entropy source does not have state.')
    getstate = setstate = _notimplemented

getrandbits() is just a wrapper for calling os.urandom(), as expected.

Did you try increasing SESSION_KEY_BITS in your config to 128, or even 256?
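
For reference, a sketch of what that config bump looks like, assuming a standard Flask app object named app (SESSION_KEY_BITS is the setting mentioned above; the 64 bits discussed earlier is the default):

# raise the number of random bits used per session key
app.config['SESSION_KEY_BITS'] = 128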

mbr commented 10 years ago

Actually, the more I think about it, the more likely it seems that it's not a security problem - the same session is simply created twice on the storage level.

I assume you're using simplekv's SQL database backend. When updating a session, the old row is DELETEd before the new one is INSERTed again. This should happen inside a transaction, but may break if your isolation level is insufficient (or there's an actual bug inside simplekv). It's done this way because Postgres and others do not support upserts; see http://lucumr.pocoo.org/2014/2/16/a-case-for-upserts/ for a lengthy treatise on the subject. Some of the issues described there may be the exact same ones you are running into right now.
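
To make the pattern concrete, here is a rough sketch of such a delete-then-insert update (illustrative only, not simplekv's actual code; the table layout mirrors the traceback and the DSN is hypothetical). Under a weak isolation level, two concurrent updates of the same key can both get past the DELETE and then collide on the INSERT with exactly the duplicate-key error shown above.

from sqlalchemy import (Column, LargeBinary, MetaData, String, Table,
                        create_engine)

engine = create_engine('postgresql:///example')
metadata = MetaData()
sessions = Table('sessions', metadata,
                 Column('key', String, primary_key=True),
                 Column('value', LargeBinary))

def put(key, data):
    # one transaction: remove any old row for this key, then insert the new value
    with engine.begin() as con:
        con.execute(sessions.delete().where(sessions.c.key == key))
        con.execute(sessions.insert().values(key=key, value=data))
    return key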

If that is the case, you do not have as glaring a security problem (though I could think of others resulting from it, so this is still worth fixing!) - only someone who knows the session id can cause the issue (to verify, you could check whether the duplicate session problems originate from the same IPs as the ones where the user legitimately holds said session), and it will currently cause some information inside the session to be lost (i.e. not written when updating the session).

Those are my hunches for now; I need a little more information to work with before I can fix anything =). You can mail me, of course, if any of that stuff is confidential.

mbr commented 10 years ago

In closing: I've checked on #postgres on IRC; here's my final assessment:

A pragmatic short-term solution for your deployment may be moving to a redis backend, which handles all of these things much more easily. Sorry that I can't help out more on short notice.
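
For anyone taking that route, a minimal sketch of a redis-backed setup via simplekv might look like the following (module and class names as per the simplekv and Flask-KVSession docs; verify them against the versions you actually have installed):

from flask import Flask
from redis import StrictRedis
from simplekv.memory.redisstore import RedisStore
from flask_kvsession import KVSessionExtension  # 'flaskext.kvsession' in older releases

app = Flask(__name__)

# sessions are stored in redis instead of the SQL backend
store = RedisStore(StrictRedis(host='localhost', port=6379, db=0))
KVSessionExtension(store, app)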

You're very welcome to dig deep into the details of transactions and upserts across databases and fix the issue by writing a better implementation for SQLAlchemyStore, though - that part would be much appreciated =).

Some reading:

etscrivner commented 10 years ago

@mbr Thanks so much for your reply. It really helps to illuminate what has been a puzzling and frustrating issue. The issue is actually pretty serious for us: even with a small set of concurrent users (around 40, as opposed to the upwards of 10K we're expecting at launch), we're still seeing this happen with some frequency. Switching to Redis might be the best solution at this point, as it will likely be faster and less fraught with problems. Perhaps adding a warning to this plugin's documentation about using it with PostgreSQL would be helpful? I'd be more than happy to contribute in any way that I can.

etscrivner commented 10 years ago

I'll add some more info regarding our deployment to this post shortly.

macrojames commented 9 years ago

I also had collisions appearing with FileStorage. I did not debug too much, as it seems to work for some time after a webserver reload and I am still in development. I have 5 threads running through WSGI; perhaps this could be a problem if one request wasn't finished and the next Ajax call comes in too soon?

eMerzh commented 9 years ago

I'm not sure yet, but I think I've experienced it too using redis ... so this might be a bigger problem ...

mbr commented 9 years ago

@macrojames, @eMerzh: Can you supply some debugging output? Possibly even a small test case?

flaker commented 9 years ago

I have seen this error too, using memcached as the backend. In my case I saw a memcache error (weird timeouts while being one of three users on a testing env); after that I tried to log in to my app and the data that was available was from another user. I also make use of session.regenerate() for security reasons.

When I see it again (I've seen it 3 or 4 times already) I will try to check whether it is always related to an error in the persistence mechanism.

mbr commented 9 years ago

I am grateful for any reproducible test case. Unfortunately, I have not run into the issue myself yet, which makes it hard to pin down and fix.

mmautner commented 8 years ago

Any more news on this @mbr @etscrivner ?

@etscrivner what did you end up using for server-side session storage since you found this one unreliable? Sounds like changing your session backend to Redis or using Flask-Session instead is preferable

mbr commented 8 years ago

I haven't followed up on this, as I am still without a reproducible problem =). @etscrivner Details would be appreciated.

@mmautner Do you run into this issue as well? If so, could you maybe post some (minimal?) example code that reproduces the issue?

etscrivner commented 8 years ago

@mbr Whoops, my apologies, clicked the wrong button here. I was never able to fully fix this issue despite attempting all manner of solutions. My suspicion is that it's a deeper problem, either with how we configured Flask with uwsgi or with some non-thread-local data getting stomped. There probably won't be a "minimal test case" beyond doing something like load-testing with JMeter and waiting for the error. That was the only way I was able to reproduce it consistently.

giyyapan commented 8 years ago

@mbr Anything new about this? I'm using flask's default client-side session, but I'm also facing a similar issue where some users log into another user's account for no apparent reason (at least nothing special according to the request logs), and it's also very hard to reproduce. Could it be the same issue as yours?

mbr commented 8 years ago

> @mbr Anything new about this? I'm using flask's default client-side session, but I'm also facing a similar issue where some users log into another user's account for no apparent reason (at least nothing special according to the request logs), and it's also very hard to reproduce. Could it be the same issue as yours?

What's your setup, exactly? (Backend and WSGI server?)

giyyapan commented 8 years ago

@mbr Oh, so sorry! I found I mentioned the wrong person. I was trying to mention @etscrivner back then. This is not an issue with this repo, but since you asked, I'll try to describe what happened. (sorry for my poor English :p )

I'm using flask with gevent, with this command to run the server:

  blueware-admin run-program uwsgi --http 0.0.0.0:80 --wsgi-file server.py \
    --callable app \
    --master \
    --enable-threads \
    --single-interpreter \
    --lazy-apps \
    --gevent 100 \
    --logto $LOG_DIR/$LOG_FILE \
    --threaded-logger \
    --workers 2 \
    --logformat "$LOG_FORMATE"

I also run gevent.monkey.patch_all() in server.py. The blueware-admin in the command is a third-party wrapper which collects information from every request and uploads it. As for sessions, I'm not using any additional session config; I just get and set them.

I was surprised when I first found some of my users logged into another's account, because I was pretty sure the login part was correct. Then I tried to find more information in the logs. According to them, what happened is that a client's user-id in its session suddenly changed between two normal requests (judging by the client's IP and user-agent), and that really confused me :(

I'm now using an additional cookie to verify the user session; unlike the session, this cookie only changes when a user logs in (the session cookie on the client changes on every request, even if it's not modified). I also removed the blueware-admin layer from my app to make the architecture simpler. Now my startup command is:

uwsgi --http 0.0.0.0:80 --wsgi-file server.py \
    --callable app \
    --master \
    --gevent 100 \
    --logto $LOG_DIR/$LOG_FILE \
    --threaded-logger \
    --workers 2 \
    --logformat "$LOG_FORMATE"

After that, I added some special logging that will let me know when a user's id in the cookie is different from its id in the session.

A few days have passed now. I have not seen any of these log entries, nor any new user feedback about this issue. So I can't tell whether the issue is solved or just hasn't happened again yet.
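
A hypothetical sketch of that kind of cookie-vs-session consistency check (the cookie name 'uid' and the session key 'user_id' are made up for illustration):

from flask import Flask, request, session

app = Flask(__name__)

@app.before_request
def warn_on_session_mismatch():
    # compare the login-time cookie against the id stored in the session
    # and log any mismatch, as described above
    cookie_uid = request.cookies.get('uid')
    session_uid = session.get('user_id')
    if cookie_uid is not None and session_uid is not None \
            and str(cookie_uid) != str(session_uid):
        app.logger.warning('uid cookie %r does not match session user_id %r',
                           cookie_uid, session_uid)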

jeffcwang commented 8 years ago

Hello,

We've been experiencing this issue for about half a year. We finally got things working after upgrading to SQLAlchemy 1.1 and using UPSERTs with Postgres. I then overrode the _put operation of SQLAlchemyStore by creating a derived class.

# imports assumed for this snippet (not shown in the original comment)
from simplekv.db.sql import SQLAlchemyStore
from sqlalchemy.dialects.postgresql import insert

class StgoSQLAlchemyStore(SQLAlchemyStore):
    def __init__(self, bind, metadata, tablename):
        SQLAlchemyStore.__init__(self, bind, metadata, tablename)

    def _put(self, key, data):
        con = self.bind.connect()
        with con.begin():
            # delete the old
            con.execute(self.table.delete(self.table.c.key == key))

            insert_stmt = insert(self.table).values(
                key=key,
                value=data
            )

            upsert_stmt = insert_stmt.on_conflict_do_update(
                index_elements=['key'],
                set_=dict(value=data)
            )

            # upsert new
            con.execute(upsert_stmt)

            # commit happens here

        con.close()
        return key

Finally, inject this store into the init of kvsession:

session_store = StgoSQLAlchemyStore(engine, db.Model.metadata, 'kvsession')
kvsession.init_app(app=app, session_kvstore=session_store)

I hope this helps others who have been struggling with this issue.

mbr commented 8 years ago

@jeffcwang I'm afraid that will, at best, mask the problem by silencing the error. If User A uses Session X and it gets replaced by User B's session, also with the id X, User A and User B will still "share" a session. Your fix just silences the PostgreSQL error and makes it so that the active session will be that of B, not A.

jeffcwang commented 7 years ago

@mbr Sorry for the late response -- yes, you're correct about this. It's been a while, but if I recall, in our case it was the same user that had their session overridden, so User A == User B and it's all good. But agreed... not the best solution overall, and if collisions are from different users, it could pose a big challenge.

If you have any new suggestions on a better fix, please let me know.

mbr commented 7 years ago

> If you have any new suggestions on a better fix, please let me know.

The issue is that I am fairly stumped at this point. No one seems to be able to reproduce these errors, and I did review the code a while back. So far, I have only received workarounds, but no fixes for an actual bug.

I think there may be some miscommunication about the guarantees that are given by either the backend or simplekv. For example, in the beginning some people assumed that simplekv would somehow turn their filesystem into a transactional database (which, of course, it does not).

I would very much love to fix this issue, but with no one being able to replicate it, there's little I can do. I'm willing to look at live systems as well, time permitting.