core/storage: with rethinkdb being closed, what is our path forward?

aeneasr commented 8 years ago

This is the place to discuss things like:

Which database should we use (please add cons/pros)?
Should we continue with in-memory pub/sub (pros/cons)?

And of course, anything that you think is important on this topic!

SQL

(+) well supported on platforms like GCP, AWS, Azure
(+) ops know how to deal with SQL dbs
(+) SQL usually in every stack somewhere, as hydra is targeting existing infrastructure this is cool
(+) Go's SQL driver allows basically any SQL db
(-) read performance (probably) slower than k/v store like memcached/redis/etcd
(-) read performance slower than in-memory
(-) schema upgrades usually suck with RDBMS
... please extend
NoSQL (Mongo, Couch)
(+) schema upgrades are easier because no schema
(-) one connector per database, not a one-fits-all solution
(?) I don't know how read performance compares to SQL, probably depends on queries
... please extend
Redis
(+) fast
(?) durability seems to be an issue with redis as this needs to be handled by operators who know how to make sure that this actually happens
(-) primary a cache solution, might throw people off (why is this using redis?)
(-) dealing with regexp policies will be tricky
... please extend
Rethinkdb
(+) already implemented
(+) pub/sub concept is really cool, allows to keep stuff in memory in hydra
(-) pub/sub is not reliable yet and it might take a while to get there
(-) rethink company is shutting down, maybe future at stripe / community
(-) need something to limit in-memory data like LRU or similar

waynerobinson commented 8 years ago

There are a number of DB backends that could be good to use.

Redis is a big, established, mature key/value store with some very nifty atomic operations around pub/sub, bitmaps, set and list management and is very, very fast (100-700k QPS on commodity hardware). It's not just a caching store. It also has the benefit of being familiar to a large number of developers and devops and may already be part of a systems existing infrastructure.

Postgres/MySQL are also established players and are very mature and reliable. They are full RDMSes though and may be slower, but again from a developers perspective, they will probably have one of these already in use. The less education you have to do of new developers, the more likely they are to pick up and use your library.

I personally think the in-memory pub/sub of changes to tokens is wasted effort and will cause issues in production without making a significant difference to overall performance/throughput.

Pub/sub will always end up with a delay between the write and propagation to all the other Hydra instances. If you're running more than one Hydra instance, this propagation delay will cause a non-insignificant amount of token verification errors unless the load balancer maintains state (more wasted memory) or falls back to a DB query on a fail (even more complex caching engine).

Also, if you're effectively keeping an in-memory data cache in each Hydra instance, it needs to be running on hardware big enough to hold (at a minimum) all current, unexpired access and refresh tokens. This puts some significant limits on the scalability of Hydra as you basically have to scale each instance to be as big as your DB instance.

Finally, you need to ask, is Hydra a great authentication engine… or is it a caching/DB engine. Redis, Postgres, MySQL, etc are all tools purposely built to handle large quantities of data safely, efficiently and be as performant as possible. Does Hydra really need to replicate something like this to be a great authentication engine?

aeneasr commented 8 years ago

I am collecting some publicly available information on this topic. Feel free to extend this list:

http://code.jjb.cc/benchmarking-postgres-vs-redis
https://gist.github.com/NateBarnes/3001890 (faulty results because bug in code, but might be used for reproduction)
http://redis.io/topics/benchmarks
... more soon

janekolszak commented 8 years ago

Maybe just rely on GO's odbc driver and let users decide what they want?

aeneasr commented 8 years ago

@janekolszak ODBC is SQL only, right? So it would not allow users to use e.g. redis/memcached/mongo? However, I like the idea of this. On the con side we would need to rely on basic SQL queries, so no postgres magic.

janekolszak commented 8 years ago

I don't think hydra would benefit from nosql database. Using odbc would make force hydra to use simple queries. This would make writing storages for mongo etc. very easy.

There are some comersial odbc for mongo.

aeneasr commented 8 years ago

I really like the idea going down the ODBC route, it would allow to run hydra on Google Cloud SQL, Amazon SQL, ...

waynerobinson commented 8 years ago

Ick. ODBC is lowest common denominator as database drivers go and using it on anything but Windows can be problematic.

If you want to simplify DB access within Go, it surely has one or more libraries that support multiple DB backends. This would be better than ODBC, which is more of a protocol abstraction and tends to have very poor performance.

Also, I think RDMS is too heavy for Hydra (although it's a good, simple starting point) and that a key/value store like Redis is all that's necessary anyway.

aeneasr commented 8 years ago

@waynerobinson I think what @janekolszak meant is https://github.com/golang/go/wiki/SQLDrivers

aeneasr commented 8 years ago

Here is a critical view of having redis as a primary datastore and what to consider when rolling with this option anyways: https://muut.com/blog/technology/redis-as-primary-datastore-wtf.html

waynerobinson commented 8 years ago

That sounds much better because ODBC itself is… not good. :-p

If you've got any other choice that is.

aeneasr commented 8 years ago

Another point to consider is that we need some sort of regexp matching in our datastore in order to fetch the right policies (by subject). This was actually one of the considerations in going in-memory for policies.

waynerobinson commented 8 years ago

There are a lot of good notes in there on Redis. Using a persistence slave is a little strange but I understand why they did so in their implementation.

I may have over-played the durability aspect. In our experience we have never once had a Redis process terminate uncleanly or had to deal with hardware failure causing data loss. However, these are obvious possibilities.

But you have to consider that you're talking about microseconds between a Redis instance dying or becoming unavailable and it's data not persisting to storage. These are both incredibly rare events and incredibly unlikely to have unpersisted data when they happen.

You need to consider your use case too. Even if there is data lost, what is the end-user impact? If a token is lost, the user is treated as unauthorized and must recomplete that process. This might be slightly unexpected, but it's hardly unprecedented or especially jarring to find out you've been logged out.

I wouldn't want to store my user accounts or three preferences in Redis, and we don't. We don't personally use Redis for our queue backend because we need for durability and guarantees deliverability of our messages because I'm a little insane about it.

But access tokens are largely ephemeral anyway. If we lose one, I don't care, we'll just regenerate. And the probability of losing one in a durability event anyway that I care even less.

aeneasr commented 8 years ago

We're not only talking about tokens though, it's also about OAuth2 Clients, JWKs, policies and maybe more stuff in the future!

ps: I updated the first post

waynerobinson commented 8 years ago

Redis has glob-matching behavior with http://redis.io/commands/scan. It's not regex though.

Policies may make some sense to store locally because they're not really data as much as they are configuration (although our use case will end up with dozens of custom policies per user).

aeneasr commented 8 years ago

I would like to keep policies in the database and follow 12factor. And I don't like configuration files :) It makes Docker much harder

waynerobinson commented 8 years ago

Some comments about Redis and those other datasets. OAuth2 clients, policies and keys change so rarely that the likelihood of a persistence issue (something that's already very rare) effecting one of these types of writes may as well be considered statistically impossible.

waynerobinson commented 8 years ago

One other thing to note. RethinkDB as a database is going away. But Rethink itself is open source. We don't know what type of community support will spring from the death of the company yet, but it doesn't mean you have to abandon support altogether.

aeneasr commented 8 years ago

We should consider what people see, when they see hydra. I think the choice of rethinkdb was reasonable, despite the possible issues which would have been resolved, including missing tokens due to pub/sub latency, lost insert, update and delete notifications.

With redis my fear is that most people see it as a cache layer (yes, it's a datastore too), so what will they think when they read "backed by redis"? It (apparently) includes operational complexity which you don't have with a managed Google Cloud SQL / AWS / Azure one-click deploy. On the other hand, speed is a factor.

One other thing to note. RethinkDB as a database is going away. But Rethink itself is open source. We don't know what type of community support will spring from the death of the company yet, but it doesn't mean you have to abandon support altogether.

Definitely and I think that this process of reflection is important as we get closer to the stable 1.0 release of hydra! I will put rethinkdb on the list in the first post

waynerobinson commented 8 years ago

Maybe it's a bias I have coming from the Ruby world, but it's used all the time here for primary-ish data stores. No one would use it to back a billing system. But high-volume messaging and the like are powered all the time by Redis. And apart from Clients and Policies, the data Hydra is primarily dealing with is largely emphemeral.

aeneasr commented 8 years ago

I would like to introduce API Keys in the 0.6.0 release, which would be used in cases like: "I want Google Maps on my app so I need an API Key that I put in my code". Again, if one API Key is gone - so what? But that's not for Hydra to decide IMHO. Hydra should put in the best efforts to be a solution such as: "Wow, this is so easy to use, resiliant, available, reliable and fast". I think that redis satisfies that only partially. The current approach does not solve this perfectly either, but I think it was closer to it. The question I have is: why not talk a little bit about other possible solutions, such as SQL?

boyvinall commented 8 years ago

Probably worth also to consider the ease of operation in a fault-tolerant scenario. RethinkDB was/is nice here because of the clustering, similarly mongo. CockroachDB is an interesting one for this too though .. and it's using postgres wire protocol. Unfortunately doesn't have the change notifications (yet) ... that's currently scheduled for adding next year. Older SQL do obviously allow readonly replicas with some yak-shaving for slave promotion, but natively-clustered are easier to work with.

Got to agree with not throwing out RethinkDB just yet though. It's good to be aware of plans, but you don't need to actively migrate off it just yet.

Zev23 commented 8 years ago

Couchbase

First of all, Couchbase is not CouchDB and I'm just a user. I do not have any performance benchmark so i will mention things it can be applied to Hydra as storage.

(+) Both K/V (binary) or Document (JSON) store.
(+) Support secondary indexes
(+) Have official Go SDK. https://github.com/couchbase/gocb
(+) N1QL is A LOT like SQL. Can perform CRUD, JOIN and aggregation.
- In RDBMS, FROM sometable clause refers to table in database with fixed schema.
- However in Couchbase, FROM bucket is kind of like database in RDBMS. So documents of any type are in the same bucket.
- Usually each document will have a property "type" to differentiate themselves or prefix type in id "accesstoken::123456"
(+) N1QL has regex. http://developer.couchbase.com/documentation/server/4.0/n1ql/n1ql-language-reference/patternmatchingfun.html
(+) Can leverage Go's database/sql API: https://github.com/couchbase/go_n1ql
(+) Easy scale out/clustering. (No need table/key partitioning upfront)
(+) Master/Master replication
(+) Admin Console is realtime.
(-) Community Edition (4.1) atleast 6-months outdated than Enterprise Edition(4.5)
(+) TTL supported in Go SDK
(-) TTL NOT supported in N1QL
(-) TTL NOT supported in Sync Gateway REST API

In-Memory PUB/SUB

IMHO, go ahead and create 2 set of managers. Let the user decide whether to use in-memory or not.

There are few ways that you can get change feeds from Couchbase. Each with their own traits and limitations.

1. Sync Gateway REST API

If use this method, client must use Sync Gateway REST API for all CRUD because they have extra properties added to the document.
Still able to use Go SDK or N1QL to query the documents. http://stackoverflow.com/questions/36871589/query-sync-gateway-buckets-using-n1ql
Cannot use K/V. Only document.
It provides 2 methods:
- Webhook (Push)
- Calls and send changed document to external service whenever a change occurs.
- If have n endpoints, need to set n webhooks. Not scalable.
- No replay on failure
- More Info: http://developer.couchbase.com/documentation/mobile/current/develop/guides/sync-gateway/server-integration/index.html
- Changes Feed (Pull)
- Multiple clients can request at the same time. Scalable.
- Can replay or start from a specific sequence per request.
- More Info: http://developer.couchbase.com/documentation/mobile/1.3/develop/references/sync-gateway/rest-api/index.html#operation---db--_changes-get
- Client have to perform 2 calls for every watch loop:
  1. Call to _changes to get list changed document ids.
  2. Call to _bulk_get to get the actual documents.

2. dcpl: lightweight Database Change Protocol (DCP) proxy

Experimental? Its not release.
Document CRUD can use GO SDK or N1QL to perform
Its like a mini server that serve change feed through websocket or http continuosly
Cannot start from specific sequence
Only 3 modes:
- 'everything' -- start from the beginning, and listen forever (default)
- 'from_now' -- start from current state, and listen forever
- 'till_now' -- start from the beginning, and listen until current state
Response contains actual document so no need extra step to get document.
https://github.com/couchbaselabs/dcpl

3. Go SDK (gocb)

Build own custom solution since dcpl is build from Go SDK and written in Go.
Use streamingBucket.
No documentation.

Every replication or sync operation Couchbase use the DCP to do it. Recently they start releasing dcp-client sdk for Java. No news for other lanaguages yet.

aeneasr commented 8 years ago

awesome, thanks for the sum up @Zev23

aeneasr commented 8 years ago

I think going down the https://github.com/golang/go/wiki/SQLDrivers route is the smartest thing to do right now. It allows to use any SQL RDBMS or ODBC-ish variant, so it will work with managed solutions like Google Cloud SQL or Amazon RDS.

I am not opposed to have non-traditional databases supported such as Redis, etcd or similar and I think that @waynerobinson has raised some good points on why Redis could rock with Hydra. For the time being, most of the environments that are using Hydra will not require that amount of throughput and low-latency which is why I think that going the SQL "runs everywhere because you have SQL somewhere anyways and if not it's 2 minutes to set up in any cloud environment"-route first is smart. I think it would rock to implement one store and have this statement up:

Hydra runs with all well-known SQL-compatible databases, including Couchbase (there is a couchbase driver for go SQL), MS SQL, MySQL, ODBC, Oracle, QL, Postgres, HANA, SQLite, Sybase, ...

aeneasr commented 8 years ago

Thanks for all the input, this issue is now #292

ory / hydra