palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

TransactionManager: Remove startup dependency on underlying KeyValueStore #930

Closed: IanMair closed this issue 6 years ago

IanMair commented 8 years ago

TransactionManagers.create must be called in a Dropwizard application's run method because it must register the leader/time/lock endpoints. Therefore AtlasDB must start up cleanly in order for the application to actually start up. Currently this does not happen if Cassandra is unreachable, which is bad because it inherently imposes a startup ordering on the services in a stack, which is not a desired state.
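For concreteness, a minimal sketch of the coupling being described, assuming a standard Dropwizard application; the exact TransactionManagers.create overload, package names, and return type vary by AtlasDB version, and the configuration accessor here is hypothetical:

```java
import com.google.common.collect.ImmutableSet;
import com.palantir.atlasdb.config.AtlasDbConfig;
import com.palantir.atlasdb.factory.TransactionManagers;
import com.palantir.atlasdb.transaction.api.TransactionManager;
import io.dropwizard.Application;
import io.dropwizard.Configuration;
import io.dropwizard.setup.Environment;

// Hypothetical application config that carries an AtlasDbConfig block parsed from YAML.
class AtlasBackedConfiguration extends Configuration {
    private AtlasDbConfig atlasDbConfig;

    AtlasDbConfig getAtlasDbConfig() {
        return atlasDbConfig;
    }
}

public class AtlasBackedApplication extends Application<AtlasBackedConfiguration> {
    @Override
    public void run(AtlasBackedConfiguration config, Environment environment) {
        // TransactionManagers.create registers the leader/time/lock endpoints,
        // which is why it has to run inside run(). If Cassandra is unreachable
        // at this point the call fails and the whole application never starts.
        TransactionManager txManager = TransactionManagers.create(
                config.getAtlasDbConfig(),
                ImmutableSet.of(),                                     // schemas
                resource -> environment.jersey().register(resource));  // endpoint registrar
        // ... register the application's own resources using txManager ...
    }
}
```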

tpetracca commented 8 years ago

Tracking internally at PDS-45417

rhero commented 8 years ago

We're going to look into this but need to look over the startup code a bit.

tpetracca commented 8 years ago

I'll let clockfort comment on what he believes the difficulty here to be (considering he wrote all the connection stuff himself), but depending on the amount of lift required we may want to delay any efforts here until we move over to the CQL driver.

clockfort commented 8 years ago

(Couldn't figure out how to write this without referencing some internal projects by name)

If you do this you need to put in additional Dropwizard healthChecks; the reason this fails fast now is that it was better to fail at startup than to have a running but uselessly broken application that can't talk to the DB.

I don't even know how you'd pull a change like this back into PG safely, given that it doesn't do Dropwizard-style health checking and alerting. Maybe add that feature to PG? Or add a toggle on this side for whether we want to die hard-and-fast or stay up but braindead?
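For reference, a minimal sketch of the kind of Dropwizard health check being suggested here, using the standard com.codahale.metrics HealthCheck API; the connectivity probe is a hypothetical stand-in for whatever reachability check the KVS layer exposes:

```java
import com.codahale.metrics.health.HealthCheck;

// Hypothetical probe; stands in for a real check that the underlying KVS is reachable.
interface KeyValueStoreProbe {
    boolean isKvsReachable();
}

// Reports unhealthy while Cassandra is unreachable, so a "started but braindead"
// application is at least visible to monitoring instead of failing silently.
public final class KeyValueStoreHealthCheck extends HealthCheck {
    private final KeyValueStoreProbe probe;

    public KeyValueStoreHealthCheck(KeyValueStoreProbe probe) {
        this.probe = probe;
    }

    @Override
    protected Result check() {
        return probe.isKvsReachable()
                ? Result.healthy()
                : Result.unhealthy("Cannot reach the underlying key-value store");
    }
}

// Registered in the Dropwizard run() method, e.g.:
// environment.healthChecks().register("atlasdb-kvs", new KeyValueStoreHealthCheck(probe));
```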

I think the base assertion here is wrong, really: it is already the case, whether elements wants it or not, that there is a logical startup order to many services, but elements chooses to ignore this for management simplicity and hopes everything magically comes up correctly given a certain amount of timeout slop and erroring at startup in all of the interconnected applications. (I'm not trying to be critical here; I actually like microservices / many simple things coming together in confluence style.) PG's version of service management, despite other flaws, does understand service dependencies correctly.

Possible ways forward:

tpetracca commented 8 years ago

The way I understand it, we're never getting service dependency ordering in the elements world, and so this needs to be addressed.

I agree that this becomes hard because of PG, and that that entire consumer world expects a dependency ordering and expects its main service to be dead if Cassandra cannot service reads. This is just how that world operates, and it is not reasonable for us to change that.

Therefore, I think either a configuration point or an in-code specification when creating a TransactionManager is the correct way to handle this situation. Assuming we do that, we can simply not change PG behavior at all.

As for the behavior when we don't fail despite Cassandra being inaccessible, I propose the following:

@IanMair @clockfort does that sound reasonable? Do we still want to fatally die on the invalid configuration checks that can only occur when Cassandra becomes available?

clockfort commented 8 years ago

Okay, that sounds reasonable. edit: What's the plan for if someone makes a request to atlas before it's up? Block indefinitely? Throw immediately? Timeout?

tpetracca commented 8 years ago

@clockfort my gut tells me throwing an exception immediately and then requiring the consumer to handle it properly is probably the way to go.
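A rough sketch of that fail-fast behavior, as a wrapper that only hands out the real TransactionManager once background initialization has succeeded (the names here are hypothetical, not a committed AtlasDB API):

```java
import java.util.concurrent.atomic.AtomicReference;

public final class PendingTransactionManagerHolder<T> {
    private final AtomicReference<T> delegate = new AtomicReference<>();

    // Called by the background startup task once Cassandra becomes reachable
    // and the real transaction manager has been created.
    public void onInitialized(T txManager) {
        delegate.set(txManager);
    }

    // Consumers call this per request; throwing immediately forces them to
    // handle the "AtlasDB not up yet" case explicitly.
    public T get() {
        T txManager = delegate.get();
        if (txManager == null) {
            throw new IllegalStateException(
                    "AtlasDB is not initialized yet; the key-value store may still be unreachable");
        }
        return txManager;
    }
}
```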

IanMair commented 8 years ago

@tpetracca proposal sounds reasonable. I'm not sure fatally dying on invalid config checks when Cassandra becomes available is desirable here, though - can those config checks be done at startup time? Are there examples of KVS config that can't be checked until Cassandra becomes available? If the config is malformed (missing a required field, contains a value of the incorrect type, etc.), I'd expect that to be caught at startup.

And yeah, in reference to @clockfort's question, agree with throwing immediately and requiring proper handling by the consumer.

tpetracca commented 8 years ago

These configuration checks are not about malformed config (we will do those at startup time) so much as about whether your config agrees with, and makes sense given, the actual topology and replication-factor settings of your Cassandra cluster as it stands. The checks therefore need to do reads from Cassandra to actually validate.

For example, you have a replicationFactor of 3 and a keyspace called "atlasConsumer", but there already exists an "atlasConsumer" keyspace in Cassandra whose replicationFactor is set to 1. In cases like this we would hard fail, but we would actually need Cassandra up to know that it's bad.
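A sketch of what such a check might look like, with the lookup of the live keyspace settings left as a hypothetical helper, since the real code would have to read the keyspace definition from the running Cassandra cluster:

```java
public final class KeyspaceConfigValidator {

    // Can only run once Cassandra is reachable: it compares the configured
    // replication factor against what the existing keyspace actually uses.
    public void validate(String keyspace, int configuredReplicationFactor) {
        int actual = describeReplicationFactor(keyspace);
        if (actual != configuredReplicationFactor) {
            throw new IllegalStateException(String.format(
                    "Keyspace '%s' already exists with replication factor %d, "
                            + "but the AtlasDB config asks for %d",
                    keyspace, actual, configuredReplicationFactor));
        }
    }

    // Hypothetical helper: query the live cluster for the keyspace's replication settings.
    private int describeReplicationFactor(String keyspace) {
        throw new UnsupportedOperationException("requires a live Cassandra connection");
    }
}
```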

jboreiko commented 7 years ago

@rhero may want to consider this higher-priority work for Q1. It seems like there would be value now in particular, as we are seeing very frequent service restarts due to the internal cloud infrastructure changes. Also, the question of how the txManager should react when in the 'pending' state would be a good topic of discussion for our trip. I imagine products will have an opinion on what they would like this to look like, especially if it means a fair amount of work for them in terms of additional error handling.

rhero commented 7 years ago

Will this problem partially go away with an external Timelock service, since we won't need to register any endpoints in the Dropwizard app's run method?

I'm in agreement that "availability" and cleaning up the start-up process are becoming more of a priority, and I would like to get to this in Q1.

hsaraogi commented 7 years ago

PR is #1923.

pnepywoda commented 7 years ago

Hit this in CircleCI integration tests in foundry-core. Pinging to push for the PR to merge :)

gmaretic commented 6 years ago

This was done a long time ago