palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

TransactionManager: Remove startup dependency on underlying KeyValueStore #930

Closed: IanMair closed this issue 6 years ago

IanMair commented 8 years ago

TransactionManagers.create must be called in a Dropwizard application's run method because it must register the leader/time/lock endpoints. Therefore AtlasDB must start up cleanly in order for the application to actually start up. Currently this does not happen if Cassandra is unreachable, which is bad because it inherently imposes a startup ordering on the services in a stack, which is not a desired state.
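For concreteness, a minimal sketch of the coupling being described, assuming a standard Dropwizard application; the exact TransactionManagers.create overload, package names, and return type vary by AtlasDB version, and the configuration accessor here is hypothetical:

```java
import com.google.common.collect.ImmutableSet;
import com.palantir.atlasdb.config.AtlasDbConfig;
import com.palantir.atlasdb.factory.TransactionManagers;
import com.palantir.atlasdb.transaction.api.TransactionManager;
import io.dropwizard.Application;
import io.dropwizard.Configuration;
import io.dropwizard.setup.Environment;

// Hypothetical application config that carries an AtlasDbConfig block parsed from YAML.
class AtlasBackedConfiguration extends Configuration {
    private AtlasDbConfig atlasDbConfig;

    AtlasDbConfig getAtlasDbConfig() {
        return atlasDbConfig;
    }
}

public class AtlasBackedApplication extends Application<AtlasBackedConfiguration> {
    @Override
    public void run(AtlasBackedConfiguration config, Environment environment) {
        // TransactionManagers.create registers the leader/time/lock endpoints,
        // which is why it has to run inside run(). If Cassandra is unreachable
        // at this point the call fails and the whole application never starts.
        TransactionManager txManager = TransactionManagers.create(
                config.getAtlasDbConfig(),
                ImmutableSet.of(),                                     // schemas
                resource -> environment.jersey().register(resource));  // endpoint registrar
        // ... register the application's own resources using txManager ...
    }
}
```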

tpetracca commented 8 years ago

Tracking internally at PDS-45417

rhero commented 8 years ago

We're going to look into this but need to look over the startup code a bit.

tpetracca commented 8 years ago

I'll let clockfort comment on what he believes the difficulty here to be (considering he wrote all the connection stuff himself), but depending on the amount of lift required we may want to delay any efforts here until we move over to the CQL driver.

clockfort commented 8 years ago

(Couldn't figure out how to write this without referencing some internal projects by name)

If you do this you need to put in additional Dropwizard healthChecks; the reason this fails fast now is that it was better to fail at startup than to have a running but uselessly broken application that can't talk to the DB.

I don't even know how you'd pull a change like this back into PG safely, given that it doesn't do Dropwizard-style health checking and alerting. Maybe add that feature to PG? Or add a toggle on this side for whether we want to die hard-and-fast or stay up but braindead?
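For reference, a minimal sketch of the kind of Dropwizard health check being suggested here, using the standard com.codahale.metrics HealthCheck API; the connectivity probe is a hypothetical stand-in for whatever reachability check the KVS layer exposes:

```java
import com.codahale.metrics.health.HealthCheck;

// Hypothetical probe; stands in for a real check that the underlying KVS is reachable.
interface KeyValueStoreProbe {
    boolean isKvsReachable();
}

// Reports unhealthy while Cassandra is unreachable, so a "started but braindead"
// application is at least visible to monitoring instead of failing silently.
public final class KeyValueStoreHealthCheck extends HealthCheck {
    private final KeyValueStoreProbe probe;

    public KeyValueStoreHealthCheck(KeyValueStoreProbe probe) {
        this.probe = probe;
    }

    @Override
    protected Result check() {
        return probe.isKvsReachable()
                ? Result.healthy()
                : Result.unhealthy("Cannot reach the underlying key-value store");
    }
}

// Registered in the Dropwizard run() method, e.g.:
// environment.healthChecks().register("atlasdb-kvs", new KeyValueStoreHealthCheck(probe));
```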

I think the base assertion here is wrong, really: it is already the case, whether elements wants it or not, that there is a logical startup order to many services, but elements chooses to ignore this for management simplicity and hopes everything magically comes up correctly given a certain amount of timeout slop and erroring at startup in all of the interconnected applications. (I'm not trying to be critical here; I actually like microservices / many simple things coming together in confluence style.) PG's version of service management, despite other flaws, does understand service dependencies correctly.

Possible ways forward:

tpetracca commented 8 years ago

The way I understand it, we're never getting service dependency ordering in the elements world, and so this needs to be addressed.

I agree that this becomes hard because of PG, and that that entire consumer world expects a dependency ordering and expects its main service to be dead if Cassandra cannot service reads. This is just how that world operates, and it is not reasonable for us to change that.

Therefore, I think either a configuration point or an in-code specification when creating a TransactionManager is the correct way to handle this situation. Assuming we do that, we can simply not change PG behavior at all.

As for the behavior when we don't fail despite Cassandra being inaccessible, I propose the following:

@IanMair @clockfort does that sound reasonable? Do we still want to fatally die on the invalid configuration checks that can only occur when Cassandra becomes available?

clockfort commented 8 years ago

Okay, that sounds reasonable. edit: What's the plan for if someone makes a request to atlas before it's up? Block indefinitely? Throw immediately? Timeout?

tpetracca commented 8 years ago

@clockfort my gut tells me throwing an exception immediately and then requiring the consumer to handle it properly is probably the way to go.
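A rough sketch of that fail-fast behavior, as a wrapper that only hands out the real TransactionManager once background initialization has succeeded (the names here are hypothetical, not a committed AtlasDB API):

```java
import java.util.concurrent.atomic.AtomicReference;

public final class PendingTransactionManagerHolder<T> {
    private final AtomicReference<T> delegate = new AtomicReference<>();

    // Called by the background startup task once Cassandra becomes reachable
    // and the real transaction manager has been created.
    public void onInitialized(T txManager) {
        delegate.set(txManager);
    }

    // Consumers call this per request; throwing immediately forces them to
    // handle the "AtlasDB not up yet" case explicitly.
    public T get() {
        T txManager = delegate.get();
        if (txManager == null) {
            throw new IllegalStateException(
                    "AtlasDB is not initialized yet; the key-value store may still be unreachable");
        }
        return txManager;
    }
}
```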

IanMair commented 8 years ago

@tpetracca proposal sounds reasonable. I'm not sure fatally dying on invalid config checks when Cassandra becomes available is desirable here, though - can those config checks be done at startup time? Are there examples of KVS config that can't be checked until Cassandra becomes available? If the config is malformed (missing a required field, contains a value of the incorrect type, etc.), I'd expect that to be caught at startup.

And yeah, in reference to @clockfort's question, agree with throwing immediately and requiring proper handling by the consumer.

tpetracca commented 8 years ago

These configuration checks are not about malformed config (we will do those at startup time) so much as about whether your config agrees with, and makes sense given, the actual topology and replication-factor settings of your Cassandra cluster as it stands. The checks therefore need to do reads from Cassandra to actually validate.

For example, you have a replicationFactor of 3 and a keyspace called "atlasConsumer", but there already exists an "atlasConsumer" keyspace in Cassandra whose replicationFactor is set to 1. In cases like this we would hard fail, but we would actually need Cassandra up to know that it's bad.
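A sketch of what such a check might look like, with the lookup of the live keyspace settings left as a hypothetical helper, since the real code would have to read the keyspace definition from the running Cassandra cluster:

```java
public final class KeyspaceConfigValidator {

    // Can only run once Cassandra is reachable: it compares the configured
    // replication factor against what the existing keyspace actually uses.
    public void validate(String keyspace, int configuredReplicationFactor) {
        int actual = describeReplicationFactor(keyspace);
        if (actual != configuredReplicationFactor) {
            throw new IllegalStateException(String.format(
                    "Keyspace '%s' already exists with replication factor %d, "
                            + "but the AtlasDB config asks for %d",
                    keyspace, actual, configuredReplicationFactor));
        }
    }

    // Hypothetical helper: query the live cluster for the keyspace's replication settings.
    private int describeReplicationFactor(String keyspace) {
        throw new UnsupportedOperationException("requires a live Cassandra connection");
    }
}
```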

jboreiko commented 7 years ago

@rhero may want to consider this higher-priority work for Q1. It seems like there would be value now in particular, as we are seeing very frequent service restarts due to the internal cloud infrastructure changes. Also, the question of how the txManager should react when in the 'pending' state would be a good topic of discussion for our trip. I imagine products will have an opinion on what they would like this to look like, especially if it means a fair amount of work for them in terms of additional error handling.

rhero commented 7 years ago

Will this problem partially go away with an external Timelock service, since we won't need to register any endpoints in the Dropwizard app's run method?

I'm in agreement that "availability" and cleaning up the start-up process are becoming more of a priority, and I would like to get to this in Q1.

hsaraogi commented 7 years ago

PR is #1923.

pnepywoda commented 7 years ago

Hit this in CircleCI integration tests in foundry-core. Pinging to push for the PR to merge :)

gmaretic commented 6 years ago

This was done a long time ago