Multi-DC HA deployment - Githubissues

gmucha commented 8 years ago

Hi, I was wondering on ways to improve resiliency in case of DC failure - and how to marry cross-DC replication with having a federated cluster. While we could set up cross–datacenter replication for C*, this alone would not work due to the fact that the data will be there – but the metadata will be missing Also – since the data wold be replicated into multiple data centers, a federated query would read the same data twice – this would probably break some aggregations like count() / sum().

From our point of view having a hot-hot multi-DC deployment is an important requirement, both in case of ingestion as well as querying.

I was trying to devise a way how to work around this limitation, some options I was considering were:

Create a Cassandra-backed metadata store (this means that probably not all types of filters would be supported – esp. like / regexp). This would basically mean doing something similar to what Kairos is doing (and possibly with similar restrictions on rows with huge tag cardinality). Again, this defeats the whole concept of federation.
Use something like Elasticsearch tribe node to act as a federated cluster (this would not play nice with rolling index policy, again the federation would not work) - this is just a thought.
Use another way to replicate ES (using a Kafka topic?) - or just mirror the pre-ingestion data into different datacenter using Kafka's mirror-maker
Other ideas?

By the way I was wondering what approach did you take with regard to rebuilding Elasticsearch indexes – as far as I understand, there’s no need to scan over all the dat a in Cassandra, only the row keys – is there an efficient way to do so- and what numbers are you seeing when rebuilding indexes? I was wondering if we could live with "normal" federation and only rebuilding Elastic indexes when there's failover - they're not that big and we could replicate other datacenters into different C* keyspaces - if there was a failover, the process would have to regenerate/update only the indexes for "remote" data - similarily, the metric-collecting agent could in this case switch to a different DC.

udoprog commented 8 years ago

Hey,

The setup I'd recommend is building metadata straight from the pipeline, sort of suggestion number three.

Assuming you are using a Kafka setup:

Setup consumer in each DC that writes to Elasticsearch this will cause a bit of extra cross-DC traffic. A lot of this traffic could be avoided by extending something like FFWD to write to a separate metadata topic with a much lower rate. This would be specifically designed to only advertise available time series over.
Configure each shard with Cross-DC tags include API nodes in the cluster from all available DCs.
Since you already have a data consumer, you could just as well ingest data as well, but this will be less consistent (e.g. non-identical rows) than cross DC Cassandra replication. I don't see a need to use MirrorMaker since there will be the same number of consumers in the receiving DC and less moving parts. But if you find this to be necessary, it is certainly an option.

With the above, Cross-DC metadata is not guaranteed to be consistent at all times. Effectively, no two snapshots ever will be. But I consider it 'close enough'. Another downside is that with the current naive round-robin federation pattern you might hit undesirable nodes. The system won't prefer closer nodes with lower latency if they are available. At the very least I'd like to see this in a future patch. Either through latency measurement, or extra metadata like dc_tags in the cluster configuration.

We rate-limit metadata ingestion to around 50k/s, with this, it takes us about 30 minutes to rebuild everything, or build a new index once it rotates. This limit is in place to reduce the overhead of ingestion to Elasticsearch. Without it we quickly kill the the cluster. I also expect that this rate will scale with the size of the cluster, so total time to ingest should not be adversely affected as we grow.

Rebuilding Elasticsearch from Cassandra is slow, and gets slower with the number of rows. We have clusters with hundreds of millions of row keys where iterating over them takes hours (can be faster if parallelized over the token space). I've considered writing a slow, rotating indexes that performs a number of slow, parallel traversals to index everything for the ones who need it. We don't mind that a time series disappears from Elasticsearch when we rotate two times (two weeks), and no data for it has been emitted. It's a feature we enjoy.

As a general principle I prefer to keep everything hot even when not in use to avoid complex failover procedures.

I hope this was helpful!

gmucha commented 8 years ago

Thanks for the write up- very helpful indeed. As far as I understand new rows get written into metadata storage only when new series appear (or, presumably, when a series spills into a new row in Cassandra). - but Heroic tries a metadata write every time a point is added, right? Theoretically, I could set up multiple MetadataBackends and write directly into them, but cross-DC latency would make this too costly for each item written (also, this would mean adding support for lost writes, using this for write and not read only). I guess doing this using Kafka would be nicer (just wondering how to effectively reduce # of metadata messages being written).

The "close enough" metadata is good enough for me, there will definitely exist some disparity between backend and metadata due to different rates at which data replicate.

BTW - as if on cue, Elasticsearch provided a writeup on scaling across datacenters using Kafka. (https://www.elastic.co/blogs/scaling_elasticsearch_across_data_centers_with_kafka)

udoprog commented 8 years ago

Thanks!

The metadata writes are rate-limited, uses a timed write-if-absent cache for each consumer, and uses op_type=create to reduce overhead on duplicates.

I'll mark this as an enhancement for now.

I can see this turning into 1) a tutorial on https://spotify.github.io/heroic/#!/tutorial/index and 2) some improvements to the clustering logic.

stale[bot] commented 5 years ago

⚠ This issue has been automatically marked as stale because it has not been updated for at least two months. If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open. This issue will automatically be closed in ten days if no further activity occurs. Thank you for all your contributions.

spotify / heroic

Multi-DC HA deployment #14