mozilla / ActiveData-ETL

The ETL process responsible for filling ActiveData
Mozilla Public License 2.0

ES Ingestion is behind #56

Closed klahnakoski closed 5 years ago

klahnakoski commented 5 years ago

The ES ingestion is 4 days behind.

klahnakoski commented 5 years ago

08:46 Ingestion into ES is 4 days behind; it feels like the same ingestion problem from many months ago. Since the majority of the data is coverage, I altered ingestion to include only 10% of the coverage records until it catches up.

08:51 Here is a chart of the number of S3 files in the ES ingestion backlog (each S3 file holds many, perhaps thousands of, records)

image

08:53 As the ETL caught up, it added to the ES ingestion queue much faster than ES could handle. You can see from the mostly flat slope afterward that ES could (barely) keep up if there were no backlog.
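The thread does not show how the 10% coverage sampling was implemented; below is a minimal sketch of one way to do it, with hypothetical `is_coverage` and `record_id` fields. Hashing the record id, rather than calling `random`, keeps the decision deterministic across ETL retries.

```python
import zlib

COVERAGE_SAMPLE_PERCENT = 10   # keep roughly 10% of coverage records

def keep_record(record):
    """Return True if this record should be sent to ES.

    Non-coverage records are always kept; coverage records are sampled by
    hashing their id so retries of the same file make the same choices.
    (`is_coverage` and `record_id` are hypothetical field names.)
    """
    if not record.get("is_coverage"):
        return True
    bucket = zlib.crc32(str(record["record_id"]).encode("utf8")) % 100
    return bucket < COVERAGE_SAMPLE_PERCENT
```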

klahnakoski commented 5 years ago

update:

image

klahnakoski commented 5 years ago

There are two possible solutions:

  1. Do not back up the latest coverage index (most recent day) - This puts less pressure on the backup machines, which are indexing everything
  2. Invalidate primary backup shards - The primary shards do some extra work, and the backup machines have all the primary shards. There is a command that can be sent to ES to invalidate a primary shard, which will promote a replica to be the new primary. Test this (see the sketch below).
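The thread does not name the command; a likely candidate is the `_cluster/reroute` API with a `cancel` command and `allow_primary: true`, which drops the primary copy on a node so ES promotes one of its replicas. A minimal sketch, with placeholder index, shard, and node names:

```python
import json
import requests

ES = "http://localhost:9200"   # placeholder cluster address

# Cancel the primary copy of shard 0 of "coverage" on a backup node;
# allow_primary lets ES promote one of the replicas to be the new primary.
body = {
    "commands": [
        {"cancel": {
            "index": "coverage",
            "shard": 0,
            "node": "backup-node-1",
            "allow_primary": True,
        }}
    ]
}
response = requests.post(ES + "/_cluster/reroute", json=body)
print(json.dumps(response.json(), indent=2))
```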
klahnakoski commented 5 years ago

https://github.com/klahnakoski/esShardBalancer/issues/1

klahnakoski commented 5 years ago

ES ingestion nodes parse and prepare new documents, and send them to the primary shards. The primary shards record the documents and then relay them to the replica shards. This means a primary shard does more work, and sends more data, than a replica.

There are 40 "spot" nodes, which hold replicas and respond to queries, plus 3 "backup" nodes, in another zone, which hold all the primary shards. By moving the primaries off the backup nodes, we should see a performance gain.
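One way to verify where the primaries actually live, before and after the move, is the `_cat/shards` API (not shown in the thread); a small sketch, using the same placeholder cluster address:

```python
import requests

ES = "http://localhost:9200"   # placeholder cluster address

# List every primary shard (prirep == "p") and the node it lives on.
resp = requests.get(
    ES + "/_cat/shards",
    params={"h": "index,shard,prirep,node", "format": "json"},
)
for shard in resp.json():
    if shard["prirep"] == "p":
        print(shard["index"], shard["shard"], "on", shard["node"])
```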

outbound bytes

image

The above chart shows the three backup nodes. Around the 17th you can see typical outbound network traffic, about 500 MB/minute. On the 18th you see a spike caused by a lost spot node: this is expected, since lost replicas are recovered from the primary shards, and we see the network transferring that data out to whatever nodes are taking over. Soon after the spike, before network activity dies down, you can see my efforts to move the primaries off of these backup nodes. Finally, on the 19th, we see the reduced outbound traffic.

For clarity, the traffic is not gone; it has just moved to other nodes in the spot zone, which outnumber these backup nodes by more than 10-to-1.

inbound packets

image

The esShardBalancer moves shards around the cluster. To move primaries off the backup nodes, it first moved a replica in to ensure the data was safe, and then moved the primary out. You can see the inbound replicas in this chart. You can also see that the inbound traffic does not change from before to after, as expected.
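The balancing logic lives in esShardBalancer, but the underlying ES operation is the `_cluster/reroute` `move` command, which relocates one shard between named nodes. A minimal sketch, with placeholder index, shard, and node names:

```python
import requests

ES = "http://localhost:9200"   # placeholder cluster address

# Relocate shard 0 of "coverage" from a backup node to a spot node.
# ES copies the shard to the target node before removing it from the source.
body = {
    "commands": [
        {"move": {
            "index": "coverage",
            "shard": 0,
            "from_node": "backup-node-1",
            "to_node": "spot-node-7",
        }}
    ]
}
requests.post(ES + "/_cluster/reroute", json=body).raise_for_status()
```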

the backlog

image

Ingestion is still behind, but it is looking good. On the 17th and 18th there was no coverage ingestion, so ES could catch up. After the 18th, coverage ingestion was turned back on and the shards started swapping, so it is not surprising that the backlog increased; ES ingestion is slow while shards are moving.

On the 19th shard movement calmed down, and the backlog started to fall again. It seems to be falling no faster than before, but coverage (the majority of the data) is now included, which suggests we have doubled the ingestion rate.

messages in progress

image

So, is ingestion faster? Is it fast enough?

background

The spot nodes each run a push_to_es script that is responsible for reading S3 buckets and bulk inserting them into the correct ES index. A work item is read off an SQS queue; the work item points to an S3 bucket, which is read and put into a memory queue; a sentinel is then added to the queue, to be used later to confirm the work item is complete. The memory queue is drained by another thread, which bulk inserts records as fast as it can; most of its time is spent waiting for ES to confirm. When this draining thread sees the sentinel on the memory queue, it confirms the work item as complete with SQS.
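A minimal sketch of that structure, not the actual push_to_es code: the `sqs`, `s3`, and `es` helpers are hypothetical stand-ins, and the point is only the bounded memory queue and the sentinel handshake that delays the SQS confirmation until ES has the records.

```python
import threading
from queue import Queue

SENTINEL = object()   # marks the end of one work item's records

def run_push_to_es(sqs, s3, es, index, queue_bound=5000, batch_size=1000):
    """One index's pipeline: a reader fills a bounded memory queue from SQS/S3,
    and a drainer bulk-inserts into ES, confirming each work item at its sentinel.
    `sqs`, `s3`, and `es` are hypothetical helpers, not real client objects."""
    memory_queue = Queue(maxsize=queue_bound)   # the bound provides backpressure

    def reader():
        while True:
            work_item = sqs.pop()                    # blocks for the next SQS message
            for record in s3.read(work_item.key):    # stream records from the S3 file
                memory_queue.put(record)             # blocks while the queue is full
            memory_queue.put((SENTINEL, work_item))  # drainer confirms it later

    def drainer():
        batch = []
        while True:
            item = memory_queue.get()
            if isinstance(item, tuple) and item[0] is SENTINEL:
                es.bulk_insert(index, batch)         # flush whatever is left
                batch = []
                item[1].confirm()                    # only now is the SQS message deleted
            else:
                batch.append(item)
                if len(batch) >= batch_size:
                    es.bulk_insert(index, batch)     # most time is spent waiting here
                    batch = []

    threading.Thread(target=reader, daemon=True).start()
    threading.Thread(target=drainer, daemon=True).start()
```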

This means push_to_es will read multiple work items and fill the memory queues (one per index) as fast as it can, no matter how long ES takes to respond. There is an upper limit: the memory queues are bounded, and when the bound is reached, no more work items are taken off SQS until ES has ingested more. Given 40 ingestion machines, we would expect k*40 SQS work items to be in progress at any one time. This is what the above chart shows: the number of work items taken off SQS, but not yet confirmed as being in ES.

The above chart is a rough measure of how much push_to_es is waiting on ES.

The spike on the 17th is a big block of perf records: those files are smaller, so many more work items fit in the memory queue. We expect this occasionally; ES is too busy ingesting, so work items accumulate in the memory queue to form bigger batches later. Otherwise the 17th shows typical behaviour: about 240 work items in progress (6 per machine on average), with the memory queues full and waiting for ES to ingest. Here is the same chart, but just the last 24 hours:

image

The drop in the middle of the 18th is where coverage ingestion was turned back on. The coverage work items are very big, so only (part of) one fits in the memory queue; we expect about 1 work item per machine, for a total of about 40. We see some in-progress spikes, but they are not as large, indicating that ES may no longer be the bottleneck to ingestion.

We must still wait and see, but it looks good.