let4be / crusty

Broad Web Crawler
GNU General Public License v3.0

Migrate job management system to Redis #2

Closed let4be closed 3 years ago

let4be commented 3 years ago

While the current "queue-like system" on top of ClickHouse worked quite well for testing, it's nowhere near good enough for any serious high-volume use.

Recently I did some testing on beefy AWS hardware and fixed some internal bottlenecks (not yet merged). In some testing scenarios where I could temporarily alleviate the last remaining bottleneck, job distribution (writing new / updating completed / selecting), Crusty was capable of doing over 900 MiB/sec, a whopping 7+ Gbit/s, on a 48-core (96 logical) c5.metal with a 25 Gbit/s port.

The new job queue should be solely Redis-based, using Redis modules: https://redis.io/topics/modules-intro

Rust has a good enough library for writing Redis module logic: https://github.com/RedisLabsModules/redismodule-rs

We will use a pre-sharded queue (sharded by addr_key).
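To illustrate what pre-sharding by addr_key could look like, here is a minimal sketch. It assumes addr_key can be treated as an opaque byte string and that the number of shards is fixed up front; the function names `fnv1a64` and `shard_for` are hypothetical, and FNV-1a is just one stable hash picked for the example, not necessarily what Crusty uses.

```rust
/// FNV-1a 64-bit hash. Chosen for this sketch because its output is fixed by
/// the algorithm itself, so every crawler node maps a key to the same shard.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Map an addr_key (assumed here to be opaque bytes) to one of
/// `shard_count` pre-created shards.
fn shard_for(addr_key: &[u8], shard_count: u64) -> u64 {
    assert!(shard_count > 0, "need at least one shard");
    fnv1a64(addr_key) % shard_count
}
```

Because the shard count is part of the mapping, changing it later would move keys between shards, which is why the shards are created ahead of time rather than grown on demand.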

Atomic operations:

  1. Enqueue jobs
  2. Dequeue jobs
  3. Finish jobs
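The three operations can be modelled in memory to make the intended state transitions concrete. This is only a sketch under assumptions: the `Shard` type and its fields are hypothetical, a `HashSet` stands in for the Bloom filter history, and a `VecDeque` is used for pending jobs so the example is deterministic. In a real Redis module each method would be one command, which Redis runs to completion before serving the next one.

```rust
use std::collections::{HashSet, VecDeque};

/// In-memory model of one queue shard (hypothetical layout).
#[derive(Default)]
struct Shard {
    seen: HashSet<String>,      // history; stand-in for a Bloom filter
    pending: VecDeque<String>,  // jobs waiting to be picked up
    in_flight: HashSet<String>, // jobs handed out but not yet finished
}

impl Shard {
    /// Enqueue a batch, skipping anything already in history.
    /// Returns how many jobs were actually added.
    fn enqueue(&mut self, domains: &[&str]) -> usize {
        let mut added = 0;
        for d in domains {
            if self.seen.insert(d.to_string()) {
                self.pending.push_back(d.to_string());
                added += 1;
            }
        }
        added
    }

    /// Dequeue up to `max` jobs and mark them as in flight.
    fn dequeue(&mut self, max: usize) -> Vec<String> {
        let n = max.min(self.pending.len());
        let batch: Vec<String> = self.pending.drain(..n).collect();
        for d in &batch {
            self.in_flight.insert(d.clone());
        }
        batch
    }

    /// Finish a batch. Returns how many of the given jobs were in flight.
    fn finish(&mut self, domains: &[&str]) -> usize {
        domains.iter().filter(|d| self.in_flight.remove(**d)).count()
    }
}
```

Every method takes a whole batch, so one round trip can move many jobs at once; a finished domain stays in `seen`, so enqueueing it again is a no-op.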

Using the correct underlying data types (mostly sets, plus a Bloom filter for history) together with batching and pipelining, we can get solid throughput, low CPU usage per Redis node, and decent reliability and scalability.

Careful expiration could help avoid memory overflow on a Redis node, since we always discover domains faster than we can process them.
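For the history part, a Bloom filter trades a small false-positive rate for fixed memory use, which matters when domains are discovered faster than they are processed. Below is a minimal sketch of the idea using only the standard library; the `Bloom` type, its sizing parameters, and the double-hashing scheme are assumptions for illustration, and a production setup would more likely use an existing Bloom filter implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fixed-size Bloom filter (illustrative, not tuned).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl Bloom {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes; the i-th probe position is derived as a + i * b.
    fn hash_pair(item: &str) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        item.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (item, 0x9e3779b97f4a7c15u64).hash(&mut h2);
        (h1.finish(), h2.finish() | 1) // odd step so probes do not collapse
    }

    /// Insert `item`; returns true if it was possibly present already.
    fn check_and_insert(&mut self, item: &str) -> bool {
        let (a, b) = Self::hash_pair(item);
        let mut all_set = true;
        for i in 0..self.num_hashes {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                all_set = false;
                self.bits[word] |= mask;
            }
        }
        all_set
    }

    /// True if `item` may have been inserted; false means definitely not.
    fn might_contain(&self, item: &str) -> bool {
        let (a, b) = Self::hash_pair(item);
        (0..self.num_hashes).all(|i| {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}
```

A false positive here means a never-seen domain is occasionally skipped, which is usually an acceptable cost for a broad crawler; a false negative cannot happen, so a domain already in history is never enqueued twice.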