internetarchive / Zeno

State-of-the-art web crawler πŸ”±
GNU Affero General Public License v3.0
83 stars 11 forks source link

Implement host rotation and Enqueue/Dequeue access regulation via atomic booleans #99

Closed equals215 closed 4 months ago

equals215 commented 4 months ago

Host rotation

The goal here is to distribute hosts to dequeue from to each worker to avoid repeatedly hitting the same domain while crawling thus avoiding any rate-limiting and ventilate the RPS as much as possible between different hosts

Initial Challenge

The task was to find an efficient method for distributing numbers (i for hosts[i]) within an interval [0,len(hosts)] that can change over time, based on an atomic uint64 counter that increments with each iteration (dequeue). The goal was to achieve a uniform distribution while maintaining high performance.

To facilitate understanding we will name the atomic uint64 counter : x and the interval : y

Key Requirements

  1. Perfect coverage of the interval
  2. Low execution time
  3. Uniformity of distribution

Methodology

We implemented and tested eight different distribution methods:

  1. Simple Modulo (x%y)
  2. Double Modulo (x%(x%y))
  3. Linear Congruential Generator
  4. Xorshift
  5. Prime Multiply
  6. Wang-Jenkins
  7. Fibonacci
  8. Block Permutation

These methods were evaluated using a sample size of 100 million random inputs ranging representing x from 0 to 2^60 - 1, distributing into 150 bins, representing y.

Results

All methods except Double Modulo showed excellent uniformity and perfect coverage. The performance varied significantly among the methods.

Key Findings:

  1. Simple Modulo emerged as the best overall method, offering:

    • Perfect coverage
    • Fastest execution time (41.44 seconds)
    • Excellent uniformity (Coefficient of Variation: 0.0011)
  2. Prime Multiply and Fibonacci also performed well, with slightly longer execution times (48.99 and 49.11 seconds respectively) but equally good uniformity and coverage.

  3. Double Modulo performed poorly, showing significant non-uniformity and leaving empty bins.

  4. More complex methods like Wang-Jenkins, while providing good uniformity, had significantly longer execution times.

The results can be visually observed with a heatmap:
distributions

Conclusion:

Simple Modulo was identified as the optimal method for this specific use case. It offers the best balance of perfect coverage, high performance, and excellent uniformity. Its simplicity also contributes to ease of implementation, maintainability, and portability across different systems and programming languages.

It effectively addresses the challenge of distributing numbers in a dynamically changing interval using an incrementing uint64 counter, providing both efficiency and uniformity in the distribution process.

Enqueue/Dequeue access regulation

To further enhance the queue performances and stability two atomic booleans were added to the PersistentGroupedQueue type to provide control over the state of the queue. Two functions CanEnqueue() and CanDequeue() were also added to get the state of the queue.
The Dequeue() function now checks if it's allowed to dequeue, catching whenever Zeno is finishing, allowing workers to enqueue their last URLs prior to returning with a queue.ErrQueueClosed error triggered by CanDequeue() = false, facilitating the finishing process.

Benchmarks

Before changes :

Running benchmarks for Enqueue-Dequeue...
Notes:
    - an operation can be either an Enqueue or a Dequeue
    - ns/op is the average time taken per batch
goos: darwin
goarch: arm64
pkg: github.com/internetarchive/Zeno/internal/pkg/queue
BenchmarkEnqueueDequeue/EnqueueDequeue-10                    1    253605415167 ns/op             1.000 batches        100000 operations           394.3 ops/s
--- BENCH: BenchmarkEnqueueDequeue/EnqueueDequeue-10
    queue_test.go:346: β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€” RUN 1 β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”
    queue_test.go:374: Queue file size (megabytes): 10
    queue_test.go:377: Enqueue time for 50000 items: 1m53.343481083s
    queue_test.go:398: Dequeue time for 50000 items: 2m20.248690709s
    queue_test.go:399: Average enqueue time per item: 2.266869ms
    queue_test.go:400: Average dequeue time per item: 2.804973ms
PASS
ok      github.com/internetarchive/Zeno/internal/pkg/queue    253.841s

After changes :

Running benchmarks for Enqueue-Dequeue...
Notes:
    - an operation can be either an Enqueue or a Dequeue
    - ns/op is the average time taken per batch
goos: darwin
goarch: arm64
pkg: github.com/internetarchive/Zeno/internal/pkg/queue
BenchmarkEnqueueDequeue/EnqueueDequeue-10                  1    264612467250 ns/op          1.000 batches      100000 operations        377.9 ops/s
--- BENCH: BenchmarkEnqueueDequeue/EnqueueDequeue-10
    queue_test.go:346: β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€” RUN 1 β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”
    queue_test.go:375: Queue file size (megabytes): 10
    queue_test.go:378: Enqueue time for 50000 items: 2m11.310019042s
    queue_test.go:400: Dequeue time for 50000 items: 2m13.293973459s
    queue_test.go:401: Average enqueue time per item: 2.6262ms
    queue_test.go:402: Average dequeue time per item: 2.665879ms
PASS
ok      github.com/internetarchive/Zeno/internal/pkg/queue  264.848s
equals215 commented 4 months ago

@CorentinB can you also check if those changes won't interfere with the way we enqueue items from HQ? I have to go for tonight

CorentinB commented 4 months ago

Thanks for the PR, checking ASAP