The goal here is to distribute hosts to dequeue from to each worker to avoid repeatedly hitting the same domain while crawling thus avoiding any rate-limiting and ventilate the RPS as much as possible between different hosts
Initial Challenge
The task was to find an efficient method for distributing numbers (i for hosts[i]) within an interval [0,len(hosts)] that can change over time, based on an atomic uint64 counter that increments with each iteration (dequeue). The goal was to achieve a uniform distribution while maintaining high performance.
To facilitate understanding we will name the atomic uint64 counter : x and the interval : y
Key Requirements
Perfect coverage of the interval
Low execution time
Uniformity of distribution
Methodology
We implemented and tested eight different distribution methods:
Simple Modulo (x%y)
Double Modulo (x%(x%y))
Linear Congruential Generator
Xorshift
Prime Multiply
Wang-Jenkins
Fibonacci
Block Permutation
These methods were evaluated using a sample size of 100 million random inputs ranging representing x from 0 to 2^60 - 1, distributing into 150 bins, representing y.
Results
All methods except Double Modulo showed excellent uniformity and perfect coverage. The performance varied significantly among the methods.
Key Findings:
Simple Modulo emerged as the best overall method, offering:
Perfect coverage
Fastest execution time (41.44 seconds)
Excellent uniformity (Coefficient of Variation: 0.0011)
Prime Multiply and Fibonacci also performed well, with slightly longer execution times (48.99 and 49.11 seconds respectively) but equally good uniformity and coverage.
Double Modulo performed poorly, showing significant non-uniformity and leaving empty bins.
More complex methods like Wang-Jenkins, while providing good uniformity, had significantly longer execution times.
The results can be visually observed with a heatmap:
Conclusion:
Simple Modulo was identified as the optimal method for this specific use case. It offers the best balance of perfect coverage, high performance, and excellent uniformity. Its simplicity also contributes to ease of implementation, maintainability, and portability across different systems and programming languages.
It effectively addresses the challenge of distributing numbers in a dynamically changing interval using an incrementing uint64 counter, providing both efficiency and uniformity in the distribution process.
Enqueue/Dequeue access regulation
To further enhance the queue performances and stability two atomic booleans were added to the PersistentGroupedQueue type to provide control over the state of the queue. Two functions CanEnqueue() and CanDequeue() were also added to get the state of the queue.
The Dequeue() function now checks if it's allowed to dequeue, catching whenever Zeno is finishing, allowing workers to enqueue their last URLs prior to returning with a queue.ErrQueueClosed error triggered by CanDequeue() = false, facilitating the finishing process.
Benchmarks
Before changes :
Running benchmarks for Enqueue-Dequeue...
Notes:
- an operation can be either an Enqueue or a Dequeue
- ns/op is the average time taken per batch
goos: darwin
goarch: arm64
pkg: github.com/internetarchive/Zeno/internal/pkg/queue
BenchmarkEnqueueDequeue/EnqueueDequeue-10 1 253605415167 ns/op 1.000 batches 100000 operations 394.3 ops/s
--- BENCH: BenchmarkEnqueueDequeue/EnqueueDequeue-10
queue_test.go:346: βββββββββββββββββββββββ RUN 1 βββββββββββββββββββββββ
queue_test.go:374: Queue file size (megabytes): 10
queue_test.go:377: Enqueue time for 50000 items: 1m53.343481083s
queue_test.go:398: Dequeue time for 50000 items: 2m20.248690709s
queue_test.go:399: Average enqueue time per item: 2.266869ms
queue_test.go:400: Average dequeue time per item: 2.804973ms
PASS
ok github.com/internetarchive/Zeno/internal/pkg/queue 253.841s
After changes :
Running benchmarks for Enqueue-Dequeue...
Notes:
- an operation can be either an Enqueue or a Dequeue
- ns/op is the average time taken per batch
goos: darwin
goarch: arm64
pkg: github.com/internetarchive/Zeno/internal/pkg/queue
BenchmarkEnqueueDequeue/EnqueueDequeue-10 1 264612467250 ns/op 1.000 batches 100000 operations 377.9 ops/s
--- BENCH: BenchmarkEnqueueDequeue/EnqueueDequeue-10
queue_test.go:346: βββββββββββββββββββββββ RUN 1 βββββββββββββββββββββββ
queue_test.go:375: Queue file size (megabytes): 10
queue_test.go:378: Enqueue time for 50000 items: 2m11.310019042s
queue_test.go:400: Dequeue time for 50000 items: 2m13.293973459s
queue_test.go:401: Average enqueue time per item: 2.6262ms
queue_test.go:402: Average dequeue time per item: 2.665879ms
PASS
ok github.com/internetarchive/Zeno/internal/pkg/queue 264.848s
Host rotation
The goal here is to distribute hosts to dequeue from to each worker to avoid repeatedly hitting the same domain while crawling thus avoiding any rate-limiting and ventilate the RPS as much as possible between different hosts
Initial Challenge
The task was to find an efficient method for distributing numbers (
i
forhosts[i]
) within an interval[0,len(hosts)]
that can change over time, based on an atomic uint64 counter that increments with each iteration (dequeue). The goal was to achieve a uniform distribution while maintaining high performance.To facilitate understanding we will name the atomic uint64 counter :
x
and the interval :y
Key Requirements
Methodology
We implemented and tested eight different distribution methods:
x%y
)x%(x%y)
)These methods were evaluated using a sample size of 100 million random inputs ranging representing
x
from 0 to 2^60 - 1, distributing into 150 bins, representingy
.Results
All methods except Double Modulo showed excellent uniformity and perfect coverage. The performance varied significantly among the methods.
Key Findings:
Simple Modulo emerged as the best overall method, offering:
Prime Multiply and Fibonacci also performed well, with slightly longer execution times (48.99 and 49.11 seconds respectively) but equally good uniformity and coverage.
Double Modulo performed poorly, showing significant non-uniformity and leaving empty bins.
More complex methods like Wang-Jenkins, while providing good uniformity, had significantly longer execution times.
The results can be visually observed with a heatmap:
Conclusion:
Simple Modulo was identified as the optimal method for this specific use case. It offers the best balance of perfect coverage, high performance, and excellent uniformity. Its simplicity also contributes to ease of implementation, maintainability, and portability across different systems and programming languages.
It effectively addresses the challenge of distributing numbers in a dynamically changing interval using an incrementing uint64 counter, providing both efficiency and uniformity in the distribution process.
Enqueue/Dequeue access regulation
To further enhance the queue performances and stability two atomic booleans were added to the
PersistentGroupedQueue
type to provide control over the state of the queue. Two functionsCanEnqueue()
andCanDequeue()
were also added to get the state of the queue.The
Dequeue()
function now checks if it's allowed to dequeue, catching whenever Zeno is finishing, allowing workers to enqueue their last URLs prior to returning with aqueue.ErrQueueClosed
error triggered byCanDequeue() = false
, facilitating the finishing process.Benchmarks
Before changes :
After changes :