internetarchive / Zeno

State-of-the-art web crawler 🔱
GNU Affero General Public License v3.0

Have `queue.Enqueue()` handover items to idle workers and optimize workers routines #101

Closed equals215 closed 4 months ago

equals215 commented 4 months ago

Handover Process and Use of runtime.Gosched()

Handover Process

We've implemented a new "handover" mechanism to optimize item processing in our high-performance crawling system. Here's how it works:

  1. When new items are received, we attempt an immediate handover to available workers before enqueueing.
  2. We use a buffered channel (HandoverChannel) with a capacity of 1 for this purpose.
  3. The process loops through each item in a batch:
    • If the handover channel is empty, we put the item in for immediate processing.
    • If the channel is full, we encode the item and add it to the batch for later enqueueing.
  4. After processing all items, we check whether an item is still sitting in the handover channel, meaning no worker picked it up. If so, we take it back out, encode it, and add it to the batch for enqueueing.

This approach prioritizes immediate processing when workers are available, minimizing latency and reducing pressure on the main queue.
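
As a rough sketch of the steps above, not the actual Zeno code: `Item` and `handoverBatch` are hypothetical names, the encoding step is left out, and the returned slice stands in for the batch that would be enqueued. The non-blocking send and receive both rely on `select` with a `default` case.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for a crawl item.
type Item struct{ URL string }

// handoverBatch offers each item to idle workers through a channel of
// capacity 1 and returns the items that still have to go to the queue.
func handoverBatch(handover chan *Item, items []*Item) []*Item {
	var toEnqueue []*Item
	for _, it := range items {
		select {
		case handover <- it:
			// The slot was free: a worker can pick the item up directly.
		default:
			// The slot is occupied: keep the item for the queue batch.
			toEnqueue = append(toEnqueue, it)
		}
	}
	// If no worker took the last handed-over item, reclaim it so it is
	// not stranded in the channel.
	select {
	case left := <-handover:
		toEnqueue = append(toEnqueue, left)
	default:
	}
	return toEnqueue
}

func main() {
	handover := make(chan *Item, 1)
	items := []*Item{{URL: "a"}, {URL: "b"}, {URL: "c"}}

	// No worker is reading here, so every item ends up in the batch.
	rest := handoverBatch(handover, items)
	fmt.Println(len(rest)) // prints 3
}
```

With a worker reading from `handover` concurrently, some items would be consumed directly and the returned batch would be shorter.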

Use of runtime.Gosched()

We've introduced runtime.Gosched() in our worker loop for scenarios when no work is immediately available. Here's why:

  1. Purpose: runtime.Gosched() yields the processor, allowing other goroutines to run.
  2. When used: After checking both the handover channel and the main queue, if no work is found.
  3. Benefits:
    • Keeps an idle worker from hogging its thread: the loop still polls, but it yields to any runnable goroutine between polls.
    • Allows other goroutines (even system/invisible ones) to run.
    • Improves overall system responsiveness and resource utilization.
  4. Performance implications:
    • Very lightweight compared to time.Sleep().
    • Minimal impact on latency when work becomes available.
    • Helps balance CPU usage across the system.

It's important to note that runtime.Gosched() doesn't block the goroutine; it simply puts it at the back of the run queue. This means our workers remain highly responsive to new work while being more cooperative with other system components.
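
A minimal sketch of what such a worker loop could look like, under a few assumptions: the main queue is modeled as a plain channel purely for illustration, and `Item`, `nextItem`, `worker`, and the `done` shutdown channel are hypothetical names rather than Zeno's API.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Item is a hypothetical stand-in for a crawl item.
type Item struct{ URL string }

// nextItem checks the handover channel first, then the queue, and never
// blocks: each select falls through to default when nothing is ready.
func nextItem(handover, queue <-chan *Item) (*Item, bool) {
	select {
	case it := <-handover:
		return it, true
	default:
	}
	select {
	case it := <-queue:
		return it, true
	default:
	}
	return nil, false
}

// worker processes items until done is closed, yielding when idle.
func worker(handover, queue <-chan *Item, done <-chan struct{}, process func(*Item)) {
	for {
		select {
		case <-done:
			return
		default:
		}
		if it, ok := nextItem(handover, queue); ok {
			process(it)
			continue
		}
		// Nothing to do right now: let other goroutines run.
		runtime.Gosched()
	}
}

func main() {
	handover := make(chan *Item, 1)
	queue := make(chan *Item, 8) // assumption: a channel models the queue
	done := make(chan struct{})

	var pending sync.WaitGroup
	pending.Add(2)
	go worker(handover, queue, done, func(it *Item) {
		fmt.Println("processed", it.URL)
		pending.Done()
	})

	handover <- &Item{URL: "https://example.com/a"}
	queue <- &Item{URL: "https://example.com/b"}
	pending.Wait()
	close(done)
}
```

The order in which the two items are printed is not guaranteed, since the worker may poll the handover channel before the first item arrives.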

equals215 commented 4 months ago

This is 11pm-type code: it's either genius tier or completely forgettable trash.

CorentinB commented 4 months ago

I'm really not sure about the Gosched, I understand the idea behind this but it makes the code way less predictable. What's the real advantage here? The predictability cost seems high.

equals215 commented 4 months ago

In what way does it make the code less predictable? It's quite literally the same as a time.Sleep() that lasts until all the other goroutines have run for a cycle. This is really common and helps goroutines share resources better.

equals215 commented 4 months ago

Why would a timeout be needed? Either there is something in the channel or there isn't; it's not a question of waiting.

CorentinB commented 4 months ago

Why would a timeout be needed? Either there is something in the channel or there isn't; it's not a question of waiting.

My bad, I forgot that a chan receive in a select with a default case doesn't block! And after more reading, I think the Gosched part is a fine idea. LGTM!