nsqio / nsq

A realtime distributed messaging platform
https://nsq.io
MIT License

Add Retry Mechanism to Handle Message Distribution Errors in NSQ #1475

Closed: bagusandrian closed this 6 months ago

bagusandrian commented 7 months ago

Issue

The current NSQ implementation does not properly handle errors that occur while a message is distributed from a topic to its channels, such as out-of-memory (OOM) or out-of-disk-space conditions. When such an error occurs, the message is lost for some of the topic's channels.

Changes Made

I have introduced a retry mechanism to address this issue. The code now retries distribution based on the value of msg.maxRetryChannel. This change aims to make message distribution more reliable and to prevent message loss when errors such as OOM or disk space exhaustion occur.

Proposed Solution

The solution checks the value of msg.maxRetryChannel and retries the message distribution accordingly, so that an error during distribution no longer causes the message to be dropped immediately.
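To make the idea concrete, here is a minimal sketch of a bounded per-channel retry. It is not the code from this PR and does not use nsqd's internal types: `Message`, `Channel`, `flakyChannel`, and `distribute` are hypothetical names for illustration, and only the `maxRetryChannel` field name comes from the description above. It also assumes that value is the number of extra attempts allowed per channel.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical stand-in for a message; only the
// maxRetryChannel field name is taken from the proposal.
type Message struct {
	ID              string
	maxRetryChannel int // assumed meaning: extra attempts per channel
}

// Channel is a hypothetical interface, not nsqd's real Channel type.
type Channel interface {
	Name() string
	PutMessage(m *Message) error
}

// flakyChannel simulates a channel whose first `failures` puts fail.
type flakyChannel struct {
	name     string
	failures int
	received []*Message
}

func (c *flakyChannel) Name() string { return c.name }

func (c *flakyChannel) PutMessage(m *Message) error {
	if c.failures > 0 {
		c.failures--
		return errors.New("simulated put failure")
	}
	c.received = append(c.received, m)
	return nil
}

// distribute offers msg to every channel. A failed put is retried up to
// msg.maxRetryChannel more times before that channel is given up on.
// It returns one error per channel that never accepted the message.
func distribute(msg *Message, channels []Channel) []error {
	var failed []error
	for _, ch := range channels {
		var err error
		for attempt := 0; attempt <= msg.maxRetryChannel; attempt++ {
			if err = ch.PutMessage(msg); err == nil {
				break
			}
		}
		if err != nil {
			failed = append(failed, fmt.Errorf("channel %s: %w", ch.Name(), err))
		}
	}
	return failed
}
```

Calling `distribute` with `maxRetryChannel: 2` delivers to a channel that fails twice and then recovers, and reports an error for a channel that keeps failing. Immediate retries of this kind only help with transient failures; if the underlying condition persists, every attempt fails and the message is still lost for that channel.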

Impact

These changes should improve the reliability of NSQ, especially in environments where OOM or disk space constraints may arise during message distribution. However, the retry mechanism may have side effects and risks of its own, which should be considered during review.

I welcome feedback and suggestions for further improvements. This enhancement aims to address a critical issue in NSQ's reliability and prevent message loss in error-prone scenarios.

mreiferson commented 6 months ago

Hi @bagusandrian, I appreciate you taking the time to submit this, but this has been a much-discussed aspect of NSQ (see #510, with many links to other threads). If we ever decide to actually invest in meaningfully changing this aspect of NSQ, I don't think that a retry mechanism is the way to go.