Possible Memory leak in dlr.py refer to delay logic

pinkdawn commented 8 months ago

In dly.py:

timer= reactor.callLater(self.config.dlr_lookup_retry_delay,
                          self.rejectMessage,
                          message=message,
                          requeue=1)

# If any, clear timer before setting a new one
if msgid in self.requeue_timers:
    timer = self.requeue_timers[msgid]  
        # this is obvious a bug, should be __timer = ...
    if timer.active():
        timer.cancel()
    del self.requeue_timers[msgid]

# Set new timer
self.requeue_timers[msgid] = timer
defer.returnValue(timer)

For 3 or 4 years, this problem keeps bother me, that the dlr service slowly memory leak.

Recently I do a memory dump and found that, the memory are full of delivery_sm messages that got: [msgid:%s] (retrials: %s/%s) DLRMapNotFound: %s Got a DLR for an unknown message id: %s (coded:%s)

And finally it leads me to the above code, it's obvious that, the timer got created, but if it's already in queue, it will be abandoned and assign to the old one !!!!!!! And the new create one will execute but waits for nothing forever.

Kisuke-CZE commented 8 months ago

Do you think you can suggest patch that can fix it (if possible create pull request in upstream) ? I do not understand that part of code. But if there is really this problem, I am interested in putting patch into packages for RHEL clones.

farirat commented 8 months ago

Hello @pinkdawn

Thank you for pointing out this, I confirm there's a bunch of users raising intermittent memory leak issues around the dlrd process, but to date, there's no established diagnotic.

From the code snippet you pinned from managers/dlr.py I can see no issue, here's my hypothesis:

A new timer (Tn) gets intitiated first,
Any old timer (To) running on same msgid is canceled and deleted,
Tn gets assigned on the msgid

On step 3, we are sure there's no orphan timer to fire at nothing.

Please explain your angle.

BlackOrder commented 6 months ago

please check my pull request. i suspect the same problem here. someone need to do some testing for the dlr manager. peace 👌

BlackOrder commented 6 months ago

confirmed, it's the same issue. But i can not solve it the same way. in redis we have to account for multiple payloads of Dlr level 1 and Dlr level 2. they share the same id. I don't have anymore time to give this. i hope someone can continue solving it.

jookies / jasmin

Possible Memory leak in dlr.py refer to delay logic #1142