netty / netty-incubator-transport-io_uring

Apache License 2.0
193 stars 38 forks source link

Remove old timeout when deadline changes #232

Closed yawkat closed 8 months ago

yawkat commented 8 months ago

I had a memory issue and @franz1981 suggested #211 as the cause. This patch is my fix for that bug, though I don't believe my mem issue was ultimately caused by this.

This PR does the legwork for adding ioringOpTimeoutRemove, and implementing a test. However two things can still be improved:

normanmaurer commented 8 months ago

I had a memory issue and @franz1981 suggested #211 as the cause. This patch is my fix for that bug, though I don't believe my mem issue was ultimately caused by this.

This PR does the legwork for adding ioringOpTimeoutRemove, and implementing a test. However two things can still be improved:

  • [ ] could use IORING_TIMEOUT_UPDATE (see Not Fired timers leak into the kernel #211) to save one sqe.
  • [ ] there may be a race in IOUringEventLoop between the addTimeout and the IORING_OP_TIMEOUT handler. If the kernel fires a deadline cqe, then we send a deadline update sqe, and only then we process the first cqe, prevDeadlineNanos is NONE even though we've submitted a new deadline. I'm not sure if this can actually happen since deadline changes should only adjust the deadline downwards, not upwards? Not sure.

Thanks a lot ! I think what you did is fine... I would defer the use of IORING_TIMEOUT_UPDATE as it would also most likely bump the kernel version requirement. Maybe we can add an issue to keep track of it

normanmaurer commented 8 months ago

@yawkat so with this said I think we can put this out of draft state ?

normanmaurer commented 8 months ago

Also seems like the new added test failed :D

yawkat commented 8 months ago

i will look at the test but it might take a few days.

i think the race condition is still an issue. id like to at least think about it a bit more, consider how it can happen.

yawkat commented 8 months ago

ive introduced a "generation counter". basically, every addTimeout counts up the generation and sends it to the kernel as the extra field. When a timeout occurs, i check that the generation matches the latest timeout that was added. This ensures that a previous timeout cannot make us think that the current timeout is finished.

franz1981 commented 8 months ago

well done @yawkat ! I like the epoch/sequencing approach

normanmaurer commented 8 months ago

@yawkat seems like there is still a test failure

yawkat commented 8 months ago

thats weird. it passes locally every time, doesnt appear flakey.

normanmaurer commented 8 months ago

thats weird. it passes locally every time, doesnt appear flakey.

did you run it in a loop ?

normanmaurer commented 8 months ago

@yawkat ping

yawkat commented 8 months ago

fixed

normanmaurer commented 8 months ago

@yawkat so I think this is now ready to be merged... correct ?

yawkat commented 8 months ago

yep, only potential improvement would be to use UPDATE, but not in this pr

normanmaurer commented 8 months ago

@yawkat ok perfect... Can you just update the commit message to match our template ?

yawkat commented 8 months ago

Can you squash merge it with this message:


Remove old timeout when deadline changes (#211) Motivation:

When the event loop timer deadline changes, the old timer would not be removed. This could potentially lead to memory issues.

Modification:

Use IORING_OP_TIMEOUT_REMOVE to remove a previously set timeout before a new one is registered. Use a generation counter to prevent race conditions between timeout add/remove in sqe/cqe.

Result:

Old timers are removed. At any time, there is only at most one timer per event loop.

normanmaurer commented 8 months ago

@yawkat done... thanks a lot !

yawkat commented 8 months ago

thanks!