panjf2000 / ants

🐜🐜🐜 ants is the most powerful and reliable pooling solution for Go.
https://ants.andypan.me/

fix multi goroutine unlock after add running cause cap > p.running no… #336

Closed. Cyan1614 closed this 1 month ago.

Cyan1614 commented 1 month ago

1. Are you opening this pull request for bug fixes, optimizations, or a new feature?

Bug fixes

2. Please describe how these code changes achieve your intention.

If multiple goroutines submit tasks concurrently, the pool unlocks before incrementing the running count, which can cause the number of running workers to exceed the pool capacity.

3. Please link to the relevant issues (if any).

4. Which documentation changes (if any) need to be made/updated because of this PR?

5. Checklist

panjf2000 commented 1 month ago

I don't quite get your point here; please provide an example illustrating this issue.

Cyan1614 commented 1 month ago

Sorry for the late reply, and thank you for your response :). Here is my demo program; my Go version is 1.22.

package main

import (
    "fmt"
    "runtime"
    "time"

    "github.com/panjf2000/ants/v2"
)

func main() {
    fmt.Println(runtime.Version())
    pool, _ := ants.NewPool(1) // capacity of 1: at most one worker should ever run
    for i := 0; i < 10; i++ {
        go func() {
            _ = pool.Submit(func() {
                fmt.Println(time.Now().Unix())
                time.Sleep(1 * time.Second)
            })
        }()
    }
    // Wait for all goroutines to finish.
    time.Sleep(10 * time.Second)
}

The result sometimes shows the same timestamp twice, and sometimes the timestamps come one second apart, one by one. To reproduce this case, you may need to run the program several times.

The cause is that retrieveWorker (called from Submit) first checks capacity > p.Running(), then unlocks, and only afterwards increments the running count. Under concurrency the running count lags behind the check, so another goroutine can pass the capacity check and spawn an extra worker. Incrementing the running count before unlocking would make this concurrency-safe.
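
To make the window concrete, here is a minimal, self-contained sketch of the pattern being described. The racyPool type and its fields are hypothetical stand-ins, not the actual ants code:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// racyPool is a hypothetical, stripped-down pool used only to show the
// interleaving; it is not the real ants implementation.
type racyPool struct {
    mu       sync.Mutex
    capacity int32
    running  atomic.Int32
}

func (p *racyPool) submit(task func()) {
    p.mu.Lock()
    if p.running.Load() >= p.capacity { // capacity check runs under the lock...
        p.mu.Unlock()
        return // pool is full; drop the task in this sketch
    }
    p.mu.Unlock() // ...but the lock is released BEFORE running is bumped,
    // so a second goroutine can pass the check in this window and the
    // pool ends up with more workers than its capacity allows.
    p.running.Add(1)
    go func() {
        defer p.running.Add(-1)
        task()
    }()
}

func main() {
    p := &racyPool{capacity: 1}
    for i := 0; i < 10; i++ {
        go p.submit(func() {
            fmt.Println(time.Now().Unix()) // duplicate timestamps mean more than one worker ran
            time.Sleep(time.Second)
        })
    }
    time.Sleep(3 * time.Second)
}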

panjf2000 commented 1 month ago

Are you Chinese? I still don't understand the problem you're describing. What concrete consequences does it have? What would actually go wrong in the program? Could you be more specific?

Cyan1614 commented 1 month ago

Oh, great, we can use Chinese. Let me restate the problem.

Precondition: there are no running workers in the pool.

Problem: when the pool's cap is 1, multiple goroutines concurrently calling Submit to run a user-defined func are not guaranteed to end up with exactly one running worker.

Possible consequence: cap is 1, yet multiple workers run the user-defined func.

Cause: on the first worker retrieval, the pool checks the running count against cap, then unlocks, and only then increments the running count. Under concurrency, before that increment completes, another goroutine can pass the capacity check and create one more worker.

panjf2000 commented 1 month ago

But this PR can't actually fix the problem either. Even if the worker is started before unlocking, the runtime won't necessarily schedule the worker goroutine right away, so the worker count can still overflow. Moving pool.addRunning() out of the worker code into retrieveWorker() should truly fix it, but I'm not sure it's necessary: this situation is very unlikely, and even if the worker count overflows slightly under high concurrency, that is a kind of burst, which is a fairly common strategy in high-concurrency scenarios, so I think it's fine as it is.
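
For illustration, a hedged sketch of the fix suggested above, reusing the hypothetical racyPool type from the earlier sketch (again, not the actual ants code): the counter is incremented while the lock is still held, so no other goroutine can pass the capacity check before the slot is reserved. This mirrors moving pool.addRunning() into retrieveWorker().

// fixedSubmit reserves the slot before releasing the lock, closing the
// window between the capacity check and the counter update.
func (p *racyPool) fixedSubmit(task func()) {
    p.mu.Lock()
    if p.running.Load() >= p.capacity {
        p.mu.Unlock()
        return // pool is full
    }
    p.running.Add(1) // increment BEFORE unlocking: the slot is now reserved
    p.mu.Unlock()
    go func() {
        defer p.running.Add(-1) // free the slot when the worker finishes
        task()
    }()
}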

Cyan1614 commented 1 month ago

Starting the worker before unlocking makes the running count concurrency-safe and guarantees that only one worker is initialized under concurrency; it doesn't depend on runtime scheduling. It would fix the worker overflow problem.

I agree that a slight overflow is a reasonable burst strategy under high concurrency, but if a user wants to strictly limit the number of running goroutines, the current approach doesn't solve their problem. Anyway, we can leave this open and see whether other users also need strict control over the number of running goroutines.
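
(For reference, a strict limit can also be enforced outside the pool. The following is a hypothetical workaround, not part of the ants API: a buffered channel used as a counting semaphore, so at most cap(sem) tasks ever run at once, regardless of how submissions interleave.)

package main

import (
    "fmt"
    "time"
)

func main() {
    sem := make(chan struct{}, 1) // strict limit of 1 concurrent task
    for i := 0; i < 10; i++ {
        go func() {
            sem <- struct{}{}        // acquire a slot; blocks when the limit is hit
            defer func() { <-sem }() // release the slot
            fmt.Println(time.Now().Unix())
            time.Sleep(time.Second)
        }()
    }
    time.Sleep(12 * time.Second) // crude wait so all tasks can finish
}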

Thanks for the reply.