superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] Worker deadlock when processing domain imports #2350

Open Sentynel opened 10 months ago

Sentynel commented 10 months ago

Describe the bug

Importing a domain blocklist covering several hundred domains causes my GtS server to stop processing any tasks on the worker pool. Processing of the domain blocklist ceases partway through, and subsequent incoming statuses etc. are not processed.

This is because the way we handle domain blocks uses the worker pool queue APIs unsafely and can deadlock if the queue fills up. A stack trace of one of the stuck workers looks like this:

goroutine 19 [select]:
runtime.gopark(0xc000ae73e0?, 0x3?, 0x50?, 0xe7?, 0xc000ae73ca?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000ae7270 sp=0xc000ae7250 pc=0x43c2ce
runtime.selectgo(0xc000ae73e0, 0xc000ae73c4, 0x38?, 0x1, 0xc001acc800?, 0x1)
    /usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc000ae7390 sp=0xc000ae7270 pc=0x44bd65
codeberg.org/gruf/go-runners.(*WorkerPool).MustEnqueueCtx(0xc000517390, {0x2b968d8?, 0xc00074a0c0?}, 0xc002226b80)
    /drone/src/vendor/codeberg.org/gruf/go-runners/pool.go:169 +0xce fp=0xc000ae7420 sp=0xc000ae7390 pc=0xa98b6e
github.com/superseriousbusiness/gotosocial/internal/processing/workers.(*Processor).EnqueueClientAPI(0xc0005161f0, {0x2b968d8?, 0xc00074a0c0}, {0xc0019df000, 0x29, 0x40})
    /drone/src/internal/processing/workers/fromclientapi.go:51 +0xd4 fp=0xc000ae7458 sp=0xc000ae7420 pc=0x200c6f4
github.com/superseriousbusiness/gotosocial/internal/processing/workers.(*Processor).EnqueueClientAPI-fm({0x2b968d8?, 0xc00074a0c0?}, {0xc0019df000?, 0x3f7a420?, 0x2461700?})
    <autogenerated>:1 +0x45 fp=0xc000ae7498 sp=0xc000ae7458 pc=0x2140ec5
github.com/superseriousbusiness/gotosocial/internal/processing/account.(*Processor).deleteAccountStatuses(0xc0008fc600, {0x2b968d8, 0xc00074a0c0}, 0xc001865180)
    /drone/src/internal/processing/account/delete.go:382 +0x1a7 fp=0xc000ae7710 sp=0xc000ae7498 pc=0x1fd48c7
github.com/superseriousbusiness/gotosocial/internal/processing/account.(*Processor).Delete(0xc0008fc600, {0x2b968d8, 0xc00074a0c0}, 0xc001865180, {0xc00130edc0, 0x1a})
    /drone/src/internal/processing/account/delete.go:57 +0x2dc fp=0xc000ae7818 sp=0xc000ae7710 pc=0x1fd2f9c
github.com/superseriousbusiness/gotosocial/internal/processing/workers.(*clientAPI).DeleteAccount(0xc000711740, {0x2b968d8, 0xc00074a0c0}, {{0x2608680, 0x6}, {0x260898c, 0x6}, {0x2485860, 0xc00146ebe0}, 0xc001865180, ...})
    /drone/src/internal/processing/workers/fromclientapi.go:544 +0x105 fp=0xc000ae7888 sp=0xc000ae7818 pc=0x200f8a5
github.com/superseriousbusiness/gotosocial/internal/processing/workers.(*Processor).ProcessFromClientAPI(0xc0005161f0, {0x2b968d8, 0xc00074a0c0}, {{0x2608680, 0x6}, {0x260898c, 0x6}, {0x2485860, 0xc00146ebe0}, 0xc001865180, ...})
    /drone/src/internal/processing/workers/fromclientapi.go:179 +0xaf3 fp=0xc000ae79f0 sp=0xc000ae7888 pc=0x200d4d3
github.com/superseriousbusiness/gotosocial/internal/processing/workers.(*Processor).ProcessFromClientAPI-fm({0x2b968d8?, 0xc00074a0c0?}, {{0x2608680, 0x6}, {0x260898c, 0x6}, {0x2485860, 0xc00146ebe0}, 0xc001865180, 0xc001865180})
    <autogenerated>:1 +0x66 fp=0xc000ae7a58 sp=0xc000ae79f0 pc=0x2141026
github.com/superseriousbusiness/gotosocial/internal/processing/admin.(*Processor).domainBlockSideEffects.func1(0x0?)
    /drone/src/internal/processing/admin/domainblock.go:154 +0xcf fp=0xc000ae7b10 sp=0xc000ae7a58 pc=0x1fe200f
github.com/superseriousbusiness/gotosocial/internal/processing/admin.(*Processor).rangeDomainAccounts(0xc000516078, {0x2b968d8, 0xc00074a0c0}, {0xc00111f8f0, 0xd}, 0xc000ae7c78)
    /drone/src/internal/processing/admin/util.go:96 +0x194 fp=0xc000ae7ba8 sp=0xc000ae7b10 pc=0x1feaf14
github.com/superseriousbusiness/gotosocial/internal/processing/admin.(*Processor).domainBlockSideEffects(0xc000516078, {0x2b968d8, 0xc00074a0c0}, 0xc00146ebe0)
    /drone/src/internal/processing/admin/domainblock.go:145 +0x2c8 fp=0xc000ae7cb8 sp=0xc000ae7ba8 pc=0x1fe1cc8
github.com/superseriousbusiness/gotosocial/internal/processing/admin.(*Processor).createDomainBlock.func1({0x2b968d8, 0xc00074a0c0})
    /drone/src/internal/processing/admin/domainblock.go:98 +0x31b fp=0xc000ae7e58 sp=0xc000ae7cb8 pc=0x1fe18db
github.com/superseriousbusiness/gotosocial/internal/processing/admin.(*Actions).Run.func1({0x2b968d8, 0xc00074a0c0})
    /drone/src/internal/processing/admin/actions.go:103 +0x63 fp=0xc000ae7f48 sp=0xc000ae7e58 pc=0x1fdef03
codeberg.org/gruf/go-runners.worker_run({0x2b968d8, 0xc00074a0c0}, 0x0?)
    /drone/src/vendor/codeberg.org/gruf/go-runners/pool.go:237 +0x59 fp=0xc000ae7f88 sp=0xc000ae7f48 pc=0xa98e99
codeberg.org/gruf/go-runners.(*WorkerPool).Start.func1.2()
    /drone/src/vendor/codeberg.org/gruf/go-runners/pool.go:71 +0x72 fp=0xc000ae7fe0 sp=0xc000ae7f88 pc=0xa986b2
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000ae7fe8 sp=0xc000ae7fe0 pc=0x46c3e1
created by codeberg.org/gruf/go-runners.(*WorkerPool).Start.func1 in goroutine 13
    /drone/src/vendor/codeberg.org/gruf/go-runners/pool.go:67 +0x89

Note that deleteAccountStatuses(), which loops over all statuses associated with the account currently being deleted and queues a new task to remove each one, calls MustEnqueueCtx() (via EnqueueClientAPI()). This is unsafe because we are already in a worker context at that point: inside MustEnqueueCtx() we attempt to push a new function into a buffered channel of finite size while our own function is still occupying a worker. If this happens on all of the workers simultaneously, we deadlock, with every worker waiting for somebody to pop a function off the worker pool queue to free up space.
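
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the same pattern. The pool type and MustEnqueue() below are simplified stand-ins, not the real go-runners API:

```go
package main

import "sync"

// pool is a simplified stand-in for a bounded worker pool: a fixed number
// of workers draining a buffered channel of functions.
type pool struct {
	queue chan func()
}

func newPool(workers, queueSize int) *pool {
	p := &pool{queue: make(chan func(), queueSize)}
	for i := 0; i < workers; i++ {
		go func() {
			for fn := range p.queue {
				fn()
			}
		}()
	}
	return p
}

// MustEnqueue blocks until queue space is available, analogous to
// MustEnqueueCtx() in go-runners.
func (p *pool) MustEnqueue(fn func()) {
	p.queue <- fn
}

func main() {
	p := newPool(2, 2) // 2 workers, queue capacity 2

	var wg sync.WaitGroup
	wg.Add(2)

	// Each task enqueues many follow-up tasks, mirroring
	// deleteAccountStatuses() enqueuing one deletion per status. Once the
	// queue is full and every worker is blocked inside MustEnqueue(),
	// nothing is left to drain the queue, and the runtime reports:
	// "fatal error: all goroutines are asleep - deadlock!"
	for i := 0; i < 2; i++ {
		p.MustEnqueue(func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				p.MustEnqueue(func() {}) // blocks forever once the queue fills
			}
		})
	}

	wg.Wait() // never returns
}
```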

We should ensure that we never call MustEnqueueCtx() from a worker context. In this particular case we can solve the issue by processing status deletions within our own task rather than enqueuing new ones, which also avoids flooding the task queue. The same pattern occurs for follows processing in the same file.
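
A minimal sketch of that fix, using illustrative names (Status, deleteStatus) that do not correspond to the actual GoToSocial code: each deletion is handled inline within the task we are already running, so we never block on a full queue.

```go
package main

import (
	"context"
	"fmt"
)

// Status is an illustrative stand-in for a status record.
type Status struct{ ID string }

// deleteStatus is a placeholder for the real per-status deletion logic.
func deleteStatus(ctx context.Context, s *Status) error {
	fmt.Println("deleted status", s.ID)
	return nil
}

// deleteAccountStatuses processes each deletion inline, inside the task
// that is already running, instead of enqueuing one new worker task per
// status (which could block forever on a full queue).
func deleteAccountStatuses(ctx context.Context, statuses []*Status) error {
	for _, s := range statuses {
		if err := deleteStatus(ctx, s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	statuses := []*Status{{ID: "01A"}, {ID: "01B"}}
	if err := deleteAccountStatuses(context.Background(), statuses); err != nil {
		panic(err)
	}
}
```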

What's your GoToSocial Version?

v0.12.0

GoToSocial Arch

No response

What happened?

No response

What you expected to happen?

No response

How to reproduce it?

No response

Anything else we need to know?

No response

tsmethurst commented 10 months ago

Thank you for the excellent investigation and writeup :)

NyaaaWhatsUpDoc commented 9 months ago

This is partially fixed by tobi's PR above ^

Though, as has been determined, there's still a possible issue: enqueuing async work from within a worker function can cause a deadlock if the worker queues happen to be full. I'm working on a fix for this in the go-runners package.
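
One possible shape for such a fix, purely as a sketch reusing the simplified pool type from the earlier example (not the actual go-runners change): attempt a non-blocking enqueue, and fall back to running the function out-of-band when the queue is full.

```go
package workers

// pool mirrors the simplified bounded worker pool from the earlier sketch.
type pool struct {
	queue chan func()
}

// EnqueueOrSpawn tries a non-blocking enqueue. If the queue is full, it
// runs fn in a fresh goroutine rather than blocking the caller: the caller
// may itself be a worker, and blocking it would prevent the queue from
// ever draining.
func (p *pool) EnqueueOrSpawn(fn func()) {
	select {
	case p.queue <- fn:
		// Queued; a worker will pick it up.
	default:
		// Queue full: run out-of-band instead of blocking.
		go fn()
	}
}
```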

tsmethurst commented 9 months ago

Removing this from the current milestone since we fixed the most egregious example for 0.13.0; we can work on a more thorough fix for 0.14.0.