misskey-dev / misskey

🌎 A completely free and open interplanetary microblogging platform 🚀
https://misskey-hub.net/
GNU Affero General Public License v3.0
10.1k stars 1.38k forks source link

Failing verification of LD-Signature does not remove it from inbox #14587

Open eternal-flame-AD opened 2 months ago

eternal-flame-AD commented 2 months ago

💡 Summary

"skip: LD-Signatureの検証に失敗しました" was marked as a "unrecoverableerror" but doesn't remove the message from inbox pipeline, stalling it.

🥰 Expected Behavior

If an inbox message can't be accepted, it should be removed from the pipeline immediately (or at least a retry that happens later in time) so as other messages don't get starved of workers.

🤬 Actual Behavior

I found today my instance has a >8h delay in receiving external notes. I went to see the Bull Dashboard and say 9000 pending inbox messages and growing. The CPU and memory utility was low. I investigated the running jobs and found they have been stalled for ~20 minutes. All show an error but are still in the "running state" (see reproduction section). Probably this stalled the pipeline causing the pile up.

Someone on misskey suggested it might be retrying behavior but (1) If it's a signature mismatch it should be permanent error (2) No job should take up a slot in the workers for 25 minutes without releasing it to someone else first.

Deleting the offending server alone did not resolve the issue (temporarily unblocks it but get's blocked again quickly), I need to manually purge the redis inbox of that relay for the problem to resolve.

Screenshot: https://mi.yumechi.jp/notes/9ydk0fspg51f000e (sorry I forgot to screenshot the "Active" tab, but it was the 16 jobs on the Active tab that got stalled.

📝 Steps to Reproduce

I don't know how to inject events manually into the queue so not sure how to exacly reproduce it again but I saved the event json, and the error on the Bull Dashboard. (I see an option to manually put in a JSON data+options but I don't know the underlying logic of misskey so wasn't comfortable just putting it in on my real instance, however I am will to try it out if someone on the team could give me a JSON to put in)

All activities stalled on the queue look like this (it's different events but same type and data structure, from the same relay).

{
  "jobData": {
    "activity": {
      "@context": [
        "https://www.w3.org/ns/activitystreams",
        "https://w3id.org/security/v1"
      ],
      "actor": "https://relay.infosec.exchange/actor",
      "id": "https://relay.infosec.exchange/activity/f390340e-a6d0-4b5b-83c4-130e025d770f",
      "object": "https://e-komik.org/chapter/413697",
      "to": [
        "https://relay.infosec.exchange/followers"
      ],
      "type": "Announce"
    },
    "signature": {
      "scheme": "Signature",
      "params": {
        "keyId": "https://relay.infosec.exchange/actor#main-key",
        "algorithm": "hs2019",
        "headers": [
          "(request-target)",
          "accept",
          "content-type",
          "date",
          "digest",
          "host"
        ],
        "signature": "pYJrmyxIXqEFqpJl/BZEyiNKcESqvrLNCjoHC7nWYk2cs51sC4EsdVIkBKYtJOVSwjC2F6s9YGixUiYMQs1P2NN6dBXWZzCz6EcXqjg/4wa83/a4sXi41ZTe4Zg25FtzrP4nmKcSXWnpqC0Yj9M1+Jhp066KRl4g948rN6YRG8UbCeqfyBO+3GsszlUspMy2it4MpSYUZkiYDgltNcsjtiJ6fQvcI7wYzTYPJU/YM1FRgxWJRNHrYQbUvuhZhbFXUDQgsxxA1+ntwv6umu4sDZSmSI9Kz2rq+fdsQseOEeVO7OjPF/SmFQaVBPPw13WFH3BA0G35nUCh/ZRnYQ5h0TInKG7C0bkw16V1aN16MOscz4ut3x1g9nxP5A+PerGIXjza5fRKrPdzcTOcPe6yRaR+N61OBEZLqSy77Z1ektCj4rhHY+HqJQwJWvP1vuCry0Jhukrp4pZOhJXePOBUy2rGqRkdTy7nMjh+5HVMwOywxM/lRyHrsDVG+iG/SFK7esSibZ491mH32lF1w+6CW5UCDgjsU0Qe3BXslXEkWiSJW57U+5IEvhgcwfKQnhaOY9xInFYSIAbyENnDvCNQKQM7lr4eLSXZwHTFHyLe0juoW7ClXkr4IGsEw1XY1xBUPpq6PbMklmZi/jwSjuCfhR1Hzq/IzVqY4R0e3rTx8kM="
      },
      "signingString": "(request-target): post /inbox\naccept: application/activity+json\ncontent-type: application/activity+json\ndate: Thu, 19 Sep 2024 18:12:52 GMT\ndigest: SHA-256=IWErZ8d8jaeErrZ7ICJ/l+3lfdNDR6w2Z9hbakNtSLY=\nhost: mi.yumechi.jp",
      "algorithm": "HS2019",
      "keyId": "https://relay.infosec.exchange/actor#main-key"
    }
  },
  "returnValue": null
}

Bulls dashboard shows this error:

Error: quote resolve failed
    at ApNoteService.createNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:239:27)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ApNoteService.resolveNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:324:20)
    at async ApNoteService.createNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:197:40)
    at async ApNoteService.resolveNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:324:20)
    at async ApInboxService.announceNote (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:282:26)
    at async ApInboxService.announce (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:261:36)
    at async ApInboxService.performOneActivity (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:165:20)
    at async ApInboxService.performActivity (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:134:22)
    at async InboxProcessorService.process (file:///misskey/packages/backend/built/queue/processors/InboxProcessorService.js:187:28)
    at async Worker.processJob (/misskey/node_modules/.pnpm/bullmq@5.10.4/node_modules/bullmq/dist/cjs/classes/worker.js:445:28)
    at async Worker.retryIfFailed (/misskey/node_modules/.pnpm/bullmq@5.10.4/node_modules/bullmq/dist/cjs/classes/worker.js:634:24)

  e: {
    stack: 'UnrecoverableError: skip: LD-Signatureの検証に失敗しました\n' +
      '    at InboxProcessorService.process (file:///misskey/packages/backend/built/queue/processors/InboxProcessorService.js:1
32:27)\n' +
      '    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n' +
      '    at async Worker.processJob (/misskey/node_modules/.pnpm/bullmq@5.10.4/node_modules/bullmq/dist/cjs/classes/worker.js
:445:28)\n' +
      '    at async Worker.retryIfFailed (/misskey/node_modules/.pnpm/bullmq@5.10.4/node_modules/bullmq/dist/cjs/classes/worker
.js:634:24)',
    message: 'skip: LD-Signatureの検証に失敗しました',
    name: 'UnrecoverableError'
  }
}

docker and docker-compose logs for some reason stops working as soon as it's stalled ... i hope the above is enough information.

💻 Frontend Environment

Not a frontend problem. Firefox and aria both show delayed messages. 

git describe=2024.8.0-2-g882c8b93c

My instance is https://mi.yumechi.jp/

Here's note featuring some screenshots: https://mi.yumechi.jp/notes/9ydpeizaxdq8000f https://mi.yumechi.jp/notes/9ydpeizaxdq8000f

🛰 Backend Environment (for server admin)

Deployed with official docker-compose file, redis:7-alpine, postgres:15-alpine.

git describe=2024.8.0-2-g882c8b93c

OS is synology DSM, kernel 4.4, x86_64

Do you want to address this bug yourself?

I don't have time looking at the source today but I am willing to submit a PR if it is within my capabilities.

CHN-beta commented 2 months ago

Same issue.

turkeysanwich commented 2 months ago

I had the same issue, stuff from e-komik.org through a relay stalled the whole queue. blocking the relay didn't work but blocking e-komik.org seems to have cleared it up.

KisaragiEffective commented 2 months ago

permalink: https://github.com/misskey-dev/misskey/blob/882c8b93c164f2841c88b58034ed58d2553a375c/packages/backend/src/queue/processors/InboxProcessorService.ts#L142

eternal-flame-AD commented 2 months ago

Update: I read the source and could not find exactly where the issue was. However I see similar code paths to 'skip: LD-Signatureの検証に失敗しました' in today's log after I removed the offending relay and there was no stall. It seems to be something specific to the e-komik event.

Maybe add a hard timeout to the jobs, I think if one job takes more than 1 minute most instance owners would rather not receive it? ref: https://docs.bullmq.io/patterns/timeout-jobs