matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0
5.67k stars 664 forks source link

State migration fails while rewriting snapshots after upgrading to v0.4.0 #1924

Closed jellykells closed 2 years ago

jellykells commented 3 years ago

Background information

Description

I'm not sure how to interpret this error. Clearly the state storage upgrade did not complete, despite the log saying it did.

Jul 15 22:54:01 dendrite-host dendrite-monolith-server[169231]: time="2021-07-16T03:54:01.897195913Z" level=warning msg="Rewriting snapshots 3000-3100 of 21375..." func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:140]"
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: time="2021-07-16T03:54:02.667781275Z" level=warning msg="State storage upgrade complete" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:178]"
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: time="2021-07-16T03:54:02.679749124Z" level=panic msg="failed to connect to room server db" func="NewInternalAPI\n\t" file=" [roomserver.go:54]" error="RunDeltas: Failed run migration: ERROR 2021041615092700_state_blocks_refactor.go: failed to run Go migration function func(*sql.Tx) error: cannot xref null state block with snapshot 13936: sql: no rows in result set"
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: panic: (*logrus.Entry) (0x1261380,0xc00011e230)
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: goroutine 1 [running]:
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: github.com/sirupsen/logrus.(*Entry).log(0xc00011e1c0, 0x0, 0xc000198240, 0x23)
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]:         github.com/sirupsen/logrus@v1.8.0/entry.go:259 +0x2e2
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: github.com/sirupsen/logrus.(*Entry).Log(0xc00011e1c0, 0x0, 0xc000271b88, 0x1, 0x1)
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]:         github.com/sirupsen/logrus@v1.8.0/entry.go:285 +0x86
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: github.com/sirupsen/logrus.(*Entry).Logf(0xc00011e1c0, 0x0, 0x1297d7d, 0x23, 0x0, 0x0, 0x0)
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]:         github.com/sirupsen/logrus@v1.8.0/entry.go:330 +0xe2
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: github.com/sirupsen/logrus.(*Entry).Panicf(...)
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]:         github.com/sirupsen/logrus@v1.8.0/entry.go:368
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: github.com/matrix-org/dendrite/roomserver.NewInternalAPI(0xc00011c630, 0x146f860, 0xc0002a1cb0, 0xc0000c7580, 0x149ad00)
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]:         github.com/matrix-org/dendrite@/roomserver/roomserver.go:54 +0x237
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]: main.main()
Jul 15 22:54:02 dendrite-host dendrite-monolith-server[169231]:         github.com/matrix-org/dendrite@/cmd/dendrite-monolith-server/main.go:86 +0x27c
Jul 15 22:54:02 dendrite-host systemd[1]: dendrite.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 22:54:02 dendrite-host systemd[1]: dendrite.service: Failed with result 'exit-code'.

Steps to reproduce

SkyFox42 commented 3 years ago

I am seeing the same error. Also running in monolith mode with Postgres and no Docker, but Go version is 1.16.4

Jul 16 22:11:01 host dendrite-monolith-server[26160]: time="2021-07-16T20:11:01.618840142Z" level=warning msg="Rewriting snapshots 7500-7600 of 31277..." func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:140]"
Jul 16 22:11:01 host dendrite-monolith-server[26160]: time="2021-07-16T20:11:01.674271029Z" level=warning msg="State storage upgrade complete" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:178]"
Jul 16 22:11:01 host dendrite-monolith-server[26160]: time="2021-07-16T20:11:01.681480025Z" level=panic msg="failed to connect to room server db" func="NewInternalAPI\n\t" file=" [roomserver.go:54]" error="RunDeltas: Failed run migration: ERROR 2021041615092700_state_blocks_refactor.go: failed to run Go migration function func(*sql.Tx) error: cannot xref null state block with snapshot 7503: sql: no rows in result set"
Jul 16 22:11:01 host dendrite-monolith-server[26160]: panic: (*logrus.Entry) 0xc00031c230
Jul 16 22:11:01 host dendrite-monolith-server[26160]: goroutine 1 [running]:
Jul 16 22:11:01 host dendrite-monolith-server[26160]: github.com/sirupsen/logrus.(*Entry).log(0xc00031c1c0, 0x0, 0xc00026c270, 0x23)
Jul 16 22:11:01 host dendrite-monolith-server[26160]:         github.com/sirupsen/logrus@v1.8.0/entry.go:259 +0x2e5
Jul 16 22:11:01 host dendrite-monolith-server[26160]: github.com/sirupsen/logrus.(*Entry).Log(0xc00031c1c0, 0x0, 0xc0000f7c08, 0x1, 0x1)
Jul 16 22:11:01 host dendrite-monolith-server[26160]:         github.com/sirupsen/logrus@v1.8.0/entry.go:285 +0x86
Jul 16 22:11:01 host dendrite-monolith-server[26160]: github.com/sirupsen/logrus.(*Entry).Logf(0xc00031c1c0, 0x0, 0x11d5886, 0x23, 0x0, 0x0, 0x0)
Jul 16 22:11:01 host dendrite-monolith-server[26160]:         github.com/sirupsen/logrus@v1.8.0/entry.go:330 +0xe5
Jul 16 22:11:01 host dendrite-monolith-server[26160]: github.com/sirupsen/logrus.(*Entry).Panicf(...)
Jul 16 22:11:01 host dendrite-monolith-server[26160]:         github.com/sirupsen/logrus@v1.8.0/entry.go:368
Jul 16 22:11:01 host dendrite-monolith-server[26160]: github.com/matrix-org/dendrite/roomserver.NewInternalAPI(0xc000258ea0, 0x1332060, 0xc000126420, 0xc000204800, 0x1353078)
Jul 16 22:11:01 host dendrite-monolith-server[26160]:         github.com/matrix-org/dendrite/roomserver/roomserver.go:54 +0x20c
Jul 16 22:11:01 host dendrite-monolith-server[26160]: main.main()
Jul 16 22:11:01 host dendrite-monolith-server[26160]:         github.com/matrix-org/dendrite/cmd/dendrite-monolith-server/main.go:86 +0x271
Jul 16 22:11:01 host systemd[1]: dendrite.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 16 22:11:01 host systemd[1]: dendrite.service: Failed with result 'exit-code'.
kegsay commented 3 years ago

So this error can happen when converting rooms based on Dendrite v0.1.0 which didn't correctly set the state block for m.room.create events. If you're getting this error that implies that there is a bad state snapshot for an event that isn't the m.room.create event, which is unusual and unexpected.

To fix this, you can delete the offending snapshot entirely: though this may result in room state errors later down the line for this specific room. I would be interested in the output to the following command:

SELECT * FROM roomserver_events WHERE state_snapshot_nid = 1234;

where 1234 is the offending snapshot NID (7503 or 13936 in this case). This information shouldn't be sensitive so can be inserted into this issue.

SkyFox42 commented 3 years ago

I ran the command and the output is empty:

dendrite_room_server=# SELECT * FROM roomserver_events WHERE state_snapshot_nid = 7503;
 event_nid | room_nid | event_type_nid | event_state_key_nid | sent_to_output | state_snapshot_nid | depth | event_id | reference_sha256 | auth_event_nids | is_rejected 
-----------+----------+----------------+---------------------+----------------+--------------------+-------+----------+------------------+-----------------+-------------
(0 rows)
kegsay commented 3 years ago

Okay, so I can modify this check to not error out if there are no events. I don't know how the database got into this state (as it implies you have snapshots which aren't referenced by any events), but that shouldn't be fatal. Expect a patch on Monday on master if you don't mind experimenting with it (back up your database first!).

jellykells commented 3 years ago

@kegsay I upgraded to v0.4.1 and while the state migration got further this time, it failed with a new error while rewriting the last batch. Is there a way to determine which event failed to be converted?

Jul 26 22:18:42 dendrite-host dendrite-monolith-server[7399]: time="2021-07-27T03:18:42.875641660Z" level=warning msg="Rewriting snapshots 21300-21400 of 21375..." func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:140]"
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: time="2021-07-27T03:18:47.771193581Z" level=warning msg="State storage upgrade complete" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:247]"
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: time="2021-07-27T03:18:47.890564448Z" level=panic msg="failed to connect to room server db" func="NewInternalAPI\n\t" file=" [roomserver.go:54]" error="RunDeltas: Failed run migration: ERROR 2021041615092700_state_blocks_refactor.go: failed to run Go migration function func(*sql.Tx) error: 1 events exist in roomserver_events which have not been converted to a new state_snapshot_nid; this is a bug, please report"
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: panic: (*logrus.Entry) (0x126ee00,0xc0001701c0)
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: goroutine 1 [running]:
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: github.com/sirupsen/logrus.(*Entry).log(0xc000170150, 0x0, 0xc00003a1e0, 0x23)
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]:         github.com/sirupsen/logrus@v1.8.1/entry.go:259 +0x2e2
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: github.com/sirupsen/logrus.(*Entry).Log(0xc000170150, 0x0, 0xc000283b88, 0x1, 0x1)
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]:         github.com/sirupsen/logrus@v1.8.1/entry.go:293 +0x86
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: github.com/sirupsen/logrus.(*Entry).Logf(0xc000170150, 0x0, 0x12a5a6c, 0x23, 0x0, 0x0, 0x0)
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]:         github.com/sirupsen/logrus@v1.8.1/entry.go:338 +0xe2
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: github.com/sirupsen/logrus.(*Entry).Panicf(...)
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]:         github.com/sirupsen/logrus@v1.8.1/entry.go:376
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: github.com/matrix-org/dendrite/roomserver.NewInternalAPI(0xc00022e240, 0x1483fc0, 0xc0002f8840, 0xc000235400, 0x14af7c0)
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]:         github.com/matrix-org/dendrite@/roomserver/roomserver.go:54 +0x237
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]: main.main()
Jul 26 22:18:47 dendrite-host dendrite-monolith-server[7399]:         github.com/matrix-org/dendrite@/cmd/dendrite-monolith-server/main.go:88 +0x27c
Jul 26 22:18:47 dendrite-host systemd[1]: dendrite.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 26 22:18:47 dendrite-host systemd[1]: dendrite.service: Failed with result 'exit-code'.
kegsay commented 3 years ago

Yes, I can add additional logging here to help debug this issue.

anton-molyboha commented 3 years ago

I have the same issue.

(And when I was trying to upgrade to v0.4.0, I had the same issue as in the original post)

kegsay commented 3 years ago

1960 should fix the issue for SQLite users. We're investigating fixes for Postgres users.

anton-molyboha commented 3 years ago

I upgraded to 4cc8b28b7f341c21a0bf7a261d6b49e8276e1e99 and migration is failing with a new error this time.

WARN[2021-08-04T20:59:44.817088875Z] [2021041615092700_state_blocks_refactor.go:152] UpStateBlocksRefactor
         Rewriting snapshots 20600-20700 of 20675...
WARN[2021-08-04T21:00:06.717207316Z] [2021041615092700_state_blocks_refactor.go:265] UpStateBlocksRefactor
         State storage upgrade complete
PANI[2021-08-04T21:00:06.738043629Z] [roomserver.go:54] NewInternalAPI
         failed to connect to room server db           error="RunDeltas: Failed run migration: ERROR 2021041615092700_state_blocks_refactor.go: failed to run Go migration function func(*sql.Tx) error: 1 rooms exist in roomserver_rooms which have not been converted to a new state_snapshot_nid; this is a bug, please report"
panic: (*logrus.Entry) 0xc0002481c0

goroutine 1 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc000248150, 0x0, 0xc001574660, 0x23)
        github.com/sirupsen/logrus@v1.8.1/entry.go:259 +0x2e5
github.com/sirupsen/logrus.(*Entry).Log(0xc000248150, 0x0, 0xc0000f5bf0, 0x1, 0x1)
        github.com/sirupsen/logrus@v1.8.1/entry.go:293 +0x86
github.com/sirupsen/logrus.(*Entry).Logf(0xc000248150, 0x0, 0x1245bfe, 0x23, 0x0, 0x0, 0x0)
        github.com/sirupsen/logrus@v1.8.1/entry.go:338 +0xe5
github.com/sirupsen/logrus.(*Entry).Panicf(...)
        github.com/sirupsen/logrus@v1.8.1/entry.go:376
github.com/matrix-org/dendrite/roomserver.NewInternalAPI(0xc0000eb320, 0x13ac080, 0xc000276e10, 0xc000115f00, 0x13ce7c8)
        github.com/matrix-org/dendrite/roomserver/roomserver.go:54 +0x20c
main.main()
        github.com/matrix-org/dendrite/cmd/dendrite-monolith-server/main.go:89 +0x271
kegsay commented 3 years ago

@anton-molyboha on postgres or sqlite?

anton-molyboha commented 3 years ago

@anton-molyboha on postgres or sqlite?

Postgres.

jellykells commented 3 years ago

I'm also still seeing the same error using 4cc8b28. Running with Go 1.16.5 instead of 1.13.8 now

Edit: The same error that I last posted, not the error anton posted.

kegsay commented 3 years ago

@jellykells Are you positive? There are a few similar looking messages (the table names vary) so can you double check it?

I might need some more input on your precise DB to help debug this further.

jellykells commented 3 years ago

@kegsay Yes I just double checked. I am running commit 4cc8b28b:

Aug 09 07:22:52 dendrite-host systemd[1]: Started Dendrite Matrix server.
Aug 09 07:22:52 dendrite-host dendrite-monolith-server[160547]: time="2021-08-09T12:22:52.304521683Z" level=info msg="Dendrite version 0.4.1+4cc8b28b" func="NewBaseDendrite\n\t" file=" [base.go:113]"

The same error occurs:

Aug 09 07:52:05 dendrite-host dendrite-monolith-server[160547]: time="2021-08-09T12:52:05.624190350Z" level=warning msg="Rewriting snapshots 21300-21400 of 21375..." func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:152]"
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: time="2021-08-09T12:52:11.596932232Z" level=warning msg="State storage upgrade complete" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:259]"
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: time="2021-08-09T12:52:11.651450905Z" level=panic msg="failed to connect to room server db" func="NewInternalAPI\n\t" file=" [roomserver.go:54]" error="RunDeltas: Failed run migration: ERROR 2021041615092700_state_blocks_refactor.go: failed to run Go migration function func(*sql.Tx) error: 1 events exist in roomserver_events which have not been converted to a new state_snapshot_nid; this is a bug, please report"
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: panic: (*logrus.Entry) 0xc00018a150
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: goroutine 1 [running]:
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: github.com/sirupsen/logrus.(*Entry).log(0xc00018a0e0, 0x0, 0xc0003242a0, 0x23)
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]:         github.com/sirupsen/logrus@v1.8.1/entry.go:259 +0x2e5
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: github.com/sirupsen/logrus.(*Entry).Log(0xc00018a0e0, 0x0, 0xc000193bf0, 0x1, 0x1)
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]:         github.com/sirupsen/logrus@v1.8.1/entry.go:293 +0x86
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: github.com/sirupsen/logrus.(*Entry).Logf(0xc00018a0e0, 0x0, 0x1234c67, 0x23, 0x0, 0x0, 0x0)
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]:         github.com/sirupsen/logrus@v1.8.1/entry.go:338 +0xe5
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: github.com/sirupsen/logrus.(*Entry).Panicf(...)
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]:         github.com/sirupsen/logrus@v1.8.1/entry.go:376
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: github.com/matrix-org/dendrite/roomserver.NewInternalAPI(0xc000189170, 0x139b040, 0xc0004008a0, 0xc00012ef80, 0x13bd788)
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]:         github.com/matrix-org/dendrite/roomserver/roomserver.go:54 +0x20c
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]: main.main()
Aug 09 07:52:11 dendrite-host dendrite-monolith-server[160547]:         github.com/matrix-org/dendrite/cmd/dendrite-monolith-server/main.go:89 +0x271
Aug 09 07:52:11 dendrite-host systemd[1]: dendrite.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 09 07:52:11 dendrite-host systemd[1]: dendrite.service: Failed with result 'exit-code'.

Happy to provide additional info if necessary.

kegsay commented 3 years ago

I'll add some additional logging here and let you know when you can try again.

jellykells commented 3 years ago

Any updates on this?

kegsay commented 3 years ago

I'll look into this this week and push some logs to master. I've been on holiday for a bit.

kegsay commented 3 years ago

@jellykells and @anton-molyboha please try the latest master and give me the log output when it fails (in this issue should be fine). There will be queries you can run to manually fix this, but it will involve data loss for the affected room.

anton-molyboha commented 3 years ago

@kegsay The last messages in the output:

time="2021-09-07T16:38:28.883294305Z" level=warning msg="Rewriting snapshots 21200-21300 of 21244..." func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:152]"
time="2021-09-07T16:38:40.035361023Z" level=error msg="Affected row: room_id=!xxyFVdFcocTnFefXVY:matrix.org snapshot=3967 last_sent=794889" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:289]"
time="2021-09-07T16:38:40.035934574Z" level=error msg="To fix this manually, run this query first then retry the migration: UPDATE roomserver_rooms SET state_snapshot_nid=0 WHERE room_id='!xxyFVdFcocTnFefXVY:matrix.org'" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:293]"
time="2021-09-07T16:38:40.035974398Z" level=error msg="Running this UPDATE will cause the room in question to become unavailable on this server. Leave and re-join the room afterwards." func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:295]"
time="2021-09-07T16:38:40.037369382Z" level=warning msg="State storage upgrade complete" func="UpStateBlocksRefactor\n\t" file=" [2021041615092700_state_blocks_refactor.go:297]"
time="2021-09-07T16:38:40.071962625Z" level=panic msg="failed to connect to room server db" func="NewInternalAPI\n\t" file=" [roomserver.go:54]" error="RunDeltas: Failed run migration: ERROR 2021041615092700_state_blocks_refactor.go: failed to run Go migration function func(*sql.Tx) error: 1 rooms exist in roomserver_rooms which have not been converted to a new state_snapshot_nid; this is a bug, please report"
panic: (*logrus.Entry) 0xc000240460

I will try following the instructions and report here how it goes.

anton-molyboha commented 3 years ago

Running that "UPDATE" SQL statement solved the issue! I am now running commit 7dc8fb1fe780888d5758bf665949e9e09d21e56a. Thank you!

jellykells commented 2 years ago

I ended up resetting my Dendrite server due to other issues, but I believe this specific issue was also fixed for me by running the suggested SQL statement added in #2005 , so I'll go ahead and close this.