matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

"Error binding parameter" with sqlite on synapse 1.5 #6360

Closed msirringhaus closed 4 years ago

msirringhaus commented 4 years ago

Description

After updating from 1.4 to 1.5.1 my homeserver is acting very weird. Various messages are not delivered for hours, then all of a sudden get delivered. I did see these in the logs for various users:

 POST-16984 - [TXN OPERROR] {store_device-fc23} SQL logic error or missing database 0/5
 POST-16984 - store_device with device_id=str('IXLAESIPAU') user_id=str('@[...snip...]') display_name=NoneType(None) failed: Error binding parameter 0 - probably unsupported type.

Its a small server, so I was running on sqlite for several years. Releasenotes of 1.5 stated that this should not be used, I figured my problems might stem from that.

So I migrated to postgres according to this

Now I see on startup:

Nov 13 09:00:16 server python[18567]: 2019-11-13 09:00:16,123 - synapse.storage.data_stores.main.pusher - 51 - WARNING - start_pushers-0 - Invalid JSON in data for pusher 13
871: \x7b22666f726d[...snip...]79227d, Expecting value: line 1 column 1 (char 0)                                                                                                                                                
Nov 13 09:00:16 server python[18567]: 2019-11-13 09:00:16,124 - synapse.storage.data_stores.main.pusher - 51 - WARNING - start_pushers-0 - Invalid JSON in data for pusher 18
645: \x7b22666f726d6[...snip...]79227d, Expecting value: line 1 column 1 (char 0)                                                                                                                                                
Nov 13 09:00:16 server python[18567]: 2019-11-13 09:00:16,125 - synapse.storage.data_stores.main.pusher - 51 - WARNING - start_pushers-0 - Invalid JSON in data for pusher 13
750: \x7b22666f726d6[....snip...]79227d, Expecting value: line 1 column 1 (char 0)   
...
lots of those. Then for each user:
...
Nov 13 09:00:16 server python[18567]: 2019-11-13 09:00:16,142 - synapse.push.pusherpool - 263 - WARNING - start_pushers-0 - Pusher incorrectly configured user=@[...snip...], appid=im.vector.app.android, pushkey=eNsixsGcj[...snip...]vyb36t4_ftxG: data can not be null for HTTP pusher
Nov 13 09:00:16 server python[18567]: 2019-11-13 09:00:16,148 - synapse.push.pusherpool - 263 - WARNING - start_pushers-0 - Pusher incorrectly configured user=@[...snip...], appid=im.vector.app.android, pushkey=dzY9eZ73c[...snip...]ereDezVYk: data can not be null for HTTP pusher
...

Some contact seem to work fine (federated and local), others not.

After migrating to postgres, I have the (additional?) problem of not being able to open some encrypted messages (even my own message): * Unable to decrypt: The sender's device has not sent us the keys for this message. ** Hitting "request again" does nothing. I tested android and web, cleared the cache and reloaded. Still not able to get the keys from other devices and decrypt my own messages.

One additional thing: CPU usage of synapse rose from almost 0% to constant 6% after the migration

Version information

If not matrix.org:

msirringhaus commented 4 years ago

Follow-up: I'm also seeing this:

synapse.storage._base - 435 - WARNING - POST-464 - [TXN OPERROR] {insert_linearized_receipt-71c} could not serialize access due to concurrent update
 0/5

And my nginx-frontend is spewing a lot of these:

    400 Bad Request
       /_matrix/client/unstable/keys/upload/[...snip...]
       /_matrix/client/r0/keys/upload/[...snip...]

for different keys.

After setting the logging-level to INFO, I'm also seeing this:

synapse.access.http.8008 - 233 - INFO - POST-3 - 178.27.192.174 - 8008 - Received request: POST /_matrix/client/unstable/keys/upload/COZTINCKLI
synapse.handlers.e2e_keys - 434 - INFO - POST-3 - Updating device_keys for device 'COZTINCKLI' for user @[...snip...] at 1573637464623
synapse.http.server - 80 - INFO - POST-3 - <XForwardedForRequest at 0x743d4690 method='POST' uri='/_matrix/client/unstable/keys/upload/COZTINCKLI' clientproto='HTTP/1.0' site=8008> 
SynapseError: 400 - The device ID is in use
msirringhaus commented 4 years ago

Ok, this seems to be (partially?) this issue: https://github.com/matrix-org/synapse/issues/6302

I created a new postgres-table, updated the old sqlite-db and reimported. Now it seems a little bit smoother, after cleaning the caches of my clients again and re-verifiying a ton of devices. I hope not all of my users have to do this, though. Either way, I need to test a bit more.

CPU-load is down again.

Pusher-error messages are still there on startup.

pu11em commented 4 years ago

I am also getting these messages after updating to 1.5.1 using SQLite

synapse.storage._base - 435 - WARNING - POST-10290 - [TXN OPERROR] {store_device-70b5} SQL logic error 0/5
synapse.storage.data_stores.main.devices - 699 - ERROR - POST-10290 - store_device with device_id=str('DOTVKJYSNT') user_id=str('@username:server.org') display_name=NoneType(None) failed: Error binding parameter 0 - probably unsupported type.
synapse.storage._base - 435 - WARNING - POST-10317 - [TXN OPERROR] {store_device-70ef} SQL logic error 0/5
synapse.storage.data_stores.main.devices - 699 - ERROR - POST-10317 - store_device with device_id=str('DOTVKJYSNT') user_id=str('@username:server.org') display_name=NoneType(None) failed: Error binding parameter 0 - probably unsupported type.
msirringhaus commented 4 years ago

@pu11em : Not sure if thats the problem, but can you try to adjust your sqlite-database, according to #6302 (make zeros out of the string 'FALSE'). See here on how to do this: https://github.com/matrix-org/synapse/pull/6313/files

But with postgres and modified hidden-attribute, I still get lots of db-errors:

 handle_presence_timeouts-5177 - [TXN OPERROR] {update_presence-4b98} could not serialize access due to concurrent update
 Nov 13 18:28:50 server python[3166]:  0/5
 handle_presence_timeouts-5693 - [TXN OPERROR] {update_presence-5069} could not serialize access due to concurrent update
 Nov 13 19:11:50 server python[3166]:  0/5
 handle_presence_timeouts-6977 - [TXN OPERROR] {update_presence-6051} could not serialize access due to concurrent update
 Nov 13 20:58:50 server python[3166]:  0/5
 handle_presence_timeouts-7709 - [TXN OPERROR] {update_presence-66f6} could not serialize access due to concurrent update
 Nov 13 21:59:50 server python[3166]:  0/5
 handle_presence_timeouts-7889 - [TXN OPERROR] {update_presence-6903} could not serialize access due to concurrent update
 Nov 13 22:14:50 server python[3166]:  0/5

Also push-notifiers are not working, and read-receipt very flaky.

bessw commented 4 years ago

I've got the same problem after upgrading from 1.4.1 to 1.5.1, I'm also using sqlitedb, just a few users.

I roled back to my backup for now.

richvdh commented 4 years ago

people who are having this problem: please can you confirm which version of sqlite you have (something like dpkg -s libsqlite3-0 | grep Version should work)

msirringhaus commented 4 years ago

I migrated from sqlite Version: 3.16.2-5 to postgresql 9.6.13-0

pu11em commented 4 years ago

people who are having this problem: please can you confirm which version of sqlite you have (something like dpkg -s libsqlite3-0 | grep Version should work)

Version: 3.22.0-1ubuntu0.1

richvdh commented 4 years ago

thanks. can you also confirm which version of python you are using?

msirringhaus commented 4 years ago

Python 3.5.3

richvdh commented 4 years ago

I'm utterly failing to reproduce anything like this. would you be able to enable debug logging for the synapse.storage.SQL logger and paste the logs around the "unsupported type" error? (be a bit careful: that log level can include sensitive info like access tokens and passwords)

pu11em commented 4 years ago

I hope this helps!

https://pastebin.com/1vDHc6eP

richvdh commented 4 years ago

that's very strange... it suggests the problem is with the user_id parameter, but I can't see why that would cause a problem (nor why it would have become a problem in this release of synapse)

pu11em commented 4 years ago

It's ok, I guess I should move to postgresql. I was thinking about it, either way.

It is strange though that the first time this error appears in my log file is exactly after restarting my server with the 1.5.1 update. For the record my log file dates back in 2019-04-12 (synapse version 0.99.3)

The first restart after update https://pastebin.com/Hv4NkbSj

msirringhaus commented 4 years ago

So my current state, after migrating from sqlite to postgresql and manually adjusting the hidden-flag from 'FALSE' to 0:

 PUT-20979 - [TXN OPERROR] {add_messages_to_device_inbox-35b2d} could not serialize access due to concurrent update
 Nov 18 08:03:51 server python[3166]:  0/5
 PUT-20981 - [TXN OPERROR] {add_messages_to_device_inbox-35b2f} could not serialize access due to concurrent update
 Nov 18 08:03:51 server python[3166]:  0/5
 PUT-20982 - [TXN OPERROR] {add_messages_to_device_inbox-35b31} could not serialize access due to concurrent update
 Nov 18 08:03:51 server python[3166]:  0/5
 PUT-20982 - [TXN OPERROR] {add_messages_to_device_inbox-35b31} could not serialize access due to concurrent update
 Nov 18 08:03:51 server python[3166]:  1/5
 GET-21374 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21407 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21411 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21452 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21460 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21474 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21476 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21574 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21593 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-21599 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 persist_presence_changes-7334 - [TXN OPERROR] {update_presence-38335} could not serialize access due to concurrent update
 Nov 18 13:32:50 server python[3166]:  0/5
 GET-23219 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-23223 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-23225 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-23229 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-23723 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 GET-23725 - Invalid JSON in data for pusher 14888: \x7b226272616[..snip..]f74227d, Expecting value: line 1 column 1 (char 0)
 handle_presence_timeouts-90881 - [TXN OPERROR] {update_presence-39ee2} could not serialize access due to concurrent update
 Nov 18 17:30:50 server python[3166]:  0/5

Everything seems almost normal now, but I think some push-notifications are not going through. Also, these DB-errors worry me a bit.

richvdh commented 4 years ago

It's ok, I guess I should move to postgresql. I was thinking about it, either way.

maybe, but I'd really like to figure out what has gone wrong with sqlite.

@msirringhaus: please can we keep this on-topic to the sqlite "error binding parameter" problem. If you're having other problems with postgres, please open a separate issue.

msirringhaus commented 4 years ago

Sorry, I didn't notice you changed the topic of my initial report. Opening a follow-up bugeport.

pu11em commented 4 years ago

maybe, but I'd really like to figure out what has gone wrong with sqlite.

I would love that, too. What other info can I provide to you?

Is there any way I could delete/purge the devices causing those errors?

richvdh commented 4 years ago

@pu11em : I don't think it's specific to particular devices.

I'm going to have to think a bit harder about what could be causing this. I might send you a test patch to see if we can narrow it down any more.

richvdh commented 4 years ago

@pu11em it's quite difficult to see what's happening in your logs because it looks like you've set the log level to WARNING. Can you set it to INFO and again share some logs including the unsupported type error?

richvdh commented 4 years ago

@pu11em: also, can you confirm if you have made any database-level changes as suggested at https://github.com/matrix-org/synapse/issues/6360#issuecomment-553750982?

richvdh commented 4 years ago

I'm really failing to see what could cause this or even to come up with any good suggestions as to how we could narrow it down.

It's interesting that the 'error binding parameter' error doesn't happen on every call to store_device, but on those it does it is preceded by a SQL logic error. When we get the SQL logic error, we retry the transaction, and it's on the retry that the "Error binding parameter" error happens. Given that the parameter binding process should be identical for both attempts, this is very odd. So further questions include:

:man_shrugging:

it might be worth upgrading to synapse 1.6.0-rc1 which includes the fix at #6313, to see if it makes any difference.

pu11em commented 4 years ago

@pu11em it's quite difficult to see what's happening in your logs because it looks like you've set the log level to WARNING. Can you set it to INFO and again share some logs including the `unsupported type' error?

I changed that and I will wait for the logs and post them

@pu11em: also, can you confirm if you have made any database-level changes as suggested at #6360 (comment)?

I haven't done that yet. It's the next thing I will try.

Just a thought that might help you. I realized that the store_device error was generated for 3-4 users that have installed riot and riotx concurrently on their android devices (myself included). Sorry if this is totally irrelevant I am not a developer.

The logs I'm getting after changed log level to INFO for synapse and DEBUG for SQL

2019-11-22 03:28:54,790 - synapse.access.http.8008 - 302 - INFO - POST-494 - 178.59.62.237 - 8008 - {@user-1:server.org} Processed request: 0.036sec/0.001sec (0.011sec, 0.003sec) (0.008sec/0.008sec/3) 59B 400 "POST /_matrix/client/r0/keys/upload/DOTVKJYSNT HTTP/1.1" "RiotX/0.8.0 (Linux; U; Android 9; MI 5 Build/PQ3A.190801.00; Flavour GooglePlay; MatrixAndroidSDK_X 0.0.1)" [0 dbevts]
2019-11-22 03:28:54,794 - synapse.storage.SQL - 165 - DEBUG - POST-495 - [SQL] {store_device-8c8} INSERT INTO devices (user_id, device_id, display_name, hidden) VALUES(?, ?, ?, ?)
2019-11-22 03:28:54,795 - synapse.storage.SQL - 170 - DEBUG - POST-495 - [SQL values] {store_device-8c8} ('@user-1:server.org', 'DOTVKJYSNT', None, False)
2019-11-22 03:28:54,795 - synapse.storage.SQL - 184 - DEBUG - POST-495 - [SQL time] {store_device-8c8} 0.000370 sec
2019-11-22 03:28:54,800 - synapse.storage.SQL - 165 - DEBUG - POST-495 - [SQL] {_simple_select_one_onecol-8c9} SELECT hidden FROM devices WHERE user_id = ? AND device_id = ?
2019-11-22 03:28:54,801 - synapse.storage.SQL - 170 - DEBUG - POST-495 - [SQL values] {_simple_select_one_onecol-8c9} ['@user-1:server.org', 'DOTVKJYSNT']
2019-11-22 03:28:54,802 - synapse.storage.SQL - 184 - DEBUG - POST-495 - [SQL time] {_simple_select_one_onecol-8c9} 0.000144 sec
2019-11-22 03:28:54,804 - synapse.http.server - 80 - INFO - POST-495 - <XForwardedForRequest at 0x7faca2ba6320 method='POST' uri='/_matrix/client/r0/keys/upload/DOTVKJYSNT' clientproto='HTTP/1.1' site=8008> SynapseError: 400 - The device ID is in use

I am still waiting for the unsupported type error

bessw commented 4 years ago

Just a thought that might help you. I realized that the store_device error was generated for 3-4 users that have installed riot and riotx concurrently on their android devices (myself included). Sorry if this is totally irrelevant I am not a developer.

It happoned to 2 of 3 users on my server, one with RiotX and one without, for both users only the android smartphones were affected, the desktop version still works. But not all android devises were affected.

pu11em commented 4 years ago

some more logs with logger level set to INFO for synapse and DEBUG for sql

2019-11-22 14:09:18,046 - synapse.storage.SQL - 165 - DEBUG - POST-183239 - [SQL] {store_device-7c62a} INSERT INTO devices (user_id, device_id, display_name, hidden) VALUES(?, ?, ?, ?)
2019-11-22 14:09:18,048 - synapse.storage.SQL - 170 - DEBUG - POST-183239 - [SQL values] {store_device-7c62a} ('@user1:server.org', 'YMMISLXHWQ', None, False)
2019-11-22 14:09:18,050 - synapse.metrics - 462 - INFO -  - Collecting gc 1
2019-11-22 14:09:18,069 - synapse.storage.SQL - 184 - DEBUG - POST-183239 - [SQL time] {store_device-7c62a} 0.018059 sec
2019-11-22 14:09:18,074 - synapse.storage._base - 435 - WARNING - POST-183239 - [TXN OPERROR] {store_device-7c62a} SQL logic error 0/5
2019-11-22 14:09:18,079 - synapse.access.http.8008 - 233 - INFO - POST-183249 - IPaddress - 8008 - Received request: POST /_matrix/client/r0/keys/upload/BCTIZCJVVK
2019-11-22 14:09:18,080 - synapse.storage.SQL - 165 - DEBUG - POST-183239 - [SQL] {store_device-7c62a} INSERT INTO devices (user_id, device_id, display_name, hidden) VALUES(?, ?, ?, ?)
2019-11-22 14:09:18,084 - synapse.handlers.e2e_keys - 434 - INFO - POST-183249 - Updating device_keys for device 'BCTIZCJVVK' for user @user2:server.org at 1574424558084
2019-11-22 14:09:18,085 - synapse.storage.SQL - 170 - DEBUG - POST-183239 - [SQL values] {store_device-7c62a} ('@user1:server.org', 'YMMISLXHWQ', None, False)
2019-11-22 14:09:18,088 - synapse.storage.SQL - 184 - DEBUG - POST-183239 - [SQL time] {store_device-7c62a} 0.000122 sec
2019-11-22 14:09:18,090 - synapse.storage.SQL - 165 - DEBUG - POST-183242 - [SQL] {store_device-7c62b} INSERT INTO devices (user_id, device_id, display_name, hidden) VALUES(?, ?, ?, ?)
2019-11-22 14:09:18,091 - synapse.storage.SQL - 170 - DEBUG - POST-183242 - [SQL values] {store_device-7c62b} ('@user2:server.org', 'BCTIZCJVVK', None, False)
2019-11-22 14:09:18,093 - synapse.storage.data_stores.main.devices - 699 - ERROR - POST-183239 - store_device with device_id=str('YMMISLXHWQ') user_id=str('@user1:server.org') display_name=NoneType(None) failed: Error binding parameter 0 - probably unsupported type.
2019-11-22 14:09:18,096 - synapse.http.server - 80 - INFO - POST-183239 - <XForwardedForRequest at 0x7fac93d921d0 method='POST' uri='/_matrix/client/r0/keys/upload/YMMISLXHWQ' clientproto='HTTP/1.1' site=8008> SynapseError: 500 - Problem storing device.
2019-11-22 14:09:18,100 - synapse.access.http.8008 - 302 - INFO - POST-183239 - IP address - 8008 - {@user1:server.org} Processed request: 0.258sec/0.001sec (0.028sec, 0.000sec) (0.061sec/0.071sec/2) 57B 500 "POST /_matrix/client/r0/keys/upload/YMMISLXHWQ HTTP/1.1" "RiotX/0.8.0 (Linux; U; Android 9; Redmi Note 7 MIUI/V10.3.2.0.PFGEUX; Flavour GooglePlay; MatrixAndroidSDK_X 0.0.1)" [0 dbevts]
richvdh commented 4 years ago

thanks. can I see more complete logs? I'm interested in trying to correlate other requests from the same user. DM me if you'd rather not make the logs public: I'm @richvdh:sw1v.org.

richvdh commented 4 years ago

ok so I think what is happening here is that the python sqlite wrapper is using an invalid prepared sqlite statement object. (https://bugs.python.org/issue10513 discusses similar symptoms but I'm not convinced we have the same cause, not least because that bug was fixed in python 3.5.3/3.6.0.)

Essentially it seems that the python sqlite module maintains a cache of prepared sqlite statements objects. However, certain actions can cause a sqlite statement to get marked as finalized, and the python sqlite module doesn't necessarily know about it, which means that next time you try to run the same statement on the same connection, you get this (misleading) "Error binding parameter 0" error.

So the question is what is causing the statement to become invalidated without the python layer knowing about it, and given the one-to-one correlation with [TXN OPERROR] SQL logic error 0/5, I'd assume it's whatever is causing that. (Interestingly, in @pu11em's logs, there are a couple of occasions when it happens on a different statement (get_all_updated_receipts), though in that case I think it's probably related to #6403).

If I was forced to guess, I'd say the OPERROR is coming from an insert happening concurrently with a commit or rollback on another connection - some sort of concurrency issue, though I'm not particularly planning to dig into that.

so, long story short: I think this has always been a potential issue, but has been made much more likely - and visible - by #6302 which makes key uploads fail and therefore devices to hammer on the /keys/upload endpoint.

My expectation is that the problem will go away once #6302 is fixed (in synapse 1.6.0).

richvdh commented 4 years ago

@pu11em synapse 1.6.0 has been out for a few days now. Are you still seeing problems?

pu11em commented 4 years ago

Sorry for the late answer @richvdh. I wanted to be sure that everything was running good. I confirm that problem resolved with synapse 1.6.0. My cpu usage is back to normal, too.

Thank you for your time and amazing support. Keep up the good work.