matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

Synapse segfaults on alpinelinux #1138

Closed pbruna closed 6 years ago

pbruna commented 8 years ago

Hi, Im running Synapse v0.18.0 with Python 2.7.12. The users are authenticated agains our LDAP server. When using Riot.im on iOS everything works ok, but the instant I login with Chrome, or Safari, the servers crash and I guess the DB gets corrupted because I can't even use the iOS App anymore.

The las request Riot make before the crash is always a synapse.rest.client.v2_alpha.sync, like:

2016-09-22 14:40:55,243 - synapse.access.https.8448 - 59 - INFO - GET-1- <IP_ADDRESS> - 8448 - Received request: GET /_matrix/client/r0/sync?filter=0&timeout=0&since=s1_2_0_1_1_1_1&access_token=<redacted>

Which produces the following log:

2016-09-22 14:40:55,243 - synapse.access.https.8448 - 59 - INFO - GET-1- 201.214.200.227 - 8448 - Received request: GET /_matrix/client/r0/sync?filter=0&timeout=0&since=s1_2_0_1_1_1_1&access_token=<redacted>
2016-09-22 14:40:55,245 - synapse.storage.txn - 224 - DEBUG - - [TXN START] {get_user_by_access_token-2}
2016-09-22 14:40:55,246 - synapse.storage.txn - 280 - DEBUG - - [TXN END] {get_user_by_access_token-2} 0.621094
2016-09-22 14:40:55,246 - synapse.util.logcontext - 243 - DEBUG - - Unexpected logging context: GET-1@7f045188cbb0 is not sentinel
2016-09-22 14:40:55,247 - synapse.rest.client.v2_alpha.sync - 114 - INFO - GET-1- /sync: user=DomainSpecificString(localpart=u'username', domain=u'matrix.example.com'), timeout=0, since='s1_2_0_1_1_1_1', set_presence='online', filter_id='0', device_id=u'UNNJJGIJAV'
2016-09-22 14:40:55,247 - synapse.metrics - 212 - INFO - - Collecting gc 0
2016-09-22 14:40:55,248 - synapse.storage.txn - 224 - DEBUG - - [TXN START] {insert_client_ip-3}
2016-09-22 14:40:55,249 - synapse.metrics - 212 - INFO - - Collecting gc 1
2016-09-22 14:40:55,249 - synapse.storage._base - 483 - DEBUG - - [SQL] UPDATE user_ips SET last_seen = ? WHERE access_token = ? AND ip = ? AND user_id = ? AND user_agent = ? AND device_id = ? Args=[1474555255247, 'MDAyMGxvY2F0aW9uIG1hdHJpeC56Ym94YXBwLmNvbQowMDEzaWRlbnRpZmllciBrZXkKMDAxMGNpZCBnZW4gPSAxCjAwMmRgPCAxNDc0NTU4NDUyNDQ4CjAwMmZzaWduYXR1cmUgU08gtiGusp7e3by4rWozEPvS4ZPLmNJvopHRCcuwXU8K', '201.224.209.256', u'@username:matrix.example.com', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36', u'U769GIJAV']
2016-09-22 14:40:55,275 - synapse.storage.txn - 280 - DEBUG - - [TXN END] {insert_client_ip-3} 27.266113
2016-09-22 14:40:55,276 - synapse.storage.txn - 224 - DEBUG - - [TXN START] {get_user_filter-4}
2016-09-22 14:40:55,276 - synapse.storage.txn - 280 - DEBUG - - [TXN END] {get_user_filter-4} 0.373047
2016-09-22 14:40:55,278 - synapse.util.logcontext - 243 - DEBUG - GET-1- Unexpected logging context: sentinel is not GET-1@7f045188cbb0
2016-09-22 14:40:55,278 - synapse.storage.txn - 224 - DEBUG - GET-1- [TXN START] {get_updated_account_data_for_user-5}
/start.sh: line 57:    16 Segmentation fault      (core dumped) python -m synapse.app.homeserver --config-path /data/homeserver.yaml

Note: I changed the real username, domain and IPAddress.

Before the Crash, Riot make the following request without problem:

  1. https://matrix.example.com/_matrix/client/r0/presence/username/status
  2. https://matrix.example.com/_matrix/client/unstable/keys/upload/JZBZGPVPPK?access_token=
  3. https://matrix.example.com/_matrix/client/r0/voip/turnServer?access_token
  4. https://matrix.example.com/_matrix/client/r0/pushrules/?access_token
  5. https://matrix.example.com/_matrix/client/r0/user/user/filter?access_token
  6. https://matrix.example.com/_matrix/client/r0/sync?filter=0&timeout=30000&_cacheBuster=1474554706574&access_token
  7. https://matrix.example.com/_matrix/client/unstable/sendToDevice/m.new_device/m147455436634.0?access_token=
  8. https://matrix.example.com/_matrix/client/r0/user/username/account_data/m.direct?access_token
  9. https://matrix.example.com/_matrix/client/r0/publicRooms?access_token=

Tha last return an empty {} because there is no public rooms. And the next request is the one that crash the server:

https://matrix.example.com/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_2_0_1_1_1_1&access_token=

Yesterday I debuged this and found that the crash happen when the json.loads method runs, of the file storage/account_data.py:

global_account_data = {     
  row["account_data_type"]: json.loads(row["content"]) for row in rows
}

The SegFault Info of GDB is:

Core was generated by `python -m synapse.app.homeserver --config-path /data/homeserver.yaml'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f04551a3a30 in JSON_DecodeObject () from /usr/lib/python2.7/site-packages/ujson.so

And the trace:

#0  0x00007f04551a3a30 in JSON_DecodeObject () from /usr/lib/python2.7/site-packages/ujson.so
#1  0x00007f04551a198f in JSONToObj () from /usr/lib/python2.7/site-packages/ujson.so
#2  0x00007f045b330cd8 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#3  0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#4  0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#5  0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#6  0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#7  0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#8  0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#9  0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#10 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#11 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#12 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#13 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#14 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#15 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#16 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#17 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#18 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#19 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#20 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#21 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#22 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#23 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#24 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#25 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#26 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#27 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#28 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#29 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#30 0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#31 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#32 0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#33 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#34 0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#35 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#36 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#37 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#38 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#39 0x00007f045b330dfc in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#40 0x00007f045b330dfc in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#41 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#42 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#43 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#44 0x00007f045b2c5d07 in ?? () from /usr/lib/libpython2.7.so.1.0
#45 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#46 0x00007f045b32ac88 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.7.so.1.0
#47 0x00007f045b357a95 in ?? () from /usr/lib/libpython2.7.so.1.0
#48 0x00007f045b6674ce in ?? () from /lib/ld-musl-x86_64.so.1
#49 0x0000000000000000 in ?? ()
richvdh commented 8 years ago

I think this is an issue with the docker image. I suggest raising it at https://github.com/silvio/docker-matrix.

silvio commented 8 years ago

The problem isn't docker related. This problem exists on all alpine installations without docker.

jcgruenhage commented 7 years ago

I have this issue too, on Synapse v0.18.7, would you please reopen the issue?

richvdh commented 7 years ago

It's an issue with alpine linux, not synapse.

jcgruenhage commented 7 years ago

okay, thanks

jcgruenhage commented 7 years ago

Has this issue been reported on their side somewhere? I can not find anything in their bugtracker. Edit: After searching some more, and still not finding anything, I posted a bug report: https://bugs.alpinelinux.org/issues/6765

ptman commented 7 years ago

Yes, same problem. Can synapse avoid using ujson?

jcgruenhage commented 7 years ago

@ptman I don't think it will. Dendrite however will most likely not use ujson, once that is ready. Until then, I guess you'll just need to avoid alpine linux for running synapse.

ptman commented 7 years ago

I also tried to report this problem to ujson https://github.com/esnme/ultrajson/issues/254

richvdh commented 7 years ago

We're not going to stop using ujson just because alpinelinux ship a broken version of it, no.

richvdh commented 7 years ago

(or rather, whatever brokenness it is that alpinelinux has)

pbruna commented 7 years ago

@richvdh I second that, but I think it will be a good idea to Throw a better error when the json parsing fails.

richvdh commented 7 years ago

A segfault takes out the whole python process. We can't catch it and recover from within python.

kdltr commented 7 years ago

Ugly workaround: https://github.com/kdltr/synapse/commit/4d6cfa4363cd4a8f9ea91cfdea9e4dac82489f9b

ara4n commented 7 years ago

hah! well, i guess that's one way to work around the segfaulting dep :D

jcgruenhage commented 7 years ago

How much faster than json is ujson anyway?

DmitrySandalov commented 7 years ago

The following workaround seems to be working for me: https://github.com/esnme/ultrajson/issues/254#issuecomment-314862445

richvdh commented 7 years ago

Despite wishful thinking on my part, this doesn't seem to have been fixed at ultrajson or alpinelinux, so it looks like we're going to have to find a workaround for this.

richvdh commented 7 years ago

How much faster than json is ujson anyway?

"much", according to erik.

jcgruenhage commented 7 years ago

@richvdh hasn't the workaround already been found? s/ujson/json/g It's a simple enough patch.

jcgruenhage commented 7 years ago

also, there is an open PR, they seem to be working on it over at ujson: https://github.com/esnme/ultrajson/pull/281

richvdh commented 7 years ago

@richvdh hasn't the workaround already been found? s/ujson/json/g

We'd prefer not to drop all use of ujson, for performance reasons.

richvdh commented 7 years ago

Could anyone who has been having trouble with this see if the patch in https://github.com/matrix-org/synapse/commit/e961569e9330da2108fbf75d8d51938a953c8469 helps them?

jcgruenhage commented 7 years ago

I've switched to Debian as a base long ago, but I still have the dockerfile from when I tried to use alpine. Do you want it?

richvdh commented 7 years ago

I was trying to avoid setting up docker environments and hoping somebody else could test it...

b4nst commented 7 years ago

I've tested your commit in this image. So far so good. I will come back here if I notice something bad.

clandmeter commented 6 years ago

For anyone who is following this discussion. We added a fix for stacksize issues within python which should fix the ujson segfault.

Its in alpine edge and backported to v3.7. Hope somebody can verify this so we can close this issue.

ref: https://github.com/alpinelinux/aports/commit/2f35283fec8ec451fe5fb477dd32ffdcc0776e89

ptman commented 6 years ago

@clandmeter Seems to work great, thanks @richvdh e961569 no longer seems necessary

Anyone interested can take a look at https://github.com/ptman/synapse-docker and https://hub.docker.com/r/ptman/synapse/

richvdh commented 6 years ago

great news. @silvio any chance you could bump your docker images to include the fix?

jcgruenhage commented 6 years ago

@richvdh: @silvio has stopped maintaining the docker image, and it was Debian based for more than a year now anyway.

silvio commented 6 years ago

@jcgruenhage @richvdh : This is true, the current state of the dockerimages are using debian. I have added @andreaspeters as contributor to my docker repository. Maybe he wants to switch back to alpine based distributions.

andreaspeters commented 6 years ago

@silvio @richvdh @jcgruenhage : I my fork I changed to "debian:stable-slim". Before I will merge it with the official docker-matrix version, I need some peoples who like to test it (https://hub.docker.com/r/avhost/docker-matrix/tags/). Just now, I have no plan to change back to alpine.

richvdh commented 6 years ago

thanks all. either way it sounds like we can finally close this bug.