typokign / matrix-chart

Helm chart for deploying a Matrix homeserver stack
MIT License
89 stars 48 forks source link

Error with Riot-Web if 404 responses from the API are intercepted by the ingress #24

Closed Routhinator closed 4 years ago

Routhinator commented 4 years ago

I keep getting this from the riot-web client after creating my account:

Screenshot from 2020-06-01 22-07-41

And there's a 404 querying:

https://<domain>/_matrix/client/unstable/room_keys/version

It looks like the chart has the latest versions of both, and there are no errors from the synapse container... not sure what could be causing this.

typokign commented 4 years ago

Looks like it might be a harmless timeout?

Can you try updating to v1.4.2 of the chart? I just fixed a small issue with the Riot config that might have been related. If not, I would ask around #riot-web:matrix.org.

Routhinator commented 4 years ago

I'll update however I was digging more and I need to dig up the issue for riot-app I came across but it's not a timeout or a 401, which are other issues that lead to this issue. The problem is the riot-web app is querying _matrix/client/unstable/room_keys/version when the Synapse API has been update and this should now be _matrix/client/r0/room_keys/version - there's some sort of version mismatch.

This is definitely an appbug, but the mismatch of versions leading to this is technically a bug in this chart. I have not been able to successfully get a smooth signup experience, and it's impossible to get encryption keys created due to this error, so none of my messages are encrypted. There are a number of other errors/bugs which may be related to this initial setup fail that I am seeing.

Routhinator commented 4 years ago

Found the issue I was referring to https://github.com/vector-im/riot-android/issues/3072

I updated an have the same issue. Also the issue only occurs against the matrix/synapse server deployed by this chart... matrix.org does not cause this issue.

The result right now is I cannot create or backup my encryption keys and cannot preserve message history as we result because of the 404 errors.

typokign commented 4 years ago

This is definitely an appbug, but the mismatch of versions leading to this is technically a bug in this chart.

Can you clarify what you mean by this? The chart is using the official docker images of the latest stable versions of Synapse and Riot.

By any chance would you happen to have an old homeserver at the same domain? It's possible that Riot could be looking up the .well-known configuration and talking to the out-of-date server.

That's the only cause I can think of at the moment, there definitely should not be a compatibility issue between the latest versions of Synapse/Riot

Routhinator commented 4 years ago

I've never deployed anything in regards to matrix before. There are no old servers. I just took this chart and used it to deploy about two days before I put in my first MR.. and this was on a clean kube cluster.

The MR I put in was a first round of crossing off errors from logs to try to get to the bottom of this issue, but after removing the errors I've seen. There are no errors in regards to this. Riot 1.6.2 is trying to query an API on Synapse 1.14.0 that does not exist. And thats using the defaults plus some required settings from this chart.....

You can test my instance here: https://riot.routh.io - point it to matrix.routh.io for synapse. I'll be able to dig more on the weekend.. maybe I'll finally find a config problem. But so far I cannot get around this problem on a fresh install of this chart with these two versions..

Routhinator commented 4 years ago

So I am poking some more to get to the bottom of this.. Tested riot.im > matrix.routh.io and it's the same behavior.

riot.routh.io > matrix.org - no problems.

So I think I can cross out it being the version of Riot as it's consistent.. It's either the version of Synapse, or a setup problem with Synapse.. still digging.


Dropped back to Synapse 1.13.0 just to see.. no change in the behaviour. So must be something configuration related..

Routhinator commented 4 years ago

Looks like there are a lot of 404 errors breaking the API on this deployment. Did more testing and realized I cannot create new rooms because it cannot query to verify the room name is available due to a bunch more 404 not founds....

69.165.224.135 - - [05/Jun/2020:00:12:28 +0000] "OPTIONS /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 200 2 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 53 0.006 [matrix-routh-io-matrix-synapse-80] [] 10.245.84.90:80 2 0.008 200 88b3b9b80e6a2dc36c512eaf13a936a2
69.165.224.135 - - [05/Jun/2020:00:12:28 +0000] "OPTIONS /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 200 2 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 53 0.011 [matrix-routh-io-matrix-synapse-80] [] 10.245.84.90:80 2 0.012 200 d7ae47d79b12036e6fe741e8c93dc5a8
69.165.224.135 - - [05/Jun/2020:00:12:28 +0000] "GET /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 404 80 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 276 0.024 [upstream-default-backend] [] 10.245.84.90:80 : 10.244.0.156:8080 0 : 63 0.020 : 0.004 404 : 404 047b6f96860cafecf12d28ce9bb3ee9c
69.165.224.135 - - [05/Jun/2020:00:12:28 +0000] "GET /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 404 80 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 276 0.040 [upstream-default-backend] [] 10.245.84.90:80 : 10.244.0.156:8080 0 : 63 0.032 : 0.008 404 : 404 f10e9da4961d94e2f3768118747c1155
69.165.224.135 - - [05/Jun/2020:00:12:28 +0000] "GET /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 404 80 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 276 0.054 [upstream-default-backend] [] 10.245.84.90:80 : 10.244.1.106:8080 0 : 63 0.032 : 0.024 404 : 404 a53ee4f3c45618bb0c24b97b866b376a
69.165.224.135 - - [05/Jun/2020:00:12:29 +0000] "OPTIONS /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 200 2 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 53 0.007 [matrix-routh-io-matrix-synapse-80] [] 10.245.84.90:80 2 0.008 200 0246f75192f4317d8231946eb6b76a95
69.165.224.135 - - [05/Jun/2020:00:12:29 +0000] "OPTIONS /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 200 2 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 53 0.011 [matrix-routh-io-matrix-synapse-80] [] 10.245.84.90:80 2 0.012 200 e7c6b76a7a8498cbb8916adcb4542581
69.165.224.135 - - [05/Jun/2020:00:12:29 +0000] "GET /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 404 80 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 276 0.020 [upstream-default-backend] [] 10.245.84.90:80 : 10.244.1.106:8080 0 : 63 0.012 : 0.008 404 : 404 c49bfb6cd669f5722c7c3097bd7693f8
69.165.224.135 - - [05/Jun/2020:00:12:29 +0000] "GET /_matrix/client/r0/directory/room/%23welcome%3Arouth.io HTTP/2.0" 404 80 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" 276 0.025 [upstream-default-backend] [] 10.245.84.90:80 : 10.244.0.156:8080 0 : 63 0.008 : 0.016 404 : 404 3d94ac0bef145a67cf970cbf589297eb
typokign commented 4 years ago

Sounds like a database issue. Are you using the included postgresql chart or your own database?

Routhinator commented 4 years ago

I'm using the included chart. And I was just looking at the DB logs and I can see the queries from creating my account.. No errors or querys from the 404 urls.

Also I don't see any errors on the Synapse logs.

Synapse logs

$ kubectl -n matrix logs -f service/routh-io-matrix-synapse                                                                   
Synapse requires that a list of trusted key servers are specified in order to
provide signing keys for other servers in the federation.

This homeserver does not have a trusted key server configured in
homeserver.yaml and will fall back to the default of 'matrix.org'.

Trusted key servers should be long-lived and stable which makes matrix.org a
good choice for many admins, but some admins may wish to choose another. To
suppress this warning, the admin should set 'trusted_key_servers' in
homeserver.yaml to their desired key server and 'suppress_key_server_warning'
to 'true'.

In a future release the software-defined default will be removed entirely and
the trusted key server will be defined exclusively by the value of
'trusted_key_servers'.
--------------------------------------------------------------------------------
2020-06-04 23:34:01,553 - root - 257 - WARNING - None - ***** STARTING SERVER *****
2020-06-04 23:34:01,554 - root - 258 - WARNING - None - Server /usr/local/lib/python3.7/site-packages/synapse/app/homeserver.py version 1.14.0
Routhinator commented 4 years ago

I think my biggest issue here is log level being at warning. Going to put in a MR to allow loglevel to be configured on Synapse

Routhinator commented 4 years ago

Loglevel bump is definitely helpful..

It almost looks like these 404 errors are to the API spec.. and riot isn't handling them correctly..

Query for room availablity

2020-06-05 02:57:09,152 - synapse.access.http.8008 - 248 - DEBUG - GET-25 - 69.165.224.135 - 8008 - Received request: GET /_matrix/client/r0/directory/room/%23welcome%3Arouth.io
2020-06-05 02:57:09,154 - synapse.storage.txn - 404 - DEBUG - GET-25 - [TXN START] {get_association_from_room_alias-38}
2020-06-05 02:57:09,155 - synapse.storage.SQL - 223 - DEBUG - GET-25 - [SQL] {get_association_from_room_alias-38} SELECT room_id FROM room_aliases WHERE room_alias = ?
2020-06-05 02:57:09,155 - synapse.storage.SQL - 228 - DEBUG - GET-25 - [SQL values] {get_association_from_room_alias-38} ['#welcome:routh.io']
2020-06-05 02:57:09,163 - synapse.storage.SQL - 242 - DEBUG - GET-25 - [SQL time] {get_association_from_room_alias-38} 0.006738 sec
2020-06-05 02:57:09,164 - synapse.storage.txn - 491 - DEBUG - GET-25 - [TXN END] {get_association_from_room_alias-38} 0.009967 sec
2020-06-05 02:57:09,166 - synapse.http.server - 81 - INFO - GET-25 - <XForwardedForRequest at 0x7faed2a42a10 method='GET' uri='/_matrix/client/r0/directory/room/%23welcome%3Arouth.io' clientproto='HTTP/1.1' site=8008> SynapseError: 404 - Room alias #welcome:routh.io not found

This gives me a "room name is in use" error in the UI....

And secret storage (the error I opened this for):

2020-06-05 03:03:41,093 - synapse.access.http.8008 - 311 - INFO - OPTIONS-174 - 69.165.224.135 - 8008 - {None} Processed request: 0.004sec/-0.000sec (0.002sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/unstable/room_keys/version HTTP/1.1" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" [0 dbevts]
2020-06-05 03:03:41,167 - synapse.access.http.8008 - 248 - DEBUG - GET-175 - 69.165.224.135 - 8008 - Received request: GET /_matrix/client/unstable/room_keys/version
2020-06-05 03:03:41,169 - synapse.util.async_helpers - 242 - DEBUG - GET-175 - Acquired uncontended linearizer lock 'upload_room_keys_lock' for key '@routhinator:routh.io'
2020-06-05 03:03:41,170 - synapse.storage.txn - 404 - DEBUG - GET-175 - [TXN START] {get_e2e_room_keys_version_info-15c}
2020-06-05 03:03:41,171 - synapse.storage.SQL - 223 - DEBUG - GET-175 - [SQL] {get_e2e_room_keys_version_info-15c} SELECT MAX(version) FROM e2e_room_keys_versions WHERE user_id=? AND deleted=0
2020-06-05 03:03:41,171 - synapse.storage.SQL - 228 - DEBUG - GET-175 - [SQL values] {get_e2e_room_keys_version_info-15c} ('@routhinator:routh.io',)
2020-06-05 03:03:41,175 - synapse.storage.SQL - 242 - DEBUG - GET-175 - [SQL time] {get_e2e_room_keys_version_info-15c} 0.004365 sec
2020-06-05 03:03:41,176 - synapse.storage.SQL - 223 - DEBUG - GET-175 - [SQL] {get_e2e_room_keys_version_info-15c} SELECT version, algorithm, auth_data, etag FROM e2e_room_keys_versions WHERE user_id = ? AND version = ? AND deleted = ?
2020-06-05 03:03:41,176 - synapse.storage.SQL - 228 - DEBUG - GET-175 - [SQL values] {get_e2e_room_keys_version_info-15c} ['@routhinator:routh.io', None, 0]
2020-06-05 03:03:41,177 - synapse.storage.SQL - 242 - DEBUG - GET-175 - [SQL time] {get_e2e_room_keys_version_info-15c} 0.001081 sec
2020-06-05 03:03:41,178 - synapse.storage.database - 483 - DEBUG - GET-175 - [TXN FAIL] {get_e2e_room_keys_version_info-15c} 404: No row found (e2e_room_keys_versions)
2020-06-05 03:03:41,178 - synapse.storage.txn - 491 - DEBUG - GET-175 - [TXN END] {get_e2e_room_keys_version_info-15c} 0.007970 sec
2020-06-05 03:03:41,180 - synapse.util.async_helpers - 255 - DEBUG - GET-175 - Releasing linearizer lock 'upload_room_keys_lock' for key '@routhinator:routh.io'
2020-06-05 03:03:41,180 - synapse.http.server - 81 - INFO - GET-175 - <XForwardedForRequest at 0x7faed2a2a5d0 method='GET' uri='/_matrix/client/unstable/room_keys/version' clientproto='HTTP/1.1' site=8008> SynapseError: 404 - No backup found
2020-06-05 03:03:41,182 - synapse.access.http.8008 - 311 - INFO - GET-175 - 69.165.224.135 - 8008 - {@routhinator:routh.io} Processed request: 0.015sec/0.001sec (0.006sec, 0.002sec) (0.001sec/0.008sec/1) 51B 404 "GET /_matrix/client/unstable/room_keys/version HTTP/1.1" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" [0 dbevts]
Routhinator commented 4 years ago

Also validated that https://matrix.routh.io/.well-known/matrix/client works

typokign commented 4 years ago

Hate to say it, but this definitely looks like some database corruption. If you had to deploy the chart a few times to get it stable, that may have interrupted something's bootstrapping. I've deployed this chart for a few homeservers, and I've generally gotten into the practice of wiping the database as soon as I get that first stable deploy, just to let everything start over again.

If this is a fairly fresh install, I'd really like to see if this persists after a complete wipe. NB that the database isn't deleted when the chart is uninstalled, the PVC(s) for the postgres chart default to Retain. If you can, could you please try helm uninstall <release>, kubectl get pvc, kubectl delete pvc <any PVCs related to postgres>.

If you want to try and salvage this database, I would ask around #synapse:matrix.org to see what kind of brain surgery you need to give this thing.

Routhinator commented 4 years ago

I have no love for the DB, and in fact I've wiped it 3 times today while testing. Deleted the namspace and all PVCs and reran the job to deploy it.. :/

Just to back that up with output ;)

$ kubectl -n matrix get pvc                                              
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
data-routh-io-matrix-postgresql-0   Bound    pvc-3c16ed8d-a205-4054-b88a-d2595bbe2b2b   8Gi        RWO            do-block-storage   3h33m
routh-io-matrix-media-store         Bound    pvc-affebe2e-b50f-450d-9211-33119e498bb7   10Gi       RWO            do-block-storage   3h33m
routh-io-matrix-signing-key         Bound    pvc-e3a8ec20-494b-4dfd-819e-46fafb43b188   1Gi        RWO            do-block-storage   3h33m
Routhinator commented 4 years ago

I might have found the issue... the chart is only specifying ports for the client and federation resources.. All of the 404s are from resources that do not have a listener bound.. testing.

# Valid resource names are:
#
#   client: the client-server API (/_matrix/client), and the synapse admin
#       API (/_synapse/admin). Also implies 'media' and 'static'.
#
#   consent: user consent forms (/_matrix/consent). See
#       docs/consent_tracking.md.
#
#   federation: the server-server API (/_matrix/federation). Also implies
#       'media', 'keys', 'openid'
#
#   keys: the key discovery API (/_matrix/keys).
#
#   media: the media API (/_matrix/media).
#
#   metrics: the metrics interface. See docs/metrics-howto.md.
#
#   openid: OpenID authentication.
#
#   replication: the HTTP replication API (/_synapse/replication). See
#       docs/workers.md.
#
#   static: static resources under synapse/static (/_matrix/static). (Mostly
#       useful for 'fallback authentication'.)
#
#   webclient: A web client. Requires web_client_location to be set.
Routhinator commented 4 years ago

Looking a bit more carefully my aha moment negelected to realize all the key endpoints that are failing for me are under _matrix/client and thus not part of the keys resource...

typokign commented 4 years ago

Yeah, those listeners are only intended for workers when sharding Synapse (which is a huge pain and probably not something I'm going to attempt to implement in this chart, I'd rather wait for Dendrite).

The client listener automatically enables media and static, and the federation listener enables media, keys, and openid.

Routhinator commented 4 years ago

Finally found the issue here. Many ingress-nginx installs have proxy_intercept_errors on by default and return a nice or default error page on a 404 instead of the error from the backend. But in this case.. the 404 responses are part of the API spec and Riot requires the proper json response to handle things correctly.

I've resolved this by adding the following annotation for the synapse ingress:

      nginx.ingress.kubernetes.io/configuration-snippet: |
        proxy_intercept_errors off;

Now riot gets this response:

{
    "errcode": "M_NOT_FOUND",
    "error": "No backup found"
}

Which is what it is expecting.

typokign commented 4 years ago

Aha! That explains it. I'll put up a patch to include that by default. Come to think of it, I think I may have been too quick to dismiss any nginx-ingress-provider-specific patches yesterday. Since these are just annotations, I think it's perfectly reasonable to include them by default, and just make a note in the docs that they won't do anything if you're using a different Ingress provider (but they also won't break anything either).

So I'll also try to get around to adding that .well-known/matrix/server addition, now that I know we can include it without needing an extra Pod.

Thanks!

typokign commented 4 years ago

84f2799

hexxone commented 1 year ago

Finally found the issue here. Many ingress-nginx installs have proxy_intercept_errors on by default and return a nice or default error page on a 404 instead of the error from the backend. But in this case.. the 404 responses are part of the API spec and Riot requires the proper json response to handle things correctly.

Thank you so much! This was the hint I have also long been searching for.

I have a little unusual setup using "Traefik" and a nginx proxy and was using custom error-pages on my nginx router. Since it was falsely showing as "CORS" error I could not figure out the problem for the love of god ...

But simply removing the error-pages middleware from my proxy nginx fixxed the problem.

DX37 commented 1 year ago

Finally found the issue here. Many ingress-nginx installs have proxy_intercept_errors on by default and return a nice or default error page on a 404 instead of the error from the backend. But in this case.. the 404 responses are part of the API spec and Riot requires the proper json response to handle things correctly.

I've resolved this by adding the following annotation for the synapse ingress:

      nginx.ingress.kubernetes.io/configuration-snippet: |
        proxy_intercept_errors off;

Now riot gets this response:

{
    "errcode": "M_NOT_FOUND",
    "error": "No backup found"
}

Which is what it is expecting.

Oh. My. God. I searched for MONTHS, why my self-hosted synapse installation always says unable to query secret storage status or Secret storage creation canceled in web logs. And FINALLY, your message hinted me to check nginx.conf for similar strings. And there was proxy_intercept_errors on;, and after deleting it all works again. I was probably trying to add own error pages too, but after I wrote them, there wasn't need in that switch.

Thanks!