typokign / matrix-chart

Helm chart for deploying a Matrix homeserver stack

Upgraded to 1.12.4 and now the pod restarts #20

Closed: danjenkins closed this issue 4 years ago

danjenkins commented 4 years ago

I upped the logging to INFO and it looks like it restarts during a federation loop

2020-04-28 12:33:26,445 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-531 - Connecting to skalarprodukt.de:8448
2020-04-28 12:33:26,447 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-697 - Connecting to matrix.allmende.io:443
2020-04-28 12:33:26,449 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-1011 - Connecting to chat.phuks.co:8448
2020-04-28 12:33:26,455 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-504 - Connecting to 10friends.info:8448
2020-04-28 12:33:26,457 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-572 - Connecting to tozein.com:8448
2020-04-28 12:33:26,459 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-501 - Connecting to synapse.asra.gr:443
2020-04-28 12:33:26,464 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-901 - Connecting to matrix.oldrevelminds.com:8448
2020-04-28 12:33:26,467 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-664 - Connecting to chat.138.io:8448
2020-04-28 12:33:26,469 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-297 - Connecting to matrix.rptc.bid:8448
2020-04-28 12:33:26,471 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-412 - Connecting to matrix.decent.fund:8448
2020-04-28 12:33:26,473 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-995 - Connecting to matrix.p6ril.fr:8448
2020-04-28 12:33:26,476 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-852 - Connecting to rduce.org:8448
2020-04-28 12:33:26,478 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-354 - Connecting to matrix.grin.hu:8448
2020-04-28 12:33:26,481 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-382 - Connecting to matrix.trustserv.de:8448
2020-04-28 12:33:26,491 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-253 - Connecting to jhammons.io:8449
2020-04-28 12:33:26,494 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-806 - Connecting to hackerfraternity.org:8448
2020-04-28 12:33:26,496 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-370 - Connecting to chat.eritonito.com:8448
2020-04-28 12:33:26,498 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-731 - Connecting to matrix.antonionapolitano.eu:8448
2020-04-28 12:33:26,505 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-373 - Connecting to tenpthree.net:8448
2020-04-28 12:33:26,508 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-983 - Connecting to matrix.1312.media:8448
2020-04-28 12:33:26,510 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-900 - Connecting to backstop.modular.im:443
2020-04-28 12:33:26,512 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-792 - Connecting to matrix.ffslfl.net:443
2020-04-28 12:33:26,518 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-535 - Connecting to nets.sh:8448
2020-04-28 12:33:26,520 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-317 - Connecting to heavner.dev:8448
2020-04-28 12:33:26,522 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-368 - Connecting to resplendent.company:8448
2020-04-28 12:33:26,524 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-518 - Connecting to matrix.jon02.ch:8448
2020-04-28 12:33:26,525 - twisted - 192 - INFO -  - Main loop terminated.

Has anyone else had problems with 1.12.4, before I look into it further and submit an upstream issue?

danjenkins commented 4 years ago

Another loop's log

2020-04-28 12:45:10,451 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-478 - Connecting to chat.python.nz:8448
2020-04-28 12:45:10,457 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-895 - Connecting to matrix.familie-vogdt.de:443
2020-04-28 12:45:10,459 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-893 - Connecting to rrra.bid:8448
2020-04-28 12:45:10,460 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-761 - Connecting to matrix.nilux.be:8448
2020-04-28 12:45:10,492 - synapse.http.matrixfederationclient - 434 - INFO - federation_transaction_transmission_loop-568 - Failed to send request: <Message id=11566 rCode=2 maxSize=0 flags=answer,recDes,recAv queries=[Query(b'_matrix._tcp.synapse.pantarhei.duckdns.org', 33, 1)]>
2020-04-28 12:45:10,493 - synapse.http.matrixfederationclient - 491 - WARNING - federation_transaction_transmission_loop-568 - {PUT-O-545} [synapse.pantarhei.duckdns.org] Request failed: PUT matrix://synapse.pantarhei.duckdns.org/_matrix/federation/v1/send/1588077218366: DNSServerError(<Message id=11566 rCode=2 maxSize=0 flags=answer,recDes,recAv queries=[Query(b'_matrix._tcp.synapse.pantarhei.duckdns.org', 33, 1)]>)
2020-04-28 12:45:10,495 - synapse.http.matrixfederationclient - 434 - INFO - federation_transaction_transmission_loop-479 - Failed to send request: <Message id=39429 rCode=2 maxSize=0 flags=answer,recDes,recAv queries=[Query(b'_matrix._tcp.matrix.catila.duckdns.org', 33, 1)]>
2020-04-28 12:45:10,510 - synapse.http.matrixfederationclient - 491 - WARNING - federation_transaction_transmission_loop-479 - {PUT-O-490} [matrix.catila.duckdns.org] Request failed: PUT matrix://matrix.catila.duckdns.org/_matrix/federation/v1/send/1588077218311: DNSServerError(<Message id=39429 rCode=2 maxSize=0 flags=answer,recDes,recAv queries=[Query(b'_matrix._tcp.matrix.catila.duckdns.org', 33, 1)]>)
2020-04-28 12:45:10,510 - twisted - 192 - INFO -  - Main loop terminated.
danjenkins commented 4 years ago

Humph, doesn't look like an issue with that version after all; I've rolled back to a previous version.

typokign commented 4 years ago

Hmm, not sure what's going on but it's probably not the federation spam. That happens whenever you're in a room with users from a homeserver that no longer exists (someone shut down their homeserver without leaving all public rooms first).

Can you kubectl describe the pod and see why it's restarting? Did the process crash or did the health check fail?
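
Something like this should show it (the pod name is a placeholder, and add -n <namespace> if Synapse isn't in the default namespace):

  # find the exact synapse pod name
  kubectl get pods
  # describe it and check the Events section at the bottom
  kubectl describe pod matrix-synapse-blahblahblah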

danjenkins commented 4 years ago

Does look like the liveness probe failed!

Events:
  Type     Reason     Age                     From                           Message
  ----     ------     ----                    ----                           -------
  Normal   Killing    100s (x38 over 7h51m)   kubelet, pool-8a97vo5tx-3nbno  Container synapse failed liveness probe, will be restarted
  Warning  Unhealthy  100s (x123 over 7h51m)  kubelet, pool-8a97vo5tx-3nbno  Liveness probe failed: Get http://10.244.0.161:8008/_matrix/static/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  99s (x128 over 7h51m)   kubelet, pool-8a97vo5tx-3nbno  Readiness probe failed: Get http://10.244.0.161:8008/_matrix/static/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Normal   Started    94s (x39 over 7h54m)    kubelet, pool-8a97vo5tx-3nbno  Started container synapse
  Normal   Pulled     94s (x39 over 7h54m)    kubelet, pool-8a97vo5tx-3nbno  Container image "matrixdotorg/synapse:v1.10.0" already present on machine
  Normal   Created    94s (x39 over 7h54m)    kubelet, pool-8a97vo5tx-3nbno  Created container synapse
  Warning  Unhealthy  91s (x24 over 7h54m)    kubelet, pool-8a97vo5tx-3nbno  Liveness probe failed: Get http://10.244.0.161:8008/_matrix/static/: dial tcp 10.244.0.161:8008: connect: connection refused
  Warning  Unhealthy  90s (x47 over 7h54m)    kubelet, pool-8a97vo5tx-3nbno  Readiness probe failed: Get http://10.244.0.161:8008/_matrix/static/: dial tcp 10.244.0.161:8008: connect: connection refused
danjenkins commented 4 years ago

    Port:           8008/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 28 Apr 2020 21:50:56 +0100
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 28 Apr 2020 21:47:55 +0100
      Finished:     Tue, 28 Apr 2020 21:50:56 +0100
    Ready:          True
    Restart Count:  38
    Liveness:       http-get http://:http/_matrix/static/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/_matrix/static/ delay=0s timeout=1s period=10s #success=1 #failure=3
typokign commented 4 years ago

Thanks, that's good to know. Let's try this (rough commands for these steps are sketched below the list):

  1. kubectl edit the deployment and remove the liveness/readiness probes for a moment
  2. kubectl port-forward matrix-synapse-blahblahblah 8008
  3. Try to manually hit localhost:8008 and localhost:8008/_matrix/static/
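
Something like the following, assuming the deployment is named matrix-synapse and reusing the pod-name placeholder from step 2 (check kubectl get deployments and kubectl get pods for the real names):

  # 1. remove the livenessProbe/readinessProbe blocks from the synapse container spec
  kubectl edit deployment matrix-synapse
  # 2. forward local port 8008 to the synapse pod
  kubectl port-forward matrix-synapse-blahblahblah 8008
  # 3. in another terminal, hit the endpoints the probes use
  curl -v http://localhost:8008/
  curl -v http://localhost:8008/_matrix/static/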

If you can hit the endpoints manually, leave the probes disabled and see if you can use the homeserver as normal. We can keep looking into why the probes are failing but that should give you a temporary fix.

If you can't reach the endpoints or the homeserver still doesn't work with the probes disabled, then we can take a closer look at your logs and see if it's an issue with the chart or with Synapse.

Thanks!

danjenkins commented 4 years ago

Thanks @dacruz21! I need to bump the docker image again, so I'll just comment the probes out in the deployment file for now (still using my edited fork at the moment).

danjenkins commented 4 years ago

So currently I can hit it. Will report back on how it behaves tomorrow!

danjenkins commented 4 years ago

So it seems the container hasn't restarted itself, and I've been able to chat to people fine for a long period of time...

typokign commented 4 years ago

Hey @danjenkins, I've bumped the default timeout for the Synapse health probes. Can you update to 2.1.0 and let me know if you're still getting this issue? Feel free to raise/lower them further in synapse.probes; 5 seconds is probably more generous than you want to be, but if you're on some old hardware, then maybe it's not generous enough :)
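
For anyone who wants to override them, something along these lines should work; the release and chart references are placeholders, and the exact keys under synapse.probes depend on the chart's values.yaml, so double-check there first:

  # assumed keys: synapse.probes.liveness/readiness.timeoutSeconds may be named differently in values.yaml
  helm repo update
  helm upgrade <release-name> <chart-repo>/<chart-name> --version 2.1.0 \
    --set synapse.probes.liveness.timeoutSeconds=10 \
    --set synapse.probes.readiness.timeoutSeconds=10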

Routhinator commented 4 years ago

@dacruz21 Circling back on the timeout issue: 5 seconds doesn't seem to be generous enough. It's much better, but I'm still averaging 5 restarts in 24 hours due to health check failures.