matrix-org / sygnal

Sygnal: reference Push Gateway for Matrix
Apache License 2.0
160 stars 144 forks source link

Investigate IndexError in gcm's `_dispatch_notification_unlimited` #255

Closed DMRobertson closed 2 years ago

DMRobertson commented 2 years ago

(internal users only:) https://sentry.matrix.org/sentry/sygnal-matrixorg/issues/201296/?query=is%3Aunresolved

The list comprehension that builds pushkeys might produce an empty list, in which case we get the index error as reported on sentry.

This logic seems to have been here since eeea1a71a1aa70e629a05f5d85e6f7d7c3406c7a in Fed 2015. The first sentry report is from Mar 6, 2021.

Questions:

reivilibre commented 2 years ago

I saw this pop up in my e-mails.

It doesn't make sense for someone to send a push notification to zero pushkeys but I guess someone has accidentally done that anyway.

It would probably be sensible to catch this case and either 200 it without doing anything (OK! I've done nothing as you requested) or 400 Bad Request it.

Technically the spec doesn't say that there should be at least 1 device, but the call doesn't make any sense if you don't specify a device, so either option seems reasonable to me (though perhaps 400 Bad Request is a clearer error for when an implementation gets it wrong and is trying to figure out why push doesn't work).

DMRobertson commented 2 years ago

Assuming that the pushkey is part of the HTTP request, I agree that 400 "You messed up" is the right response here.

But having taken another look I'm not sure how we actually end up in this situation. My reading: for each device d in notif.devices we check that exactly one pushkin exists matching its the devices's app_id.

https://github.com/matrix-org/sygnal/blob/ff1e98ed494cf388606c4dd33eb38a0f67593cc2/sygnal/http.py#L243-L269

We then dispatch to the single device d via dispatch_notification and _dispatch_notification_unlimited. By the time we're in the latter function, I think it should always be the case that the device argument to _dispatch_notification_unlimited should always satisfy device.app_id == self.name. And so I think this list comprehension should never be empty?

https://github.com/matrix-org/sygnal/blob/d52568e248a00015e6759d354bc80205f3069709/sygnal/gcmpushkin.py#L300-L305

If we are dispatching to just a single device, I don't understand why we are even looking at n.devices in the first place (and I couldn't see other pushkins doing something similar). Can't we just use the single pushkey device.pushkey, made into a list of length 1 if needbe?

DMRobertson commented 2 years ago

By the time we're in the latter function, I think it should always be the case that the device argument to _dispatch_notification_unlimited should always satisfy device.app_id == self.name.

Hmm. Is this assumption correct? I suppose it's possible that the pushkin's name contained a *, and so so the appid wasn't an exact match but a glob/regex match:

https://github.com/matrix-org/sygnal/blob/ff1e98ed494cf388606c4dd33eb38a0f67593cc2/sygnal/http.py#L205-L227

clokep commented 2 years ago

If we are dispatching to just a single device, I don't understand why we are even looking at n.devices in the first place (and I couldn't see other pushkins doing something similar). Can't we just use the single pushkey device.pushkey, made into a list of length 1 if needbe?

I believe the protocol allows you to send multiple devices at once to the push server, but Synapse doesn't do this. Sygnal seems to support it (or try to, it probably doesn't get much testing since Synapse doesn't support it)!

reivilibre commented 2 years ago

I think you've spotted something inconsistent here.

For context, the GCM pushkin used to (still does?) pull out all devices that match its app ID, and it would deliver them to GCM in one go.

It looks like we have since introduced app ID globbing. (I was not aware that this had been added.) It looks like that may be the cause of this problem (because it interacts badly with the GCM pushkin's attempt to deliver all at once, assuming we use it (and somehow didn't notice that it doesn't work — ?).

squahtx commented 2 years ago

In this particular case, the device's app ID is com.im.vector.app.android, while the pushkin's app ID is im.vector.app.android.

There are a few things that aren't quite right here:

  1. find_pushkins treats .s in pushkin app IDs as wildcards instead of literal .s. sygnal.utils.glob_to_regex looks handy.
  2. find_pushkins matches any substring of the device app ID, rather than the whole app ID
    • But fixing this would mean notifications for the com.im.vector.app.android app ID wouldn't work?
      • Notifications for the com.im.vector.app.android app ID currently don't work anyway!
      • Where did com.im.vector.app.android come from? It's im.vector.app.android in the current Element Android app.
  3. GcmPushkin._dispatch_notification_unlimited is not glob-aware, as @reivilibre points out. sygnal.utils.glob_to_regex looks handy.
  4. The way GcmPushkin tries to do a batch send from within a non-batch API seems quite hacky? Perhaps some API redesign/refactoring is needed?
squahtx commented 2 years ago

Fixed by #281, #269 and #270.