pubnub / dart

PubNub Dart SDK
Other
28 stars 15 forks source link

Handle more socket exceptions #65

Closed willbryant closed 2 years ago

willbryant commented 3 years ago

As per #63 unhandled exceptions are a pretty significant problem right now, so we've been trying to identify and add diagnostics for all the errors we see in production. It's common for us because our app is used for deliveries and drivers regularly experience temporary comms problems as they go around, particularly in less populated hilly areas.

Happily the library allows us to extend the built-in detection. But, I think everyone is going to have the same problems as us, so I'd like to suggest a much broader set of diagnostics.

Hoping that you would upstream them later, we started out by extending the list of exceptions using exactly the same exact-match methodology as the library itself, a bit like these ones:

  RegExp(r'^SocketException: OS Error: Software caused connection abort, errno = 103, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: Connection failed \(OS Error: Host is down, errno = 64\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),

This looked like this:

  RegExp(r'^SocketException: Write failed \(OS Error: Software caused connection abort, errno = 103\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: Read failed \(OS Error: Software caused connection abort, errno = 103\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: Connection reset by peer, errno = 104, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: Connection refused, errno = 61, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: No route to host, errno = 113, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  // see https://developer.apple.com/library/archive/technotes/tn2277/_index.html#//apple_ref/doc/uid/DTS40010841-CH1-SUBSECTION3 for this one - this is expected unlike a normal Unix EBADF:
  RegExp(r'^SocketException: OS Error: Bad file descriptor, errno = 9, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: Connection timed out, errno = 110, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: Connection failed \(OS Error: Network is unreachable, errno = 101\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r"^SocketException: Failed host lookup: '([a-zA-Z0-9\-\.]+)' \(OS Error: nodename nor servname provided, or not known, errno = 8\)$"): (match) =>
      HostLookupFailedDiagnostic(match?.group(1)),
  RegExp(r"^SocketException: Failed host lookup: '([a-zA-Z0-9\-\.]+)' \(OS Error: No address associated with hostname, errno = 7\)$"): (match) =>
      HostLookupFailedDiagnostic(match?.group(1)),
  RegExp(r'^SocketException: OS Error: Operation timed out'): (match) => UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HttpException: Write failed, uri ='): (match) =>
      UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HttpException: Connection closed before full header was received, uri ='): (match) =>
      UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HandshakeException: Connection terminated during handshake$'): (match) =>
      UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HandshakeException: Handshake error in client'): (match) =>
      UnknownHttpExceptionDiagnostic(),

Already you can see the list is getting pretty large. But in prod we found that we were getting many of the same errors all over again - except that they had different OS errno codes.

This is because Android runs on a Linux-derived kernel and iOS on a Darwin-derived kernel. These two have both used the BSD error constants but the values are different. So I started out by adding the two variants:

  RegExp(r'^SocketException: Write failed \(OS Error: Software caused connection abort, errno = (103|53)\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),

But in general, these are compile-time constants, but are not portable across operating systems. IMHO it makes no sense to hardcode them. I think there's no need to bring this in, so I suggest moving to this style:

  RegExp(r'^SocketException: Write failed \(OS Error: Software caused connection abort, errno = [0-9]+\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: Read failed \(OS Error: Software caused connection abort, errno = [0-9]+\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: Connection reset by peer, errno = [0-9]+, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: Connection refused, errno = [0-9]+, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: No route to host, errno = [0-9]+, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  // see https://developer.apple.com/library/archive/technotes/tn2277/_index.html#//apple_ref/doc/uid/DTS40010841-CH1-SUBSECTION3 for this one - this is expected unlike a normal Unix EBADF:
  RegExp(r'^SocketException: OS Error: Bad file descriptor, errno = 9, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: OS Error: Connection timed out, errno = [0-9]+, address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r'^SocketException: Connection failed \(OS Error: Network is unreachable, errno = [0-9]+\), address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r"^SocketException: Failed host lookup: '([a-zA-Z0-9\-\.]+)' \(OS Error: nodename nor servname provided, or not known, errno = 8\)$"): (match) =>
      HostLookupFailedDiagnostic(match?.group(1)),
  RegExp(r"^SocketException: Failed host lookup: '([a-zA-Z0-9\-\.]+)' \(OS Error: No address associated with hostname, errno = 7\)$"): (match) =>
      HostLookupFailedDiagnostic(match?.group(1)),
  RegExp(r'^SocketException: OS Error: Operation timed out'): (match) => UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HttpException: Write failed, uri ='): (match) =>
      UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HttpException: Connection closed before full header was received, uri ='): (match) =>
      UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HandshakeException: Connection terminated during handshake$'): (match) =>
      UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HandshakeException: Handshake error in client'): (match) =>
      UnknownHttpExceptionDiagnostic(),

But although incrementally better, I think this is still an unsustainable approach, for a few reasons:

  1. Looking at the list of OS errors you can get from socket functions, there's still a bunch that I haven't handled in the list, which I assume are rarer but still possible
  2. I'm still seeing some exceptions in production that look like they're the same underlying error, but they have a different string at the start, for example "RequestOtherException: request failed (SocketException: OS Error: Network is unreachable, errno = 101, address = ps.pndsn.com, port = 33008)"
  3. Even some of those errors that in C would clearly indicate a programming error rather than a network fault, such as EBADF, are actually used for normal error conditions now (iOS uses them for connections aborted because the application was backgrounded).
  4. I suspect that if the app is running in a non-C locale, the error messages will be localised. I haven't tested whether the Android userland, iOS userland, or Dart VM themselves actually do run the app in non-C locales (and we won't see it in prod for our app as our drivers are all using English), but it is certainly a problem in C programs in general - you have to use the error codes rather than rely on the messages.

So, long story short, I don't think it makes sense to keep playing whack-a-mole and coding in these detailed exception strings.

Is there any reason not to just capture all SocketException, HttpException, and HandshakeException errors and diagnose them all as temporary?

Something like this:

  RegExp(r'^SocketException: .* address = ([a-zA-Z0-9\-\.]+), port = ([0-9]+)$'):
      (match) => HostIsDownDiagnostic(
          match?.group(1), int.tryParse(match?.group(2) ?? '')),
  RegExp(r"^SocketException: Failed host lookup: '([a-zA-Z0-9\-\.]+)'"):
      (match) => HostLookupFailedDiagnostic(match?.group(1)),
  RegExp(r'^SocketException:'): (match) => UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HttpException:'): (match) => UnknownHttpExceptionDiagnostic(),
  RegExp(r'^HandshakeException:'): (match) => UnknownHttpExceptionDiagnostic(),

Perhaps even use normal exception class tests instead of string matching?

willbryant commented 3 years ago

Also getting some exceptions like this:

RequestOtherException: request failed (OS Error: Software caused connection abort, errno = 103)

Note there's no "SocketException:" at the start of the bit in brackets - compare it to these ones:

RequestOtherException: request failed (SocketException: OS Error: Software caused connection abort, errno = 103, address = ps.pndsn.com, port = 46882)

RequestOtherException: request failed (SocketException: Read failed (OS Error: Software caused connection abort, errno = 103), address = ps.pndsn.com, port = 49218)

Same app version and Android version. No idea why that's coming through differently, but it's making it hard to match properly.

are commented 3 years ago

Hi there! Thanks again for such an extensive insight into the issues you report! It doesn't happen often that we get such detailed and in-depth issue reports. I really (really!) appreciate it!

So a little bit of history to explain existing diagnostics - during development and testing I've noticed that there are tons of different networking exceptions that the Dart SDK reports and it was hard to find and cover all matching cases. We decided to add some initial ones and hope that with time and people reporting issues we would extend this (so the whack-a-mole was kind of planned).

What we didn't expect is the sheer amount of different types of network failures that the system reports and no standardized way that Dart presents them to us, so clearly we need a better solution for this. What you've done in the last example is closer to what we could do.

Additionally, in your own code you can totally match exception classes using type system and type checks instead of string matching - that is actually faster and more reliable since you can just check if an exception is HttpException or SocketException and so on (and those actually have fields that you can further use). We couldn't do that directly in the SDK - as an SDK we are platform independent and when those diagnostics were written we were relying on an external networking library (dio) to handle requests. Because of this we couldn't import exception class types from dart:io because that would limit our ability to publish the code as JavaScript compatible. But in your own application code you can handle that easier so I think its totally feasible and worthwile for you to just check for class types instead.

About the second part - I have no idea why those exceptions are sometimes "wrapped" - I assume this is because of how the underlying system tries to handle those exceptions - in some cases it may try to resolve it on its own without passing the exception up but it still fails so it wraps it and propagates further - no idea. In any case, matching by class should be enough to solve this part of the issue!

Let me know if you have any further questions. I'm going to leave this issue open and pin it so other people can take a look at this, report more exception types. I'm also adding it to feature requests so we can handle those exceptions better by default.

Thanks again!

github-actions[bot] commented 2 years ago

@willbryant this issue is addressed in v4.0.0