uNetworking / uSockets

Miniscule cross-platform eventing, networking & crypto for async applications
Apache License 2.0
1.29k stars 268 forks source link

Expose error codes #61

Closed snej closed 4 years ago

snej commented 5 years ago

The API does not expose the details of an error, e.g. error codes.

Many use cases need to know the exact reason for a connection failure, for recovery and logging and to inform the user. (The library I work on uses these codes to decide whether to retry a connection. And my employer’s support team is often found poring over error logs submitted by developers using our library, to help them figure out why the client won’t connect to the 🤬 server.)

Error codes are problematic because there are multiple namespaces (errno, gethostbyname status, OpenSSL, mbedTLS, Apple OSStatus, etc.) What I tend to do is define an enumeration covering all the namespaces, then make my error type a {namespace, code} tuple. But defining your own namespace with an enum of your own error codes works too...

(This is another task I’m willing & likely to work on, if I end up choosing uWebSockets for this work project.)

ghost commented 5 years ago

25 Related,

TCP layer doesn't have errors because out of 19 different return values, only one or two may actually happen. Those one or two different actual errors are essentially ECONNRESET; something you can catch by checking if you get a proper on_end callback before on_close.

So even if you would use the Linux syscalls directly, there never were a way to get "the exact reason". And there is no need to.

Sure, as bug 25 reports, we have issues with connect errors and yes they need to be improved. But even here there are only a handful of actual errors.

TLS doesn't have errors because TCP doesn't. You kind of have the same reasoning here: if you do not get on_end but only get on_close thensomething went wrong either protocol error or reset.

Now, the Linux kernel can see protocol errors for TCP just like openssl can see protocol errors for TLS. Linux will just report all protocol errors as ECONNRESET or EPOLLHUP or any such generic failure.

You'll never get super detailed errors such as "TCP handskar failed".

Therefore this lib makes it I've of the design goals not to emit pointless errors nobody understands anyways. You either get on_end for a graceful non-error shutdown or you get only on_close when something went wrong.

Use strace to log and debug.

ghost commented 5 years ago

Under key aspects on readme:

Minimal yet truly cross-platform, will not emit a billion different platform specific error codes.

ghost commented 5 years ago

You can do what you want in the plug-in, if you want to log things do it. You could define a function like register_error_handler which takes a c function then just declare it in your program and use it. Nothing is stopping you from doing that, but the public API should not have detailed errors.

Maybe we can add an integer to on_close which may or may not hold extra info. Kind of like a reserved thing for use by those implementations who care

snej commented 5 years ago

You have a point about error codes being pretty much irrelevant when an open socket terminates. But when connecting, there are several distinct error conditions we [and probably many apps / higher level libraries] care about:

I'm not saying we have different logic for each one of these, but we use these as heuristics for whether to retry immediately, or treat the client as "offline" and wait for a network interface to go up, or give up and report an error to the user. (Some of this complexity comes from running on mobile devices, which find themselves in changing and unstable network conditions.)

Use strace to log and debug.

That sort of thing works when developers are testing in-house, but most troubleshooting is forensic, based on logs from customer devices (iOS and Android mostly.) Also, to be honest, a lot of our enterprise customers employ developers who are not super knowledgeable at anything below the app-framework layers of the OS, so it's for the best if we can tell them to just turn on our logging, instead of having to explain what system-tracing is and how to run it. (You may roll your eyes, but our support engineers are very emphatic about this, especially after a few beers.)

ghost commented 5 years ago

Connecting is in a pretty bad shape in this library - I've mainly focused on server side.

I agree connecting is a special case where you might want at least a few possible errors.

Problem is the API need to be symmetric and I do not want to introduce on_error as that makes less sense for server side.

What I've thought about is passing error to on_open. Why on_open? Why not on_close?

Because if you pass the error to on_close you cannot know if you were closed from open state or closed immediately, so you cannot know if you have run a proper on_open before.

So it could make sense to have on_open take an integer error or even uintptr_t or something like that. Then you basically have to check if it was opened without error or if you opened with an error.

Maybe it makes sense to rename on_open to on_try_open?

snej commented 5 years ago

if you pass the error to on_close you cannot know if you were closed from open state or closed immediately, so you cannot know if you have run a proper on_open before.

Can't you [the client code] set a flag in on_open and then test it in on_close? Same advice you gave above about detecting disconnects (just with on_open instead of on_end.)

snej commented 5 years ago

I've experimentally added the int error parameter to on_open. IMHO, on_open is looking pretty messy since it has some parameters only for server-side and another only for client-side, and a flag to tell which mode — to me this looks like a strong code smell for splitting it into two callbacks, one for client and one for server. But you've mentioned a necessity of keeping the API symmetrical, which I don't understand but I'll take it as a given.

ghost commented 5 years ago

Both client and server get the IP and IP length. Is_client is kind of a helper, you could have two different contexts if you want to separate them

ghost commented 5 years ago

Hmm maybe not

ghost commented 5 years ago

Also closing due to inactivity

ghost commented 5 years ago

@snej I know you're not interested anymore but this is how I want to solve it:

Timeout errors are up to you to track with us_socket_timeout calls

snej commented 5 years ago

Sorry I vanished — I decided that for our architecture we don't need to do multiplexed or async socket I/O, since this is a client that usually only opens one socket and will never open more than a handful. So I've gone with a simpler approach that just spawns a read thread and a write thread, instead of using more complex 3rd party libraries with async I/O.

ghost commented 4 years ago

The API now has int code and void *reason to its on_close and close functions. This allows passing TCP/TLS error codes 👍