Project-wide error reporting solution

heyhippari commented 3 years ago

This has been the subject of (very heated) team discussions before, but I would like to officially bring it to the table in a public fashion, as I feel this is an important thing for client quality going forwards, especially with things like Roku, WebOS, etc.

The problem

Some of our clients don't have facilities for error reporting or getting logs. The biggest offenders are, of course, TV-based clients (Roku, WebOS, Tizen, etc).

This can lead to both quality issues and frustrations for both developers and users:

It's hard to debug those platforms.
Users often don't know how to debug those platforms, needing to make custom builds or enable developer options, setup SDKs and whatnot.
Developers have to spend a lot of time guiding users to help them get proper debugging information
Issues are often lackluster in information density and quality for bugs, leading to them being mostly ignored because impossible to reproduce.

In a more general sense, the quality of the majority of issues has been lackluster, with no/wrong logs attached, lacking information, etc. This makes it difficult to figure out duplicates, properly triage and, most importantly, fix them.

The current state of things

At least Android TV is using a custom solution for crash reporting already (Seemingly ACRA?) (~~Which I think is enabled by default, looking at the code, but I'm not an Android user or developer, so this might be wrong~~ It is in fact opt-in, my bad).

This leaves some parts of the project in a weird spot: we don't have a way to get errors or crashes, and each client would have to roll its own solution if we want that (For example, we've made an on-screen developer console for WebOS before).

In my opinion, this is sub-optimal for a few reasons:

It leads to more infrastructure If every project rolls with their own platform or way of doing things, we may quickly end up with 5-6 different error reporting solutions or ways of getting proper reports.
It's a bad user experience If a user reports a bug and we spend 6 hours getting them to install a WebOS SDK, setup Node.js, enable the on-screen console and building+deploying the clients, it's a bad experience for everybody involved
Software quality suffers Without a good way of getting error reports, we'll constantly have quality issues. Stuff will break, users will be angry, we'll be frustrated and, in general, the project will be worse for it. Yes, GitHub issues exist. No, most people will never fill a GitHub issue properly, or even at all.

The proposed solution

I have pushed this a lot internally, across a lot of very heated discussions, but I still think having a project-wide error reporting solution would be a great benefit to the project as a whole.

One idea was to add an API endpoint to allow clients to report errors to the server. As was discussed on team channels, and agreed with by a lot of team members, this is an impractical solution for a few reasons:

It's not the server's job to handle client errors
It still puts the onus of error reporting on the server admin. We'd like to avoid that, because users may not always be admins or even have regular/cooperating admins
A large number of clients reporting a lot of errors could bring the server down or impact performance

So, essentially, that solution doesn't solve any of the parts of the problems, along with seeming a bit half-assed and suboptimal.

The best solution, in my opinion, would be to adopt a proven error-reporting framework, such as Sentry or GlitchTip (Note: there might be other platforms/frameworks and this isn't an endorsement of either).

This gives us one thing to keep track of in the entire project, no matter the client or the technology (At least both of the above solutions use the open source Sentry reporters, and are available for pretty much everything, or fairly easy to implement if they're not).

The raised issues

During internal discussions, a few issues were raised about this proposal:

Privacy

Obviously, one of our strong points is that we don't have centralized infrastructure and don't phone back home (At least in a way that can't be disabled. See plugins as an example of us having a centralized service that phones back to us, which users can disable).

Concern was raised that error/crash reporting would be seen as invading privacy and going against one of our main goals.

I think this is seeing the issue in a very narrow way. Not all crash reporting is negative or invades privacy. We can easily sidestep the "no phoning home" issue by making it opt-in (if this is really something the community at large cares about).

I would wager most people have no issues with crash reporting if it is done in a transparent, non-invasive way, and it opt-in instead of opt-out (~~But even then, Android TV's crash reporting seems to be opt-out, and we have had exactly 0 complaints against that, including in the team~~ Android TV is opt-in).

More infrastructure

Yes, this would probably require some amount of work from the team members handling infrastructure, if we go with the preferred self-hosting route.

I would argue that the benefit outweighs it, and I don't think our infra is really a large amount of work at the moment, for the most part.

We don't run a lot currently. Essentially (if I'm not missing anything):

Our website is hosted by Github Pages
So is our documentation
We host Mirrorbits and a few file servers around the world
We host Weblate
We host Feathub
We have a reverse proxy for all of it

I don't think it'd be too much to add one more thing to it, especially since it brings a lot of potential improvements to the project. Other open source projects of our size have a lot more infrastructure than we do, and do just fine.

Conclusion

A project-wide crash reporting solution would be a net benefit for both users and contributors/team members. It has the potential to improve our client quality, reduce user friction, improve the quality of issues/bug reports and help with triage (which has been a sore issue lately).

Valid issues were raised, which warrant a wider discussion and, while the project leader was strongly opposed to this, I really think it would be a bad call for the project in the mid-to-long term to be so strict about this.

I would like to ask for comments from both the team at large, and our community (both users and contributors), and push this discussion forward a bit.

nielsvanvelzen commented 3 years ago

The current implementation for Android TV is using ACRA and tracepot.com as a backend. The reporting is opt-in and by default shows a dialog to the user to ask if we're allowed to upload crash logs.

There are a few issues with our current implementation:

tracepot.com is EOL (end of life), a direct quote from the website:

We no longer accept new registrations. We make our best effort to keep the service running for existing users but you may consider this service as abandoned and may shutdown without notice.
The dialog does not work properly with dpad controls, making it hard for a user to upload the crashlogs or close the dialog
The dialog can disappear before the user takes actions due too it being shown in the StartupActivity which forwards to the MainActivity after ~3 seconds when opening the app.
The crash reports don't always contain enough information, when a user is playing media it would be nice to add information about that media (codecs, used video player etc.)
It's not possible to view all crash reports per device in tracepot, sometimes a crash arises from a previous crash so viewing crashes per device would help with this

The endpoint in 10.8 would allow us to upload stuff to the server which makes it easier for users to view and share the crash reports but like you said it has it's limitations. One additional issue I see with it is the fact that the app needs an active server+user session. If a crash happens in our login code or similar we wouldn't be able to report it.

I'm in favor of using a single, self-hosted, tool for crash reporting but it should be opt-in and we might want to allow servers to disable it for all their users. An requirement for a potential tool is that it needs open-source libraries for the clients so we can still publish to F-Droid (which does not allow proprietary code in apps).

Useful links:

heyhippari commented 3 years ago

Sentry's reporters (Used by both Sentry itself and GlitchTip) are all MIT or BSD-2-clause licensed and fully open source:

https://github.com/getsentry/sentry-java (There's a separate experimental Kotlin Muiltiplatform one)
https://github.com/getsentry/sentry-javascript (There's a separate browser one)
https://github.com/getsentry/sentry-python
https://github.com/getsentry/sentry-dotnet
https://github.com/getsentry/sentry-react-native
https://github.com/getsentry/sentry-native (C/C++)
https://github.com/getsentry/sentry-cocoa (With support for iOS, watchOS, tvOS and macOS)

I think they shouldn't prevent publishing to Fdroid?

mcarlton00 commented 3 years ago

I don't think it will be possible to replicate this, but as another data point I'll throw out how Kodi handles this. They have a separate addon that's used to upload the current log file to an endpoint. https://kodi.wiki/view/Log_file/Easy

I don't see any way that we could copy this exact behavior for all of our various clients, but it may give somebody else another idea of how to make something work for our stuff.

iwalton3 commented 3 years ago

I know there are some applications that have an "upload logs" button that will upload the logs and possibly give the user a link they can share for diagnostics. Often for the clients I maintain, I have to instruct people on how to find and upload the logs, and it can be annoying at times.

One challenge is some error cases prevent the application from launching, which would prevent access to the log upload feature. Also, the logs do have API keys removed by default, but they do not censor server addresses which some users want to remove, which could be a concern.

I think an automatic crash reporting dialog is fine as long as:

The user gets a confirmation before anything is sent
The user can view whatever would be sent before sending it

I am aware that there are some platforms where is is impossible to satisfy that criteria, which may require further discussion.

heyhippari commented 3 years ago

I don't see any way that we could copy this exact behavior for all of our various clients, but it may give somebody else another idea of how to make something work for our stuff.

The main issue is, as @iwalton3 mentions, some errors prevent the launch at all and some errors may happen before any server is selected.

One challenge is some error cases prevent the application from launching, which would prevent access to the log upload feature. Also, the logs do have API keys removed by default, but they do not censor server addresses which some users want to remove, which could be a concern.

The Sentry frameworks (even with GlitchTip) handle this by being setup at the very very start of the application and intercepting every unhandled error.

The SDKs also have data scrubbing facilities, for example: https://docs.sentry.io/platforms/javascript/data-management/sensitive-data/#scrubbing-data

With this, we can remove sensitive information before its even sent to us.

I think an automatic crash reporting dialog is fine as long as

Both of these are doable, reporting doesn't have to be automatic.

crobibero commented 1 year ago

This is available in 10.8+ with the POST /ClientLog/Document endpoint

https://api.jellyfin.org/#tag/ClientLog/operation/LogFile

jellyfin / jellyfin-meta