jellyfin / jellyfin-meta

A repository to hold our roadmap, policies, and more.
25 stars 4 forks source link

Project-wide error reporting solution #20

Closed heyhippari closed 1 year ago

heyhippari commented 3 years ago

This has been the subject of (very heated) team discussions before, but I would like to officially bring it to the table in a public fashion, as I feel this is an important thing for client quality going forwards, especially with things like Roku, WebOS, etc.

The problem

Some of our clients don't have facilities for error reporting or getting logs. The biggest offenders are, of course, TV-based clients (Roku, WebOS, Tizen, etc).

This can lead to both quality issues and frustrations for both developers and users:

In a more general sense, the quality of the majority of issues has been lackluster, with no/wrong logs attached, lacking information, etc. This makes it difficult to figure out duplicates, properly triage and, most importantly, fix them.

The current state of things

At least Android TV is using a custom solution for crash reporting already (Seemingly ACRA?) (Which I think is enabled by default, looking at the code, but I'm not an Android user or developer, so this might be wrong It is in fact opt-in, my bad).

This leaves some parts of the project in a weird spot: we don't have a way to get errors or crashes, and each client would have to roll its own solution if we want that (For example, we've made an on-screen developer console for WebOS before).

In my opinion, this is sub-optimal for a few reasons:

The proposed solution

I have pushed this a lot internally, across a lot of very heated discussions, but I still think having a project-wide error reporting solution would be a great benefit to the project as a whole.

One idea was to add an API endpoint to allow clients to report errors to the server. As was discussed on team channels, and agreed with by a lot of team members, this is an impractical solution for a few reasons:

So, essentially, that solution doesn't solve any of the parts of the problems, along with seeming a bit half-assed and suboptimal.

The best solution, in my opinion, would be to adopt a proven error-reporting framework, such as Sentry or GlitchTip (Note: there might be other platforms/frameworks and this isn't an endorsement of either).

This gives us one thing to keep track of in the entire project, no matter the client or the technology (At least both of the above solutions use the open source Sentry reporters, and are available for pretty much everything, or fairly easy to implement if they're not).

The raised issues

During internal discussions, a few issues were raised about this proposal:

Privacy

Obviously, one of our strong points is that we don't have centralized infrastructure and don't phone back home (At least in a way that can't be disabled. See plugins as an example of us having a centralized service that phones back to us, which users can disable).

Concern was raised that error/crash reporting would be seen as invading privacy and going against one of our main goals.

I think this is seeing the issue in a very narrow way. Not all crash reporting is negative or invades privacy. We can easily sidestep the "no phoning home" issue by making it opt-in (if this is really something the community at large cares about).

I would wager most people have no issues with crash reporting if it is done in a transparent, non-invasive way, and it opt-in instead of opt-out (But even then, Android TV's crash reporting seems to be opt-out, and we have had exactly 0 complaints against that, including in the team Android TV is opt-in).

More infrastructure

Yes, this would probably require some amount of work from the team members handling infrastructure, if we go with the preferred self-hosting route.

I would argue that the benefit outweighs it, and I don't think our infra is really a large amount of work at the moment, for the most part.

We don't run a lot currently. Essentially (if I'm not missing anything):

I don't think it'd be too much to add one more thing to it, especially since it brings a lot of potential improvements to the project. Other open source projects of our size have a lot more infrastructure than we do, and do just fine.

Conclusion

A project-wide crash reporting solution would be a net benefit for both users and contributors/team members. It has the potential to improve our client quality, reduce user friction, improve the quality of issues/bug reports and help with triage (which has been a sore issue lately).

Valid issues were raised, which warrant a wider discussion and, while the project leader was strongly opposed to this, I really think it would be a bad call for the project in the mid-to-long term to be so strict about this.

I would like to ask for comments from both the team at large, and our community (both users and contributors), and push this discussion forward a bit.

nielsvanvelzen commented 3 years ago

The current implementation for Android TV is using ACRA and tracepot.com as a backend. The reporting is opt-in and by default shows a dialog to the user to ask if we're allowed to upload crash logs.

There are a few issues with our current implementation:

The endpoint in 10.8 would allow us to upload stuff to the server which makes it easier for users to view and share the crash reports but like you said it has it's limitations. One additional issue I see with it is the fact that the app needs an active server+user session. If a crash happens in our login code or similar we wouldn't be able to report it.

I'm in favor of using a single, self-hosted, tool for crash reporting but it should be opt-in and we might want to allow servers to disable it for all their users. An requirement for a potential tool is that it needs open-source libraries for the clients so we can still publish to F-Droid (which does not allow proprietary code in apps).

Useful links:

heyhippari commented 3 years ago

Sentry's reporters (Used by both Sentry itself and GlitchTip) are all MIT or BSD-2-clause licensed and fully open source:

I think they shouldn't prevent publishing to Fdroid?

mcarlton00 commented 3 years ago

I don't think it will be possible to replicate this, but as another data point I'll throw out how Kodi handles this. They have a separate addon that's used to upload the current log file to an endpoint. https://kodi.wiki/view/Log_file/Easy

I don't see any way that we could copy this exact behavior for all of our various clients, but it may give somebody else another idea of how to make something work for our stuff.

iwalton3 commented 3 years ago

I know there are some applications that have an "upload logs" button that will upload the logs and possibly give the user a link they can share for diagnostics. Often for the clients I maintain, I have to instruct people on how to find and upload the logs, and it can be annoying at times.

One challenge is some error cases prevent the application from launching, which would prevent access to the log upload feature. Also, the logs do have API keys removed by default, but they do not censor server addresses which some users want to remove, which could be a concern.

I think an automatic crash reporting dialog is fine as long as:

I am aware that there are some platforms where is is impossible to satisfy that criteria, which may require further discussion.

heyhippari commented 3 years ago

I don't see any way that we could copy this exact behavior for all of our various clients, but it may give somebody else another idea of how to make something work for our stuff.

The main issue is, as @iwalton3 mentions, some errors prevent the launch at all and some errors may happen before any server is selected.

One challenge is some error cases prevent the application from launching, which would prevent access to the log upload feature. Also, the logs do have API keys removed by default, but they do not censor server addresses which some users want to remove, which could be a concern.

The Sentry frameworks (even with GlitchTip) handle this by being setup at the very very start of the application and intercepting every unhandled error.

The SDKs also have data scrubbing facilities, for example: https://docs.sentry.io/platforms/javascript/data-management/sensitive-data/#scrubbing-data

With this, we can remove sensitive information before its even sent to us.

I think an automatic crash reporting dialog is fine as long as

Both of these are doable, reporting doesn't have to be automatic.

crobibero commented 1 year ago

This is available in 10.8+ with the POST /ClientLog/Document endpoint

https://api.jellyfin.org/#tag/ClientLog/operation/LogFile