SPEC: API observability

guenp commented 2 weeks ago

Brainstorm and discuss a SPEC that establishes how to add instrumentation and telemetry to Scientific Python projects to gain insight into usage patterns. Currently, Python projects typically have no direct insight into how users interact with their library, which errors users commonly run into, or which APIs are most (or least) frequently used. The goal of this SPEC is to design a way to collect usage logs from users in a transparent, ethical and efficient way, with minimal impact on user experience. These logs should give project maintainers useful metrics and insights into the performance, usability and common user errors of components (modules, functions, etc.) in their library.
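To make the idea concrete, here is a minimal sketch of what opt-in instrumentation of a single function might look like. Everything in it is an assumption for illustration only: the `EXAMPLE_TELEMETRY_OPT_IN` environment variable, the `observed` decorator and the in-memory counters are hypothetical names, not part of any existing SPEC or library. It records only call counts and exception type names, never arguments or return values, and does nothing unless the user has explicitly opted in.

```python
import functools
import os
from collections import Counter

# Hypothetical opt-in switch: nothing is recorded unless the user sets it to "1".
OPT_IN_ENV_VAR = "EXAMPLE_TELEMETRY_OPT_IN"

# In-memory aggregates only; a real design would still have to decide
# if, when and where these are ever reported.
call_counts = Counter()
error_counts = Counter()


def telemetry_enabled():
    """Return True only if the user has explicitly opted in."""
    return os.environ.get(OPT_IN_ENV_VAR, "") == "1"


def observed(func):
    """Count calls and raised exception types for `func` (hypothetical helper)."""
    name = f"{func.__module__}.{func.__qualname__}"

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not telemetry_enabled():
            # Opted out: call straight through without recording anything.
            return func(*args, **kwargs)
        call_counts[name] += 1
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Record only the exception type name, not the message,
            # since messages may contain user data.
            error_counts[(name, type(exc).__name__)] += 1
            raise

    return wrapper


@observed
def mean(values):
    """Toy library function standing in for a real API entry point."""
    return sum(values) / len(values)
```

Checking the opt-in flag on every call keeps the behaviour transparent (users can toggle it at runtime) at the cost of one environment lookup per call; a real design would need to weigh that against the "minimal impact" goal.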

drammock commented 2 weeks ago

I'll drop a link here to popylar, which I think only gives stats on overall module usage (i.e., the number of times it is imported), not granular detail about which classes or functions are called. But @arokem is going to be at the summit and might have some thoughts on more granular metrics, so it's probably worth cornering him for a chat about it!

Carreau commented 1 week ago

I will link to https://github.com/scientific-python/summit-2023/issues/17; we might want to pull out notes from scipy from about 10 years ago. See also https://github.com/Carreau/consent_broker, which I started working on some time ago, and a related discussion: https://github.com/pyOpenSci/software-peer-review/issues/183

betatim commented 5 days ago

Different but related: https://github.com/betatim/kamal - a tool you run locally over a code base to get statistics on which parts of a particular library's API are being used. The idea is that people can run this themselves and report the stats they get somewhere (central?). The stats that are collected are kind of easy to look at, the goal being that you can fairly easily convince yourself that no unwanted information is being shared. The original use case I had in mind was organisations that have private code bases but want to help a project learn which parts of its API are being used. Another idea could be running this in the CI of your project and reporting back to some central place (e.g. scikit-learn and pandas run this in their CI and report back to matplotlib or numpy regarding which parts of their API they use).
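As a rough illustration of the static-analysis idea (this is not kamal's actual implementation, just a sketch under simplifying assumptions), the standard-library `ast` module can count which names of a given library a piece of source code references. The function name `count_api_usage` is hypothetical. The sketch resolves only names reached through imports, so method calls on objects (e.g. `arr.mean()`) are not attributed to the library, and shadowed or reassigned names are not handled.

```python
import ast
from collections import Counter


def _dotted_parts(node):
    """Return ['np', 'linalg', 'norm'] for `np.linalg.norm`, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return list(reversed(parts))
    return None  # chain is rooted in a call, subscript, etc.


class _UsageVisitor(ast.NodeVisitor):
    def __init__(self, aliases):
        self.aliases = aliases  # local name -> dotted path inside the library
        self.usage = Counter()

    def visit_Attribute(self, node):
        parts = _dotted_parts(node)
        if parts and parts[0] in self.aliases:
            # Count the full chain once and do not descend into its prefixes.
            self.usage[".".join([self.aliases[parts[0]]] + parts[1:])] += 1
        else:
            self.generic_visit(node)

    def visit_Name(self, node):
        # Bare use of an imported name, e.g. `norm(...)`.
        if node.id in self.aliases:
            self.usage[self.aliases[node.id]] += 1


def count_api_usage(source, library):
    """Count references to `library`'s API in `source` (hypothetical helper)."""
    tree = ast.parse(source)
    prefix = library + "."
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == library or alias.name.startswith(prefix):
                    if alias.asname:
                        aliases[alias.asname] = alias.name
                    else:
                        # `import numpy.linalg` binds the top-level name.
                        aliases[library] = library
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if node.module == library or node.module.startswith(prefix):
                for alias in node.names:
                    local = alias.asname or alias.name
                    aliases[local] = f"{node.module}.{alias.name}"
    visitor = _UsageVisitor(aliases)
    visitor.visit(tree)
    return visitor.usage


example_source = """
import numpy as np
from numpy.linalg import norm

a = np.array([1, 2])
b = np.array([3, 4])
print(norm(a - b), np.linalg.norm(a))
"""

usage = count_api_usage(example_source, "numpy")
```

For the example above, `usage` contains only fully qualified API names and counts (`numpy.array` and `numpy.linalg.norm`), which is the kind of output that is easy to inspect before sharing.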

scientific-python/summit-2024#1