Stebalien opened this issue 7 years ago (Open)
For GitHub I would go the route of subscribing to all webhooks and then writing something that filters and pulls statistics out of that data.
This would allow us, in the future, to compare how we have improved according to different metrics.
I think something that queries the API (it's possible to download issues+comments using the API) as needed may be simpler (no online infrastructure).
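A minimal sketch of what that on-demand querying could look like with the REST issues endpoint, assuming a personal access token; the repo name is just an example, and comments would still have to be fetched per issue via each issue's comments_url:

```python
# Sketch: pull all issues (and PRs) for one repo via the GitHub REST API,
# paginating with the `page` parameter. Repo name and token handling are examples.
import os
import requests

def fetch_issues(owner, repo, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    issues, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers=headers,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        issues.extend(batch)  # note: this endpoint returns issues *and* pull requests
        page += 1
    return issues

if __name__ == "__main__":
    print(len(fetch_issues("ipfs", "go-ipfs", os.environ.get("GITHUB_TOKEN"))))
```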
It could even be a Heroku app. It would just log the events; the infrastructure is minimal and it gives us much better introspection. The GH API has a 1000 req/h limit, so collecting all this data on demand could take a while.
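For illustration, the "just log the events" idea could be as small as a Flask app like the sketch below; the endpoint path and log file are placeholders, and a real deployment would also verify the X-Hub-Signature header:

```python
# Sketch: a tiny webhook receiver that appends every delivery to a newline-delimited
# JSON log. Path, port, and log location are placeholders.
import json
import time
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    record = {
        "received_at": time.time(),
        "event": request.headers.get("X-GitHub-Event"),
        "delivery": request.headers.get("X-GitHub-Delivery"),
        "payload": request.get_json(silent=True),
    }
    # A real deployment should verify request.headers["X-Hub-Signature"] first.
    with open("events.log", "a") as f:
        f.write(json.dumps(record) + "\n")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```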
@Kubuxu Maybe. However I don't think we usually have >1000 issues opened per quarter (although I may be wrong).
I don't think there is a good way to filter them, and you also have to query each repo separately.
An example of prior art in "GitHub analytics": https://github.com/StephenOTT/GitHub-Analytics. It seems to have a concept of org-wide aggregation (a good PoC of what is possible with the GH API).
As for Discourse, we could have a script that identifies threads that are X days old and have no response, and have a bot e.g. on #ipfs that surfaces such questions. Discourse has tracked "Time to First Response" since around 2015: https://meta.discourse.org/t/time-to-first-response-definition-please/30957
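A rough sketch of such a script against Discourse's public /latest.json endpoint; the forum URL, the 7-day threshold, and the lack of pagination are simplifying assumptions:

```python
# Sketch: list topics that are at least a week old and have no replies yet.
from datetime import datetime, timedelta, timezone
import requests

FORUM = "https://discuss.ipfs.io"  # example forum URL
MAX_AGE = timedelta(days=7)        # example threshold

def unanswered_topics():
    topics = requests.get(f"{FORUM}/latest.json").json()["topic_list"]["topics"]
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for t in topics:
        created = datetime.fromisoformat(t["created_at"].replace("Z", "+00:00"))
        # posts_count includes the opening post, so 1 means no replies yet
        if t["posts_count"] == 1 and created < cutoff:
            yield f"{FORUM}/t/{t['slug']}/{t['id']}"

if __name__ == "__main__":
    for url in unanswered_topics():
        print(url)
```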
https://github.com/StephenOTT/GitHub-Analytics looks awesome but might be a bit out of date.
A bit more up-to-date approach would be something like https://github.com/grafana/github-to-es:
This setup should be quite flexible: data aggregation is separate from visualization, and aggregation can be incremental. Grafana lets us create custom dashboards, and we can extend it with various sources of community data (e.g. display GitHub and Discourse on the same graph for comparison). The downside of this approach is that it requires some additional development to get what we want.
We also need a way to keep track of issues needing responses.
I think getting all of our GitHub data into Elasticsearch would let us easily make queries like "issues with no response for a week".
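For illustration only, such a query could look roughly like this, assuming issues end up indexed with state, comments, and created_at fields; the index name and field names here are assumptions, not necessarily what github-to-es produces:

```python
# Sketch: find open issues with zero comments that are older than a week.
import requests

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"state": "open"}},
                {"term": {"comments": 0}},                      # no responses yet
                {"range": {"created_at": {"lte": "now-7d/d"}}}, # at least a week old
            ]
        }
    },
    "sort": [{"created_at": "asc"}],
    "size": 100,
}

resp = requests.post("http://localhost:9200/github_issues/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("title"), hit["_source"].get("html_url"))
```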
I'll just dump what I had previously written locally on this subject (more general open-source metrics rather than just tracking response times; it might be useful for the future).
@mikeal pinging you to add this thread to your radar :)
Thanks @diasdavid! I wasn't aware of this thread and it's super useful.
I already spent some time trying to build out a lot of these metrics on top of the GitHub GraphQL API. I finally had to abandon that approach; there's just no way to get all the data we need for an org our size without blowing out the rate limit. It's also incredibly hard to build "syncing" because of the way they've done pagination.
Instead, I've been working on getting gharchive into an IPLD graph. I'm building a sample set from Q1 2018 now so that we can start experimenting. Once that data is in a graph that we can pin, it will be much easier to select parts of the graph for the ipfs org or specific repos and do whatever data analysis we want on it.
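As a rough illustration of working with that raw data, here is a sketch that pulls one hour of gharchive (gzipped, newline-delimited JSON events) and keeps only events for repos under a given org; the hour and org are examples:

```python
# Sketch: filter one hour of gharchive down to a single org's events.
import gzip
import io
import json
import requests

def org_events(hour_url, org):
    raw = requests.get(hour_url).content
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("repo", {}).get("name", "").startswith(org + "/"):
                yield event

if __name__ == "__main__":
    url = "https://data.gharchive.org/2018-01-15-12.json.gz"  # one hour from Q1 2018
    for ev in org_events(url, "ipfs"):
        print(ev["type"], ev["repo"]["name"], ev["actor"]["login"])
```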
Some non-GitHub metrics I can speak to:
Number of mentions on Twitter
There are some pretty amazing products for managing social media presence. Once the communications team is spun up, I would expect them to pick one of these and have the metrics from those products used in their OKRs.
Number of posts in Discourse
Discourse has an admin dashboard with a lot of great metrics, some of which I think are more important than just "new posts", because we can actually track how many unique people have seen or engaged in existing posts.
A few comments on some of the GitHub metrics.
Ratio of new to closed issues
This is definitely something that I want to know, but I'm resistant to considering it a health metric. There isn't anything inherently unhealthy about having open issues, and the projects I've seen that optimize for closing issues (stale-issue bots) do so without much regard for how that affects the casual contributors who may have engaged in that issue.
While I find it reasonable for some single-maintainer projects to take a more aggressive and automated approach to this, I don't think it's the best option for projects with a lot of maintainers, and it's counter-productive for a project that is trying to retain contributors and turn them into maintainers.
Time to first response in issue/PR
By the time we meet at the Berlin dev meetings I'll have a proposed strategy for which projects the Community Ops org will target first. This is one of the metrics I think we should standardize and once a Community Engineer embeds in a project we should be dedicated to keeping to that standardized level.
Number of people joining the community
Number of people still active in the community
These are the metrics that were blowing out the rate limit in the GraphQL API :) Once we have the gharchive data, this is one of the most important things I want to track. "How many unique people engaged in our projects?" and "How many unique people have engaged each month for 3 months?" are some of the most important things to track, not just for community growth but because they are also a window into how well our user base is growing.
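A sketch of how the monthly number could be computed once the gharchive events are available locally; treating "engaged" as "produced any event" and filtering logins ending in "[bot]" are both assumptions:

```python
# Sketch: count unique (non-bot) actors per month from newline-delimited event dumps.
import json
from collections import defaultdict

def monthly_unique_actors(event_files):
    months = defaultdict(set)
    for path in event_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                actor = event.get("actor", {}).get("login", "")
                if actor and not actor.endswith("[bot]"):
                    months[event["created_at"][:7]].add(actor)  # key is "YYYY-MM"
    return {month: len(actors) for month, actors in sorted(months.items())}

if __name__ == "__main__":
    print(monthly_unique_actors(["ipfs-events-2018-q1.jsonl"]))  # example file name
```

The "engaged each month for 3 months" retention number would then just be a set intersection across consecutive months.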
We did an experiment a while ago with gh-hook-logger, Filebeat, Elasticsearch, and Kibana. I recently spun it up again to see if it would work, and we made some experimental dashboards; this is one of them:
@mikeal let me know if it would be useful to your current endeavor and we can plan for a proper deployment of it; right now it's just a bunch of duct-taped hacks.
@nayafia, I imagine you might find this issue interesting
@mikeal I also came across the Migrations API from GitHub, which seems to be able to grab historical data: https://developer.github.com/v3/migrations/
Maybe we could find a way of backing up all the exposed data into the ELK stack as events somehow.
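For reference, starting an export through that API could look roughly like the sketch below; the org/repo names are examples, and the preview media type is taken from the docs linked above, so it may change:

```python
# Sketch: start an org migration export, poll until it's ready, return the archive endpoint.
import os
import time
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github.wyandotte-preview+json",  # preview header from the docs
}

def export_org(org, repositories):
    resp = requests.post(
        f"{API}/orgs/{org}/migrations",
        headers=HEADERS,
        json={"repositories": repositories, "lock_repositories": False},
    )
    resp.raise_for_status()
    migration_id = resp.json()["id"]

    # Poll until the archive has been generated, then hand back its download endpoint.
    while True:
        state = requests.get(
            f"{API}/orgs/{org}/migrations/{migration_id}", headers=HEADERS
        ).json()["state"]
        if state == "exported":
            return f"{API}/orgs/{org}/migrations/{migration_id}/archive"
        time.sleep(30)

if __name__ == "__main__":
    print(export_org("ipfs", ["ipfs/go-ipfs"]))  # example org and repo
```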
@VictorBjelkholm this API looks ideal for a few of our use cases. It won't help with any realtime dashboards, but for metrics it's perfect. I wonder if there will be any cap on how many migrations you can create. I'd ideally love to see us pull down all of our orgs every night :)
@mikeal I guess it depends on how we set up the collection. If we, let's say, start today with collecting all the webhook actions from all orgs, we only need archives of the data up until today; after that, the rest will be collected from webhooks.
A few things we should investigate:
A tool made by us in the past - https://github.com/ipfs/project-repos
Someone recommended Refined GitHub recently and I'm loving it. Highly recommended for everyone who spends a lot of time on GitHub.
Another recommendation: https://www.gitprime.com/product/