Stebalien opened this issue 7 years ago (Open)
For GitHub I would go the route of subscribing to all webhooks and then writing something that filters and pulls statistics out of that data.
This would allow us, in the future, to compare how we have improved according to different metrics.
I think something that queries the API (it's possible to download issues+comments using the API) as needed may be simpler (no online infrastructure).
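A minimal sketch of what that on-demand querying could look like with the REST issues endpoint, assuming a personal access token; the repo name is just an example, and comments would still have to be fetched per issue via each issue's comments_url:

```python
# Sketch: pull all issues (and PRs) for one repo via the GitHub REST API,
# paginating with the `page` parameter. Repo name and token handling are examples.
import os
import requests

def fetch_issues(owner, repo, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    issues, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers=headers,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        issues.extend(batch)  # note: this endpoint returns issues *and* pull requests
        page += 1
    return issues

if __name__ == "__main__":
    print(len(fetch_issues("ipfs", "go-ipfs", os.environ.get("GITHUB_TOKEN"))))
```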
It could even be a Heroku app. It would just log the events; the infrastructure is minimal and it gives us much better introspection. The GH API has a 1000 req/h limit, so collecting all this data on demand could take a while.
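For illustration, the "just log the events" idea could be as small as a Flask app like the sketch below; the endpoint path and log file are placeholders, and a real deployment would also verify the X-Hub-Signature header:

```python
# Sketch: a tiny webhook receiver that appends every delivery to a newline-delimited
# JSON log. Path, port, and log location are placeholders.
import json
import time
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    record = {
        "received_at": time.time(),
        "event": request.headers.get("X-GitHub-Event"),
        "delivery": request.headers.get("X-GitHub-Delivery"),
        "payload": request.get_json(silent=True),
    }
    # A real deployment should verify request.headers["X-Hub-Signature"] first.
    with open("events.log", "a") as f:
        f.write(json.dumps(record) + "\n")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```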
@Kubuxu Maybe. However I don't think we usually have >1000 issues opened per quarter (although I may be wrong).
I don't think there is a good way to filter them, and you also have to query each repo separately.
An example of prior art in "GitHub analytics": https://github.com/StephenOTT/GitHub-Analytics. It seems to have a concept of org-wide aggregation (a good PoC of what is possible with the GH API).
As for Discourse, we could have a script that identifies threads that are X days old and have no response, and have a bot e.g. on #ipfs that surfaces such questions. Discourse has tracked "Time to First Response" since around 2015: https://meta.discourse.org/t/time-to-first-response-definition-please/30957
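A rough sketch of such a script against Discourse's public /latest.json endpoint; the forum URL, the 7-day threshold, and the lack of pagination are simplifying assumptions:

```python
# Sketch: list topics that are at least a week old and have no replies yet.
from datetime import datetime, timedelta, timezone
import requests

FORUM = "https://discuss.ipfs.io"  # example forum URL
MAX_AGE = timedelta(days=7)        # example threshold

def unanswered_topics():
    topics = requests.get(f"{FORUM}/latest.json").json()["topic_list"]["topics"]
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for t in topics:
        created = datetime.fromisoformat(t["created_at"].replace("Z", "+00:00"))
        # posts_count includes the opening post, so 1 means no replies yet
        if t["posts_count"] == 1 and created < cutoff:
            yield f"{FORUM}/t/{t['slug']}/{t['id']}"

if __name__ == "__main__":
    for url in unanswered_topics():
        print(url)
```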
https://github.com/StephenOTT/GitHub-Analytics looks awesome but might be a bit out of date.
A bit more up-to-date approach would be something like https://github.com/grafana/github-to-es:
This setup should be quite flexible: data aggregation is separate from visualization, and aggregation can be incremental. Grafana lets us create custom dashboards, and we can extend it with various sources of community data (e.g. display GitHub and Discourse on the same graph for comparison). The downside of this approach is that it requires some additional development to get what we want.
We also need a way to keep track of issues needing responses.
I think getting all of our GitHub data into Elasticsearch would let us easily make queries like "issues with no response for a week".
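For illustration only, such a query could look roughly like this, assuming issues end up indexed with state, comments, and created_at fields; the index name and field names here are assumptions, not necessarily what github-to-es produces:

```python
# Sketch: find open issues with zero comments that are older than a week.
import requests

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"state": "open"}},
                {"term": {"comments": 0}},                      # no responses yet
                {"range": {"created_at": {"lte": "now-7d/d"}}}, # at least a week old
            ]
        }
    },
    "sort": [{"created_at": "asc"}],
    "size": 100,
}

resp = requests.post("http://localhost:9200/github_issues/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("title"), hit["_source"].get("html_url"))
```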
I'll just dump what I had previously written locally on this subject (more general open-source metrics rather than just tracking response times; it might be useful for the future).
@mikeal pinging you to add this thread to your radar :)
Thanks @diasdavid! I wasn't aware of this thread and it's super useful.
I already spent some time trying to build out a lot of these metrics on top of the GitHub GraphQL API. I finally had to abandon that approach; there's just no way to get all the data we need for an org our size without blowing out the rate limit. It's also incredibly hard to build "syncing" because of the way they've done pagination.
Instead, I've been working on getting gharchive into an IPLD graph. I'm building a sample set from Q1 2018 now so that we can start experimenting. Once that data is in a graph that we can pin, it will be much easier to select parts of the graph for the ipfs org or specific repos and do whatever data analysis we want on it.
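As a rough illustration of working with that raw data, here is a sketch that pulls one hour of gharchive (gzipped, newline-delimited JSON events) and keeps only events for repos under a given org; the hour and org are examples:

```python
# Sketch: filter one hour of gharchive down to a single org's events.
import gzip
import io
import json
import requests

def org_events(hour_url, org):
    raw = requests.get(hour_url).content
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("repo", {}).get("name", "").startswith(org + "/"):
                yield event

if __name__ == "__main__":
    url = "https://data.gharchive.org/2018-01-15-12.json.gz"  # one hour from Q1 2018
    for ev in org_events(url, "ipfs"):
        print(ev["type"], ev["repo"]["name"], ev["actor"]["login"])
```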
Some non-GitHub metrics I can speak to:
Number of mentions on Twitter
There are some pretty amazing products for managing social media presence. Once the communications team is spun up, I would expect them to pick one of these and have the metrics from those products used in their OKRs.
Number of posts in Discourse
Discourse has an admin dashboard with a lot of great metrics, some of which I think are more important than just "new posts", because we can actually track how many unique people have seen or engaged in existing posts.
A few comments on some of the GitHub metrics.
Ratio of new to closed issues
This is definitely something that I want to know, but I'm resistant to considering it a health metric. There isn't anything inherently unhealthy about having open issues, and the projects I've seen that optimize for closing issues (stale-issue bots) do so without much regard for how that affects the casual contributors who may have engaged in that issue.
While I find it reasonable for some single-maintainer projects to take a more aggressive and automated approach to this, I don't think it's the best option for projects with a lot of maintainers, and it's counter-productive for a project that is trying to retain contributors and turn them into maintainers.
Time to first response in issue/PR
By the time we meet at the Berlin dev meetings I'll have a proposed strategy for which projects the Community Ops org will target first. This is one of the metrics I think we should standardize and once a Community Engineer embeds in a project we should be dedicated to keeping to that standardized level.
Number of people joining the community
Number of people still active in the community
These are the metrics that were blowing out the rate limit in the GraphQL API :) Once we have the gharchive data, this is one of the most important things I want to track. "How many unique people engaged in our projects?" and "How many unique people have engaged each month for 3 months?" are some of the most important things to track, not just for community growth but because they are also a window into how well our user base is growing.
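A sketch of how the monthly number could be computed once the gharchive events are available locally; treating "engaged" as "produced any event" and filtering logins ending in "[bot]" are both assumptions:

```python
# Sketch: count unique (non-bot) actors per month from newline-delimited event dumps.
import json
from collections import defaultdict

def monthly_unique_actors(event_files):
    months = defaultdict(set)
    for path in event_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                actor = event.get("actor", {}).get("login", "")
                if actor and not actor.endswith("[bot]"):
                    months[event["created_at"][:7]].add(actor)  # key is "YYYY-MM"
    return {month: len(actors) for month, actors in sorted(months.items())}

if __name__ == "__main__":
    print(monthly_unique_actors(["ipfs-events-2018-q1.jsonl"]))  # example file name
```

The "engaged each month for 3 months" retention number would then just be a set intersection across consecutive months.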
We did an experiment a while ago with gh-hook-logger, Filebeat, Elasticsearch, and Kibana. I recently spun it up again to see if it would work, and we made some experimental dashboards; this is one of them:
@mikeal let me know if it would be useful to your current endeavor and we can plan for a proper deployment of it; right now it's just a bunch of duct-taped hacks.
@nayafia, I imagine you might find this issue interesting
@mikeal I also came across the Migrations API from GitHub, which seems to be able to grab historical data: https://developer.github.com/v3/migrations/
Maybe we could find a way of backing up all the exposed data into the ELK stack as events somehow.
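For reference, starting an export through that API could look roughly like the sketch below; the org/repo names are examples, and the preview media type is taken from the docs linked above, so it may change:

```python
# Sketch: start an org migration export, poll until it's ready, return the archive endpoint.
import os
import time
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github.wyandotte-preview+json",  # preview header from the docs
}

def export_org(org, repositories):
    resp = requests.post(
        f"{API}/orgs/{org}/migrations",
        headers=HEADERS,
        json={"repositories": repositories, "lock_repositories": False},
    )
    resp.raise_for_status()
    migration_id = resp.json()["id"]

    # Poll until the archive has been generated, then hand back its download endpoint.
    while True:
        state = requests.get(
            f"{API}/orgs/{org}/migrations/{migration_id}", headers=HEADERS
        ).json()["state"]
        if state == "exported":
            return f"{API}/orgs/{org}/migrations/{migration_id}/archive"
        time.sleep(30)

if __name__ == "__main__":
    print(export_org("ipfs", ["ipfs/go-ipfs"]))  # example org and repo
```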
@VictorBjelkholm this API looks ideal for a few of our use cases. It won't help with any realtime dashboards, but for metrics it's perfect. I wonder if there will be any cap on how many migrations you can create. I'd ideally love to see us pull down all of our orgs every night :)
@mikeal I guess it depends on how we set up the collection. If we, let's say, start today with collecting all the webhook actions from all orgs, we only need archives of the data up until today; after that, the rest will be collected from webhooks.
A few things we should investigate:
A tool made by us in the past - https://github.com/ipfs/project-repos
Someone recommended Refined GitHub recently and I'm loving it. Highly recommended for everyone who spends a lot of time on GitHub.
Another recommendation: https://www.gitprime.com/product/