ministryofjustice / operations-engineering

This repository is home to the Operations Engineering team's tools and utilities for managing, monitoring, and optimising software development processes at the Ministry of Justice. • This repository is defined and managed in Terraform
https://user-guide.operations-engineering.service.justice.gov.uk/
MIT License

✨ Experiment to automate collection of user activity currently only available from GitHub Dormant User Manual Download #4790

Open PepperMoJ opened 2 months ago

PepperMoJ commented 2 months ago

User Need

As the Ops Engineering Team, I want an automated process to identify dormant GitHub users that mimics the GitHub Dormant User download, so that I (as an Enterprise owner) don't have to run the manual process to download the GitHub Dormant User CSV file.

Value: Automation over a manual process.

Functional Requirements (What):

Non-Functional Requirements (How):

For each User across both Orgs we need the date of their most recent activity, last_active_date (counting only activity that qualifies as non-dormant 🤯). Users whose last_active_date falls before a chosen cutoff date are then considered dormant (as direct Users of GitHub). This information may be combined with Auth0 data to determine whether they have logged into the AP, which counts as indirect GitHub User activity.

The GitHub REST API exposes endpoints from a repository perspective, so for each repository we can collect the most recent commit date of each of its contributors. We may need other repository interaction information, but commits are a place to start. Other activities: pull request comments, issue comments, issues (creation?).

Start with a list of all GitHub users and repos. Create an empty dataframe with columns user and last_active_date. Cycle through each repo, collecting the dates of that repo's contributors' actions and updating the dataframe with a date whenever it is later than the one already there.
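The loop described above can be sketched as follows. This is a minimal illustration only: a plain dict stands in for the dataframe, and the function names (update_last_active, dormant_users), the user logins, the repo names and the dates are all hypothetical, not taken from any existing code.

```python
from datetime import date


def update_last_active(last_active, events):
    """Fold (login, activity_date) pairs into last_active, keeping the latest date."""
    for login, when in events:
        current = last_active.get(login)
        if current is None or when > current:
            last_active[login] = when
    return last_active


def dormant_users(last_active, cutoff):
    """Logins with no recorded activity, or whose latest activity is before cutoff."""
    return sorted(
        login
        for login, when in last_active.items()
        if when is None or when < cutoff
    )


# Hypothetical inputs: every known user starts with no activity date.
all_users = ["alice", "bob", "carol"]
last_active = {login: None for login in all_users}

# Hypothetical per-repo activity, e.g. commit dates collected from the API.
repo_events = {
    "repo-one": [("alice", date(2024, 10, 21)), ("bob", date(2024, 3, 2))],
    "repo-two": [("alice", date(2024, 6, 1)), ("bob", date(2024, 5, 9))],
}

for events in repo_events.values():
    update_last_active(last_active, events)

print(dormant_users(last_active, cutoff=date(2024, 7, 23)))
```

Keeping a single date per user means the structure never grows beyond the user list, however many repos are scanned.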

Diagram

Assumptions

Acceptance Criteria:

Notes

We know about https://github.com/peter-murray/inactive-users-action (action for identifying inactive users that mimics the GitHub Dormant User manual download), but we know that it works by checking each User in each Repository, so it makes number_of_users * number_of_repos calls; this will time out for our Orgs (>2000 repos).

I think that the GitHub Dormant User Download (and the Peter Murray GH Action) collects more information than simply whether or not a user is active. We don't need the details to decide dormancy; we just need the when.

tamsinforbes commented 1 week ago

Consider a similar solution to the one used for removing outside collaborators, using a GraphQL query and affiliation. See: Remove stale outside collaborators; GitHub Service; get_stale_outside_collaboratoes; get_paginated_list_of_unlocked_unarchived_repos_and_their_first_100_outside_collaborators

GraphQL docs

ENUMS Collaborator Affiliation: DIRECT affiliation: All collaborators with permissions to an organization-owned subject, regardless of organization membership status.

ENUMS ContributionLevel

ENUMS CommitContributionOrderField

Interface Contribution: fields include occurredAt (when the contribution was made) and user (who made it). Implemented by these types of contribution: CreatedCommitContribution, CreatedIssueContribution, CreatedPullRequestContribution, CreatedPullRequestReviewContribution, CreatedRepositoryContribution, JoinedGitHubContribution, RestrictedContribution.

tamsinforbes commented 1 week ago

Consider using git log --since={YYYY-MM-DD}, where YYYY-MM-DD is the date 90 days ago, and filtering this to just merge commits with --merges:

git log --since={YYYY-MM-DD} --merges

The output gives the author and date of each commit, so we can get the set of authors in the last 90 days for each repo. The union of these per-repo sets is the set of all active users; subtract it from the set of all users and remove the difference.

commit 34...............bd
Merge: 2d....e6 26....ef
Author: authorName <author.name@email.com>
Date:   Mon Oct 21 15:58:25 2024 +0100

    :emoji: Commit message 
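A rough sketch of the set logic, assuming the default git log output format shown above. The function names (authors_in_log, dormant) and the extra email addresses are made up for illustration, and authors are keyed on email here purely to keep the example short.

```python
import re

# Matches header lines such as: Author: authorName <author.name@email.com>
# Commit message lines are indented by git log, so they cannot match.
AUTHOR_LINE = re.compile(r"^Author:\s*(?P<name>.*?)\s*<(?P<email>[^>]*)>\s*$")


def authors_in_log(log_text):
    """Return the set of author emails (lower-cased) found in git log output."""
    emails = set()
    for line in log_text.splitlines():
        match = AUTHOR_LINE.match(line)
        if match:
            emails.add(match.group("email").lower())
    return emails


def dormant(all_users, per_repo_active):
    """Union the per-repo active sets and subtract them from all users."""
    active = set().union(*per_repo_active)
    return set(all_users) - active


sample_log = """commit 34...............bd
Merge: 2d....e6 26....ef
Author: authorName <author.name@email.com>
Date:   Mon Oct 21 15:58:25 2024 +0100

    :emoji: Commit message
"""

# Hypothetical second repo and user list.
per_repo_active = [authors_in_log(sample_log), {"someone.else@email.com"}]
all_users = {
    "author.name@email.com",
    "someone.else@email.com",
    "quiet.user@email.com",
}

print(dormant(all_users, per_repo_active))
```

In practice sample_log would be the captured output of git log --since={YYYY-MM-DD} --merges for each cloned repo.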

The author name is not consistent; it might be the GitHub handle. Neither is the email address; it might be an actual email, or a github.com noreply email. At least the GitHub noreply email address contains the GitHub handle, e.g. 12345678+github-handle@users.noreply.github.com; the numeric prefix could be 9 digits, but there is always a +.
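One way to recover the handle from a noreply address is a small regular expression. This is a sketch, not tested against real data: the function name handle_from_email is hypothetical, and treating the numeric prefix as optional is an assumption on my part (as far as I know, older accounts can have a noreply address without it).

```python
import re

# Expected shape: <digits>+<handle>@users.noreply.github.com
# Assumption: the "<digits>+" prefix is treated as optional.
NOREPLY = re.compile(
    r"^(?:(?P<id>\d+)\+)?(?P<handle>[^@+\s]+)@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def handle_from_email(email):
    """Return the GitHub handle embedded in a noreply address, else None."""
    match = NOREPLY.match(email.strip())
    return match.group("handle") if match else None


print(handle_from_email("12345678+github-handle@users.noreply.github.com"))
print(handle_from_email("author.name@email.com"))
```

Addresses that are not noreply addresses return None, so those authors would still need matching some other way.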

Print first and last commit of each contributor to a repo

git log --pretty=format:"%ae %ai" | sort | awk 'contributor != $1 { if (lastContribution) print lastContribution; contributor = $1; lastContribution = ""; print; next } { lastContribution = $0 } END { if (lastContribution) print lastContribution }'

Print the last commit of each contributor to a repo since a given date (but note that a contributor's email is not always the same across commits)

git log --since=2024-10-01 --pretty=format:"%ae %ai" | sort -r | awk '!a[$1]++'

https://unix.stackexchange.com/questions/159695/how-does-awk-a0-work
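The same "latest commit per email" reduction can be sketched in Python, assuming input lines in the "%ae %ai" format used above. The function name latest_commit_per_email and the sample lines are hypothetical. Parsing the timestamps, rather than sorting the text, compares commits made in different time zones by actual instant.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"  # matches git's %ai, e.g. 2024-10-21 15:58:25 +0100


def latest_commit_per_email(lines):
    """Map each author email (lower-cased) to its most recent commit time."""
    latest = {}
    for line in lines:
        # Split from the right: email, date, time, UTC offset.
        parts = line.rstrip().rsplit(" ", 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        email, day, clock, zone = parts
        when = datetime.strptime(f"{day} {clock} {zone}", STAMP_FORMAT)
        key = email.lower()
        if key not in latest or when > latest[key]:
            latest[key] = when
    return latest


# Hypothetical output of: git log --since=2024-10-01 --pretty=format:"%ae %ai"
lines = [
    "author.name@email.com 2024-10-21 15:58:25 +0100",
    "author.name@email.com 2024-10-21 16:10:00 +0300",
    "12345678+github-handle@users.noreply.github.com 2024-10-05 09:00:00 +0000",
]

for email, when in latest_commit_per_email(lines).items():
    print(email, when.isoformat())
```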

tamsinforbes commented 5 days ago

Might be useful https://gist.github.com/raghavmittal101/0b4292e64d298fbc4213b29d221bd8dd

tamsinforbes commented 1 day ago

Peter Murray's action defines activity as commits, creating issues, and making issue or PR comments:

module.exports = {
  COMMITS: 'commits',
  ISSUES: 'issues',
  ISSUE_COMMENTS: 'issueComments',
  PULL_REQUEST_COMMENTS: 'prComments',
}

tamsinforbes commented 1 day ago

Eureka 🥳 pygithub: get the number of commits for each contributor to a repo since a date (by GitHub login, not git author name, which varies)

Can extend this logic to include issue creations, issue commenting, PR creation etc

This should take fewer than number of repos * number of users calls, as once a user is marked as active they are no longer checked. In fact its upper limit is considerably lower than this anyway: number of repos * number of contributors to that repo (which is not the total user list, as nobody contributes to every repo). Again, it should be even lower than this, because a contributor to multiple repos who has already been checked and marked as active does not need to be checked again. This suggests ordering the repos by number of contributors, largest first, and checking in that order, though this may be overkill.
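A minimal sketch of this early-exit loop, written against the PyGithub calls I believe are involved (Repository.get_contributors(), Repository.get_commits(since=..., author=...) and the totalCount of the returned list). The function name find_active_logins is hypothetical, and empty repos, rate limiting and errors are not handled.

```python
from datetime import datetime


def find_active_logins(repos, since):
    """Return logins with at least one commit since `since` in any repo.

    `repos` is any iterable of PyGithub-style Repository objects.
    Users already marked active are skipped, saving one API call each time.
    """
    active = set()
    for repo in repos:
        for contributor in repo.get_contributors():
            login = contributor.login
            if login in active:
                continue  # already known to be active: no need to check again
            commits = repo.get_commits(since=since, author=login)
            if commits.totalCount > 0:
                active.add(login)
    return active


# Possible usage with PyGithub (untested; token and org name are placeholders):
# from github import Auth, Github
# org = Github(auth=Auth.Token("<token>")).get_organization("<org>")
# active = find_active_logins(org.get_repos(), since=datetime(2024, 7, 23))
# dormant = all_logins - active
```

Extending this to issues or comments would mean adding further checks before a user is given up on for a repo, at the cost of extra calls for users who turn out to be inactive.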