Open PepperMoJ opened 2 months ago
Consider a solution similar to the one for removing outside collaborators, using a GraphQL query and affiliation. See: Remove stale outside collaborators, GitHub Service get_stale_outside_collaboratoes, get_paginated_list_of_unlocked_unarchived_repos_and_their_first_100_outside_collaborators
ENUMS CollaboratorAffiliation: DIRECT: All collaborators with permissions to an organization-owned subject, regardless of organization membership status.
ENUMS ContributionLevel
ENUMS CommitContributionOrderField
Interface Contribution
Fields include occurredAt (when the contribution was made) and user (who did it). Implemented by these types of contribution:
CreatedCommitContribution
CreatedIssueContribution
CreatedPullRequestContribution
CreatedPullRequestReviewContribution
CreatedRepositoryContribution
JoinedGitHubContribution
RestrictedContribution
Consider using git log --since={YYYY-MM-DD}, where YYYY-MM-DD is 90 days ago, and filter this to just merge commits with --merges: git log --since={YYYY-MM-DD} --merges
The output gives the author and date of each commit; collect the set of authors in the last 90 days for each repo. The union of these per-repo sets is the set of all active users; subtract it from the set of all users and remove the difference.
commit 34...............bd
Merge: 2d....e6 26....ef
Author: authorName <author.name@email.com>
Date: Mon Oct 21 15:58:25 2024 +0100
:emoji: Commit message
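The set arithmetic described above could be sketched roughly as follows. This is only an illustration: the function names `authors_in_log` and `stale_users` are hypothetical, the log text is assumed to be the default `git log` output shown above, and authors are keyed by email address, which may not be a reliable identity.

```python
import re

# Matches the "Author: name <email>" line of default `git log` output.
AUTHOR_LINE = re.compile(r"^Author:\s*(?P<name>.*?)\s*<(?P<email>[^>]*)>\s*$")


def authors_in_log(log_text):
    """Return the set of author emails found in one repo's `git log` output."""
    authors = set()
    for line in log_text.splitlines():
        match = AUTHOR_LINE.match(line)
        if match:
            authors.add(match.group("email").lower())
    return authors


def stale_users(all_users, per_repo_logs):
    """Union the per-repo author sets, then subtract from all users."""
    active = set()
    for log_text in per_repo_logs:
        active |= authors_in_log(log_text)
    return set(all_users) - active
```

Whatever `stale_users` returns is the "diff" to review for removal; it is only as accurate as the mapping from commit email to user.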
The author name is not consistent; it might be the GitHub handle.
Neither is the email address; it might be an actual email or a github.com noreply email.
At least the GitHub noreply email address contains the GitHub handle, e.g. 12345678+github-handle@users.noreply.github.com. The numeric prefix could be 9 digits, but it is always followed by a +.
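As a rough sketch, the handle could be pulled out of a noreply address with a regular expression. The helper name `handle_from_noreply` is hypothetical. One assumption beyond the note above: the numeric prefix is treated as optional, since older GitHub accounts can have a noreply address without it.

```python
import re

# "<digits>+<handle>@users.noreply.github.com"; the "<digits>+" part is
# treated as optional (assumption: older accounts may lack the ID prefix).
NOREPLY = re.compile(
    r"^(?:(?P<id>\d+)\+)?(?P<handle>[^@+]+)@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def handle_from_noreply(email):
    """Return the GitHub handle from a noreply address, or None otherwise."""
    match = NOREPLY.match(email.strip())
    return match.group("handle") if match else None
```

Any address that is not a noreply address returns None, so those commits would still need another way to be mapped to a GitHub login.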
Print first and last commit of each contributor to a repo
git log --pretty=format:"%ae %ai" | sort | awk 'contributor != $1 { if (lastContribution) print lastContribution; contributor = $1; lastContribution = ""; print; next } { lastContribution = $0 } END { if (lastContribution) print lastContribution }'
Print the last commit of each contributor to a repo since a given date. But the email is not always the same.
git log --since=2024-10-01 --pretty=format:"%ae %ai" | sort -r | awk '!a[$1]++'
https://unix.stackexchange.com/questions/159695/how-does-awk-a0-work
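For comparison, here is a small Python sketch of what `sort -r | awk '!a[$1]++'` does: after a reverse sort, the first line seen for each email is that contributor's latest commit, and later lines for the same email are skipped. The function name `last_commit_per_email` is hypothetical, and the input is assumed to be lines in the `%ae %ai` format used above.

```python
def last_commit_per_email(lines):
    """Map each author email to its latest commit timestamp string.

    `lines` are "<email> <YYYY-MM-DD HH:MM:SS +ZZZZ>" strings, as produced
    by `git log --pretty=format:"%ae %ai"`.
    """
    latest = {}
    # Reverse lexicographic sort, like `sort -r`: newest line per email first.
    for line in sorted(lines, reverse=True):
        if not line.strip():
            continue
        email, _, when = line.partition(" ")
        # Like awk's '!a[$1]++': keep only the first line seen per email.
        latest.setdefault(email, when)
    return latest
```

As with the shell version, the ordering is textual, so commits made in different time zones within the same few hours may not be ordered exactly.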
Peter Murray's action defines activity as commits, creating issues, and making issue or PR comments:
module.exports = {
COMMITS: 'commits',
ISSUES: 'issues',
ISSUE_COMMENTS: 'issueComments',
PULL_REQUEST_COMMENTS: 'prComments',
}
Eureka 🥳
pygithub - get the number of commits for each contributor to a repo since a date (by GitHub login, not git author name, which varies)
Lists used: current_users (login names, including any values to ignore in checks, such as bots), active_repos, active_users
For each repo in active_repos, get its contributors and intersect them with current_users (to remove contributors no longer in the org)
For each contributor not already marked active (i.e. in the active_users list), check activity by getting their count of commits since the given date; if there are any, add them to the active_users list
Current users not in the active list are deemed inactive and may be terminated
Can extend this logic to include issue creation, issue commenting, PR creation, etc.
This should take fewer than number of repos × number of users calls, as once a user is marked as active they are no longer checked.
In fact its upper limit is considerably lower than this anyway: number of repos × number of contributors to each repo, i.e. the sum of contributor counts across repos (which is not the total user list, as nobody contributes to every repo). It should be lower still, because a contributor to multiple repos who has already been checked and marked as active does not need to be checked again. This suggests ordering the repos by number of contributors and checking the largest first, though this may be overkill.
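The approach above might look roughly like the sketch below. It is deliberately independent of any GitHub client: `find_inactive_users`, `repo_contributors`, and `commit_count_since` are hypothetical names, and the caller supplies the data. With PyGithub, `repo.get_contributors()` and `repo.get_commits(author=login, since=...).totalCount` look like plausible sources for these inputs, but that wiring is an assumption and is not shown.

```python
def find_inactive_users(current_users, repo_contributors, commit_count_since):
    """Return current users with no commits since the cutoff in any repo.

    current_users:      iterable of org member logins
    repo_contributors:  dict of repo name -> iterable of contributor logins
    commit_count_since: callable (repo, login) -> int, one API call per use
    """
    current = set(current_users)
    active = set()
    # Check repos with the most contributors first (may be overkill).
    repos = sorted(
        repo_contributors,
        key=lambda name: len(repo_contributors[name]),
        reverse=True,
    )
    for repo in repos:
        # Drop contributors no longer in the org and those already active.
        candidates = (set(repo_contributors[repo]) & current) - active
        for login in sorted(candidates):
            if commit_count_since(repo, login) > 0:
                active.add(login)
    return current - active
```

Because already-active logins are removed from `candidates`, the number of `commit_count_since` calls is bounded by the sum of contributor counts across repos and is usually well below it.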
User Need
As an Ops Engineering Team I want an automated process to identify dormant GitHub users that mimics the GitHub Dormant user download so that I (as an Enterprise owner) don't have to run the manual process to download the GitHub Dormant user csv file.
Value Automation over a manual process.
Functional Requirements (What):
Non-Functional Requirements (How):
For each User across both Orgs we need the date of their most recent activity, last_active_date (for activity that counts as being non-dormant 🤯). Then we can say that those with a last_active_date before a given cutoff date are considered dormant (as a direct User of GitHub). This information may be combined with Auth0 to determine if they have logged into the AP, which counts as indirect GitHub User activity.
The GitHub REST API exposes endpoints from a repository perspective, so for each repository we can collect the most recent commit date of its contributors. We may need other repository interaction information, but commits are a place to start. Other activities: pull request comments, issue comments, issues (creation?).
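A minimal sketch of tracking last_active_date per user, fed one repository at a time. It uses a plain dict keyed by login to stay dependency-free. The names `update_last_active` and `dormant_users` are hypothetical, and treating a user with no recorded activity as dormant is an assumption.

```python
from datetime import date


def update_last_active(last_active, repo_activity):
    """Fold one repo's (login, date) activity into the running record.

    last_active:   dict of login -> date, or None if no activity seen yet
    repo_activity: iterable of (login, date) pairs from a single repo
    """
    for login, when in repo_activity:
        if login not in last_active:
            continue  # not a current user, so ignore
        previous = last_active[login]
        if previous is None or when > previous:
            last_active[login] = when
    return last_active


def dormant_users(last_active, cutoff):
    """Users whose last activity is before the cutoff, or never seen
    (assumption: no recorded activity counts as dormant)."""
    return {
        login
        for login, when in last_active.items()
        if when is None or when < cutoff
    }
```

Other activity types (issue creation, comments) could be passed through the same `update_last_active` call, since only the date matters for deciding dormancy.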
Start with a list of all GitHub users and repos. Create an empty dataframe of user and last_active_date. Cycle through each repo, collecting the dates of that repo's contributors' actions and updating the dataframe with the date if it is later than the one already there.
Diagram
Assumptions
Acceptance Criteria:
Notes
We know about https://github.com/peter-murray/inactive-users-action (an action for identifying inactive users that mimics the GitHub Dormant User manual download), but we know that it works by checking each User in each Repository, so it makes number_of_users * number_of_repos calls; this will time out for our Orgs (>2000 repos).
I think that the GitHub Dormant User Download (and the Peter Murray GH Action) collect more information than simply whether or not a user is active. We don't need the details to decide dormancy; we just need the when.