rundel / ghclass

Tools for managing classroom organizations
https://rundel.github.io/ghclass/
GNU General Public License v3.0

get details for repositories #72

Open andreashandel opened 5 years ago

andreashandel commented 5 years ago

First off, great package! Have started using it for my class.

Here's a feature I'd like to have and it currently doesn't seem possible: When I get a list of repositories, I'd like to see when they last got modified, and if possible, further stats (number of open/closed commits, number of issues, etc.). Ideally, a command like

org_repos("myorg", alldetails = TRUE)

would return a data frame with the 1st column the repo name, the further columns various stats about the repo. I don't know enough about the Github API to know if that's possible, but it would be great :)

My use-case: I asked students to create projects/repos, then a few weeks later I asked them to file issues on each other's repos and update their own. I don't want to go through each repo to see if the issue was filed and what exactly they updated; it's enough for me to see that there is at least 1 issue for each repo and the 'last modified' date is this week. Being able to run a function like the above one would save me a lot of time :)
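To make the check concrete, here is a minimal sketch of what I would do with such a data frame once I had it. The data and the column names (`open_issues`, `closed_issues`, `last_update`) are made up for illustration; they are assumptions, not anything ghclass currently returns.

```r
# Hypothetical per-repo summary; the column names are assumptions for this
# sketch, not actual ghclass output.
repo_info <- data.frame(
  repo = c("myorg/alice-project", "myorg/bob-project", "myorg/carol-project"),
  open_issues = c(1L, 0L, 2L),
  closed_issues = c(0L, 0L, 1L),
  last_update = as.Date(c("2019-10-14", "2019-09-02", "2019-10-16")),
  stringsAsFactors = FALSE
)

# Start of the week being checked (fixed so the example is reproducible).
week_start <- as.Date("2019-10-14")

# A repo passes if it has at least one issue and was modified during the week.
repo_info$has_issue <- (repo_info$open_issues + repo_info$closed_issues) >= 1
repo_info$updated_this_week <- repo_info$last_update >= week_start

# Repos that fail either check and need a manual look.
needs_followup <- repo_info$repo[!(repo_info$has_issue & repo_info$updated_this_week)]
needs_followup
```

In practice `week_start` would be computed from `Sys.Date()` rather than hard-coded.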

rundel commented 5 years ago

Glad to hear that you are finding the package useful.

I think something like this would absolutely be useful and we can definitely include this kind of functionality in the future. If we add it, it will be as a new function, something like org_repo_stats, so that we can keep the current functions type stable (org_repos returns a character vector and other functions rely on that assumption).

The only other wrinkle is that a lot of these details live in different places within the V3 api - so it would require multiple requests per repo to collect everything (i.e. commits, issues, PRs, etc) which will be slow and may run into API rate limits. With that said, I think this should be possible with the V4 GraphQL based API to grab everything at once - but that isn't something I've played with much yet.

rundel commented 5 years ago

I've made a first pass at implementing something like this and it is available in the repo_stats branch. You should be able to install it via devtools::install_github("rundel/ghclass@repo_stats") if you would like to try it out.

The added function is called org_repo_stats and just needs the name of your organization. If there are additional details about the repos you think would be helpful please let me know.

andreashandel commented 5 years ago

Wonderful, thanks for the quick reply! I just tried it, works well, exactly the info I was looking for!

Suggestions: should the function maybe be called org_repos_stats to make it consistent with org_repos? Also, having the 'filter' and 'exclude' arguments from org_repos would be nice. Of course not a big deal, I can always filter afterward. But for orgs with tons of repos, it might be worth having these options so the function only pulls the data one is interested in. Similarly, if pulling too much info from the API is an issue (as you suggested in your previous post), another filter option could be provided for the user to specify only the columns they are interested in (e.g. only information related to issues, or only info related to PRs).

But those are just minor suggestions, I'm perfectly happy with the current version!

andreashandel commented 5 years ago

So something isn't quite working. I just tried to use the new feature to check how many open and closed PRs students had. When I used org_repo_stats, it showed me 0 for open and closed prs in almost all the repositories I was interested in. However, when I go to the student repository it shows (as it should) that most have at least 1 closed PR (that was their homework, most did it).

This is an example repo that has 2 closed PRs (as of me writing this) but shows 0 when I retrieve the info through the org_repo_stats function:

https://github.com/epid8060fall2019/MeganRobertson-coding

Curiously, 1 repo each had a non-zero entry for open and closed PR, so the function must be pulling 'something' from the github API, just not the right value somehow. I'm not sure where things go wrong.

Let me know if you want me to test/debug/reproduce anything or help in some other way.

Also, since I'm already at it: I noticed that for my purpose, having data on number of commits and contributors would also be nice to have :)

Thanks!

rundel commented 5 years ago

We made a couple of improvements to the function last week that included the number of commits as well as the ability to filter (really search) repo names.

I'll see what I can figure out about the PRs; I'll need to investigate exactly what is being grabbed by the API.

rundel commented 5 years ago

That was actually more straightforward than I thought: PRs can be open, closed or merged, and I was just missing that last option. It should be included now.
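For anyone curious, the three states can be counted separately in a single GraphQL request. The following is only a sketch of such a query, not the query ghclass actually sends; `build_pr_count_query()` and the org name `myorg` are hypothetical. In the v4 API a merged PR is reported with the state MERGED rather than CLOSED, which is why counting only OPEN and CLOSED came back as 0 for repos whose PRs had all been merged.

```r
# Build a GitHub GraphQL (v4) query that counts PRs in each of the three
# states for the repos of one organization.
# build_pr_count_query() is a hypothetical helper, not part of ghclass.
build_pr_count_query <- function(org, n_repos = 100L) {
  stopifnot(is.character(org), length(org) == 1L)
  stopifnot(n_repos >= 1L, n_repos <= 100L)  # GraphQL page size limit is 100
  sprintf('
query {
  organization(login: "%s") {
    repositories(first: %d) {
      nodes {
        name
        open:   pullRequests(states: [OPEN])   { totalCount }
        closed: pullRequests(states: [CLOSED]) { totalCount }
        merged: pullRequests(states: [MERGED]) { totalCount }
      }
    }
  }
}', org, as.integer(n_repos))
}

query <- build_pr_count_query("myorg")
cat(query)
```

The sketch only builds the query text; sending it requires an authenticated GraphQL client, and organizations with more than 100 repos would also need pagination.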

rundel commented 5 years ago

So getting contributor information doesn't seem possible with the V4 API, but it is supported with V3, so I've added a repo_contributors function which reports the number of commits by user for a set of repos. This went into master and I've merged those changes into repo_stats as well.

andreashandel commented 5 years ago

This is great, thanks for adding! It makes my Github class management easier with each new feature :)

A few further thoughts/suggestions: Students often don't quite spell repository names as instructed, so I need to search multiple times. For instance if I want to find the repository called 'NAME-coding', I need to do (at minimum) something like this:

r1 <- org_repos(orgname, filter = "Coding")
r2 <- org_repos(orgname, filter = "coding")
current_repos <- c(r1, r2)

It would be rather nice if one could use the dplyr syntax for filtering/selecting, e.g.

r <- org_repos(orgname,filter == dplyr::ends_with("oding"))

I think filter functionality in the tidyverse style would be great (instead of regex, which is more complicated). Also, since this is a comparison, I would switch to "==" instead of "=" since the latter is assignment. Using dplyr notation also means one wouldn't need the "exclude" setting in org_repos anymore; instead one would use a "-" or != or similar in the filter argument. Of course I have no idea how feasible this is :)
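In the meantime, a workaround is to fetch all repo names once and filter the resulting character vector locally. This is a minimal sketch in base R with made-up repo names; it assumes nothing about ghclass beyond org_repos returning a character vector.

```r
# Example repo names as a character vector (made-up values for illustration).
repos <- c(
  "myorg/Smith-Coding",
  "myorg/jones-coding",
  "myorg/lee-visualization",
  "myorg/coding-notes"
)

# Keep names that end in "coding", ignoring capitalization.
coding_repos <- repos[grepl("coding$", repos, ignore.case = TRUE)]

# Same result without a regular expression: lower-case, then test the suffix.
coding_repos_alt <- repos[endsWith(tolower(repos), "coding")]

coding_repos
```

Both versions skip "myorg/coding-notes" because the match is anchored to the end of the name.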

Same suggestions hold for the new org_repo_stats function and its filter argument. I also just noticed that org_repo_stats ignores capitalization, which I think is not ideal - though in my very special case actually helpful :) .

The ability to filter by different types in org_repo_stats is great, now I can easily get stats on all repos for a specific user.

I like the repo_contributors function! However, I noticed that things somehow don't agree between that function and org_repo_stats. If I sum up the commits for all contributors, the number seems to often be lower than the commits that org_repo_stats reports. The number of commits they differ by is variable. See screenshot below.

To combine contributor information with the other stats, I did a left-join like so:

cc <- left_join(repo_contr,repo_stats,by='repo')

That gave me everything in one place, though if I had a lot of contributors per repository (which I don't), there would be a lot of duplication. I'm wondering if it might be easier for further processing if repo_contributors returns a list, with each repo a main list element, then a vector of usernames and a vector of # commits for each user? I can see the advantage of staying with just tibbles, though.
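Here is a sketch of an alternative to the join that avoids the duplication: sum the contributor commits per repo first, then compare against the stats table, and use `split()` when a per-repo list is wanted. The data and column names (`repo`, `username`, `commits`) are invented for the example and are assumptions about the shape of the two tables, not the exact output of either function.

```r
# Made-up stand-ins for the two tables; column names are assumptions.
repo_contr <- data.frame(
  repo = c("myorg/a-coding", "myorg/a-coding", "myorg/b-coding"),
  username = c("alice", "instructor", "bob"),
  commits = c(5L, 2L, 3L),
  stringsAsFactors = FALSE
)
repo_stats <- data.frame(
  repo = c("myorg/a-coding", "myorg/b-coding"),
  commits = c(7L, 4L),
  stringsAsFactors = FALSE
)

# Sum commits per repo across all contributors.
contr_totals <- aggregate(commits ~ repo, data = repo_contr, FUN = sum)
names(contr_totals)[2] <- "contributor_commits"

# One row per repo, so the stats columns are not repeated per contributor.
comparison <- merge(repo_stats, contr_totals, by = "repo", all.x = TRUE)
comparison$difference <- comparison$commits - comparison$contributor_commits
comparison

# List form: one element per repo, each holding usernames and commit counts.
by_repo <- split(repo_contr[c("username", "commits")], repo_contr$repo)
```

A non-zero `difference` flags the repos where the two sources disagree.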

Attaching a screenshot of part of this merged data frame below so you can see the commit-count discrepancy I mentioned.

Thanks a lot for all those updates! Happy to try out anything or provide any further feedback that might be helpful.

[Screenshot: ghclass-screenshot]

andreashandel commented 4 years ago

@rundel Hi Colin, just following up on this. My course is finished, I'm now planning to write up a blog post/brief tutorial (largely for my future self, and anyone else who's interested) describing how I used ghclass to manage my class. I want to keep things non-confusing, so only want to mention features that are/will be in the main branch on github (and eventually on CRAN). Should I install the current master branch from github and assume the features contained there are the ones available for the foreseeable future? Or are there still some side-branches with features that are about to be part of the main? Thanks!

rundel commented 4 years ago

Hi Andreas,

The main branch is stable and everything there should be sticking around for the foreseeable future. I have not had the time to sit down and merge the repo_stats branch but most of that should be moving over with minimal changes.

nicholasjhorton commented 4 years ago

I've found the org_repo_stats() function to be helpful. But it would be even better if there was an option that allowed the count of commits to be split out by contributor. This only really makes sense in splitting out the commits variable.

rundel commented 4 years ago

This is currently available via the ghclass::repo_contributors function - we're a bit hamstrung by the nature of what endpoints github makes available, hence all the separate functions. I may be able to get org_repo_stats to do this as well but it will take a bit of mucking about with the graphql query.

nicholasjhorton commented 4 years ago

Even better: I withdraw the suggestion!