Up-For-Grabs Statistics

johncmckim commented 8 years ago

Firstly, I really like the up-for-grabs concept. It would be interesting to see the effect of the up-for-grabs tag, or to report on engagement in general.

Using the GitHub APIs (specifically list issues, list collaborators and the statistics API), data could be pulled on issues with the up-for-grabs tag and on who contributed to those issues. This could give some interesting information about the performance of the tag and community engagement in general.

In terms of code contributions, information that could be captured could include:

Issues that had pull requests from the community (non-collaborators)
Issues with the up-for-grabs tag that had pull requests from the community
General statistics about contributions from the community
other stats?
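
As a rough sketch of how the first two counts in the list above might be computed: assuming the pull requests and the collaborator logins have already been fetched, a pull request counts as a community contribution when its author is not a collaborator. The input shapes (`author`, `issueLabels`) and the function name are hypothetical, chosen for illustration, and are not GitHub's response format.

```javascript
// Hypothetical input shapes, not GitHub's actual API responses:
//   pullRequests:  [{ author: 'login', issueLabels: ['up-for-grabs', ...] }]
//   collaborators: ['login', ...]
function summarizeCommunityPulls(pullRequests, collaborators, label) {
  const insiders = new Set(collaborators);
  const summary = { total: pullRequests.length, community: 0, communityLabelled: 0 };

  for (const pr of pullRequests) {
    if (insiders.has(pr.author)) {
      continue; // opened by a collaborator, so not a community contribution
    }
    summary.community += 1;
    if (pr.issueLabels.includes(label)) {
      summary.communityLabelled += 1; // community PR against a labelled issue
    }
  }
  return summary;
}
```

Keeping the counting separate from the API calls means the same function could run in a build step or behind a service.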

Other information about issues created and closed could also be interesting. However, that could be considered later.

I think this could provide valuable information on a Repo's community engagement, which seems to be what you are trying to encourage with this tag. The information could be fed back to Repo owners and possibly made publicly available.

I would love to hear what you think of this. I understand it may not fit into this Repo, as a Jekyll site probably can't do this. However, I wanted to float the idea anyway to see what you think.

shiftkey commented 8 years ago

I'm really interested in seeing us extract more data from the projects to surface to people visiting the site. Nudging successful projects up the search results would be a great way to motivate maintainers. A simple example of this is #233.

As discussed in #261 it's rather easy to get rate-limited currently, so perhaps we need to stand up a service that users hit (rather than hitting the API directly), and then start exploring data somehow.

I wish I had the bandwidth to look at this, but I'm open to helping someone else if they want to get down and dirty with fleshing this out.

johncmckim commented 8 years ago

I did notice the rate limiting. Both issues require some kind of back-end service.

The comment about an Azure web job to rebuild pages is interesting. However, I personally think it would be wasteful to rebuild all the static content just to update the small portion of the content that is actually dynamic.

One solution could be to use AWS Lambda and AWS S3. Specifically, you could potentially use a Scheduled Lambda Function to make your API requests and upload the results to S3. The Jekyll site could then hit S3 instead of Github directly. This should be a very cost effective solution. If there's an Azure equivalent that could be good as well.
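
A minimal sketch of what such a scheduled job might do, independent of where it runs. `fetchCount` and `upload` are hypothetical stand-ins for the GitHub request and the S3 (or blob storage) upload, and the file name `issue-counts.json` is also just an assumption; no real SDK call is shown.

```javascript
// fetchCount(project) -> Promise<number>   (hypothetical: asks GitHub for a count)
// upload(name, body)  -> Promise<void>     (hypothetical: writes to S3 or similar)
async function publishSnapshot(projects, fetchCount, upload) {
  const counts = {};

  for (const project of projects) {
    try {
      counts[project] = { count: await fetchCount(project), error: null };
    } catch (err) {
      // record the failure and keep going so one project does not block the rest
      counts[project] = { count: null, error: err.message };
    }
  }

  const body = JSON.stringify({ generatedAt: new Date().toISOString(), counts }, null, 2);
  await upload('issue-counts.json', body);
  return counts;
}
```

The site would then read the uploaded JSON file instead of calling the GitHub API from each visitor's browser.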

Another potentially free solution could be to use AppVeyor Scheduled builds to make the API calls and generate a JSON artifact which is uploaded somewhere (S3, Azure Blob, other?) to be pulled in by the Jekyll site.

Another option would be to set up a service dedicated to doing this. However, I think it could potentially be solved without a server. Do you have any preferences for technologies or other thoughts about how this could be solved?

Depending on how you want to solve this, I can potentially help out. Though, like you, I find time can be somewhat hard to come by.

daveaglick commented 8 years ago

It just so happens that I know a static site generator that is designed specifically to handle complex code-based scenarios like this. I also know the maintainer of said generator is looking for some community projects to help out and apply it to.

So I guess the first question is: how wedded are you guys to Jekyll?

The other question I have is what specifically we would do with the additional data? Sorting by popularity is one good idea, but how to measure "popularity"? Number of stars, PRs in the last week or month? I've also seen other similar sites present number of issues, forks, PRs, etc. Would there be interest in each listed up for grabs project having a detail page with more information, or maybe a "fly-out" kind of like ProductHunt?

shiftkey commented 8 years ago

@daveaglick

So I guess the first question is: how wedded are you guys to Jekyll?

Speaking just for myself, I really enjoy the benefits that Jekyll gives us right now. Not to rule it out too early, but the current stack works really nicely for what we need. Before I "throw it all away" (my words, tongue in cheek) I'd really like to understand more about this alternative and what it gives us.

The other question I have is what specifically we would do with the additional data?

For me, the big benefit of this data would be to identify and promote projects which stand out and achieve the goals we set out for this project - of course how we measure those things is something we can discuss. Ultimately I'd love to surface those results on the site so people can discover these active, successful projects more easily.

For me, there's two different sorts of data here - the projects (which are relatively stable - I don't recall removing a project from this list) and then the data relating to each project (which is as dynamic as we want it to be). And with the crazy things you can do in a browser right now, I'm still leaning towards keeping these two separate, rather than regenerating the entire site as the data changes.

daveaglick commented 8 years ago

I really enjoy the benefits that Jekyll gives us right now

Totally understandable. Maybe I'll put together a little PoC so you can see what I'm thinking. No pressure, and it'll be a good exercise in any case.

Out of more broad curiosity, what benefits are most important to you from Jekyll right now? The rapid rebuild on changes, use of front matter, templating language, etc.? Or is it mainly that it's already there and works well ("don't fix what ain't broke")?

shiftkey commented 8 years ago

Maybe I'll put together a little PoC so you can see what I'm thinking.

That'd be great.

shiftkey commented 8 years ago

Out of more broad curiosity, what benefits are most important to you from Jekyll right now?

For me, it's more about the GitHub Pages support (that is, a superset of Jekyll):

rebuild and deploy whenever the repository is updated
templating and separation of markup from user-generated content
zero infrastructure to manage and administer

johncmckim commented 8 years ago

Since this went a little quiet, I thought I'd make a little demo. I created a separate repo as it's just a proof of concept.

This Node script iterates the project YAML files, requests the issues and then outputs the counts to a JSON file. This could be done as part of a build process. The outputted JSON can then be uploaded to an appropriate place and the Jekyll site can hit that instead of the API directly.

This is a cut down and simplified version of the code (see the link above to test it):

// ... require 'fs', 'path', 'lodash', 'promise', 'yamljs', 'octonode'
// ... create path variables

// parse configs
var projectConfigs = _.map(projectFiles, function(fileName) {
  // ...
  return YAML.parse(fileContent);
});

var client = github.client();

var linkRegex = /github\.com\/([^\/]+\/[^\/]+)\/labels\/([^\/]+)$/;
var issuePromises = [];

// request the labelled issues for each project config
_.each(projectConfigs, function(config) {
  var repoUrl = config.upforgrabs.link;
  var gh = repoUrl.match(linkRegex);
  if (!gh) {
    return;
  }

  var repoName = gh[1], label = gh[2], ghrepo = client.repo(repoName);

  issuePromises.push(new Promise(function (resolve, reject) {
    ghrepo.issues({ labels: label }, function(err, data, headers) {
      resolve(/* ... result ... */);
    });
  }));
});

// wait until all issues resolved
Promise
  .all(issuePromises)
  .then(function (issues) {
    // reduce results to appropriate format
    var issueCounts = _.reduce(issues, function(result, item, key) {
      var hasError = !!item.response.err;

      result[item.repo.name] = {
        hasError: hasError,
        error: hasError ? item.response.err.message : null,
        count: hasError ? null : item.response.data.length,
      };

      return result;
    }, {});

    // write results to disk
    fs.writeFile(outputFilename, JSON.stringify(issueCounts, null, 2), function (err) {
      // current Node versions require a callback here
      if (err) { throw err; }
    });
  });

This is just getting the issue counts. It could potentially resolve #261 as this would only need to be run at limited intervals. It could then be expanded to start retrieving and processing other data to produce statistics instead.
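
One caveat worth sketching, as an assumption about how the counts would be used: GitHub's list-issues endpoint returns pull requests alongside issues (marked with a `pull_request` key) and pages its results, 30 per page by default, so `data.length` on a single response can miscount. The helper below is a hypothetical illustration that takes already-fetched pages and counts only true issues; fetching every page is left out.

```javascript
// pages: an array of response pages, each an array of issue objects
// as returned by GitHub's list-issues endpoint.
function countIssues(pages) {
  return pages
    .flat() // merge all pages into one list
    .filter((item) => !('pull_request' in item)) // drop pull requests
    .length;
}
```

Whether pull requests should be excluded depends on what the site wants to show, so treat the filter as optional.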

If this is a solution that interests you I can help set it up. The main questions to take this from a concept to solution are:

What is used to run it? (Travis would work well)
How regularly would you run this?
Where is the issueCounts JSON stored?
Would it sit in its own repo or this repo?

I see a few options for it:

This could become part of the website build process. The scripts are part of this repo and executed in the Travis build. The resulting JSON would then just become part of the GitHub Pages site. However, if it has to run regularly on a schedule to update the data, the whole site is being rebuilt constantly. Furthermore, Travis doesn't support scheduled builds, so you would need to use something to trigger the builds on a schedule (https://nightli.es/ or similar).

Otherwise, it could become a separate service. The scripts are in a new repo and executed by some build process (Possibly Travis, maybe AppVeyor as they support Node and scheduled builds). The output is then stored somewhere (S3, Azure Blob, something else) and the website uses that as the endpoint instead.

What do you think @shiftkey?

shiftkey commented 8 years ago

@johncmckim that's interesting, but I really want to avoid the whole "build step" option. So I'll put my money where my mouth is and publish a little demo repo here that I knocked together this afternoon which shows what I was thinking:

https://github.com/shiftkey/up-for-grabs-api-demo

The live site is available here: https://up-for-grabs-data.herokuapp.com/issues/count?project=albacore (the project name is case-sensitive, and isn't the filename of the YAML file).

I went with the really lazy approach here:

Heroku deployments via webhooks - really didn't want to set up my own server
Memcached instance for the backing store
GitHub token can be specified as an environment variable - goodbye API restrictions!

I went with a simple, dumb endpoint to verify the caching is working as expected, but this could easily be used as a proxy for the browser making requests directly to GitHub - and we can shape the API however we want, and leave caching up to however we configure memcached.
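
To illustrate the caching idea only (this is not the code in the demo repo), here is a minimal in-memory sketch of a lookup that serves a cached value until it expires. The names `createCachedLookup`, `load` and `now` are hypothetical; in the setup described above, memcached would hold the entries and `load` would be the GitHub request.

```javascript
// load(key): fetches a fresh value, e.g. one GitHub API request (hypothetical)
// ttlMs:     how long a cached value stays fresh, in milliseconds
// now:       clock function, injectable so the expiry logic can be tested
function createCachedLookup(load, ttlMs, now = Date.now) {
  const entries = new Map();

  return function lookup(key) {
    const hit = entries.get(key);
    if (hit && now() - hit.storedAt < ttlMs) {
      return hit.value; // still fresh: no upstream request
    }
    const value = load(key);
    entries.set(key, { value, storedAt: now() });
    return value;
  };
}
```

With a cache like this in front of the API, the number of GitHub requests depends on the expiry time rather than on how many visitors the site gets.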

Apologies that this is radically different to what you had in mind, but hopefully this approach interests you enough to help collaborate further on it!

johncmckim commented 8 years ago

@shiftkey I was taking a build step approach as I thought the aim was to avoid a web server. I like this approach too, really simple.

I haven't used Heroku myself, but if it's just writing a Node app I can do that. Happy to contribute. If you create issues on https://github.com/shiftkey/up-for-grabs-api-demo and mark some as up-for-grabs, I'll take a look at the ones I think I can help with. I could also create some issues for statistics-related endpoints so the API side of this issue can be tracked there.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

up-for-grabs / up-for-grabs.net

Up-For-Grabs Statistics #285