hackforla / website

Hack for LA's website
https://www.hackforla.org
GNU General Public License v2.0

"Contributors" only showing GitHub commits #346

Closed · cnk closed this issue 4 years ago

cnk commented 4 years ago

What is the 'contributors' GitHub API endpoint returning? I think it is people who have committed code to the repository.

How do we list/reward people who make non-code contributions? Is there GitHub data that includes contributions such as commenting on issues? Project management, etc.?

cnk commented 4 years ago

There is a "teams" endpoint - but I think that shows anyone mapped to the repo AND I can't get any of the repos to work. All give 404 https://api.github.com/repos/hackforla/311-data/teams

We could try scraping the user or assignee on the issues URL: https://api.github.com/repos/hackforla/website/issues That is a fair amount of work. Does it solve the problem that the contributors API endpoint doesn't include UI/UX folks?

cnk commented 4 years ago

TODO: Email GitHub re our 311-data project. We want to acknowledge non-code contributions, which the contributors endpoint does not honor.

joelparkerhenderson commented 4 years ago

How do we list/reward people who make non-code contributions?

A quick, easy way is for each repo to use a top-level "CONTRIBUTORS.md" file.

cnk commented 4 years ago

Issue I raised with GitHub support:

Hack for LA is the Los Angeles brigade of Code for America. We are improving our website to better show off our projects and the people building them. We are using the GitHub collaborators API to pull lists of people committing code to each project's main repository. For example, one of our projects is 311 Data: https://www.hackforla.org/projects/311-data shows the people listed in https://api.github.com/repos/hackforla/311-data/collaborators

However, a lot of the current work on that project is UI/UX design. Our UX people are contributing greatly to the project, but most of their contributions can be seen on the project's issues board, https://github.com/hackforla/311-data/projects/4, not as code commits. Is there a way for us to get a list of the contributors to our project boards? Ideally we would like a combined list that rewards project board contributions AND code contributions equally, but a list of just project board contributors would also be useful. Is there an API endpoint we can use that I am just not seeing?

Their response:

This is an interesting question! Are you specifically looking for the authors of those issues in the Project? If so, could you consider iterating through the following API endpoint and fetching those issue creators?

https://developer.github.com/v3/issues/#list-repository-issues

If you want to include issue comments too, you could fetch those from this endpoint:

https://developer.github.com/v3/issues/comments/#list-comments-on-an-issue

You mentioned using https://api.github.com/repos/hackforla/311-data/collaborators which will show users added as collaborators. I also wanted to point you to this endpoint which provides the same data as https://github.com/hackforla/311-data/graphs/contributors and might be more useful:

https://developer.github.com/v3/repos/statistics/#get-contributors-list-with-additions-deletions-and-commit-counts

I hope this is helpful! Please do let me know if you have any questions.
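For reference, here is a minimal sketch of what support is suggesting: paging through the issues endpoint and collecting each issue's author. It assumes Node 18+ (built-in fetch); the function name and token handling are illustrative, not from our actual script.

    // Collect the set of issue authors for a repo by paging through
    // the "list repository issues" endpoint.
    async function getIssueAuthors(owner, repo, token) {
      const authors = new Set();
      for (let page = 1; ; page++) {
        const res = await fetch(
          `https://api.github.com/repos/${owner}/${repo}/issues?state=all&per_page=100&page=${page}`,
          { headers: { Authorization: `token ${token}` } }
        );
        const issues = await res.json();
        if (!Array.isArray(issues) || issues.length === 0) break;
        issues.forEach(issue => authors.add(issue.user.login));
      }
      return [...authors];
    }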

cnk commented 4 years ago

So the bottom line is we can get non-code contributions by scraping the issues and comments API endpoints for a project, but we have to do that aggregation ourselves; there isn't an existing aggregation like there is for code commits. Since projects have a lot of issues, getting a full list would involve multiple requests per project (since I expect pagination to kick in). Is this worth the effort? Or do we want to do something that doesn't order by total number of issues/comments? We could have designers commit something so we have them in the Contributors API, and then add "recent contributions" that pulls data from https://developer.github.com/v3/issues/#list-repository-issues, which includes issue creation and pull requests. Issues come with who created them and who was assigned. Is that sufficient?

ExperimentsInHonesty commented 4 years ago

@cnk will determine if you can scrape the people who comment vs assignees.

harishlingam commented 4 years ago

This issue requires some knowledge of JavaScript and APIs.

KianBadie commented 4 years ago

Update

I've been working on this issue this week. What I've been trying to do is add functionality to the script we currently have in order to loop through the comments on issues (open or closed) that are tied to a repository.

The GitHub API returns paginated responses. So, for example, if I ask for all the issues under a repository with 250 issues, it might return 30 issues (I think it maxes out at 100 per page) under "page 1" and provide more URL endpoints that return the remaining issues (page 2, 3, 4, and so on). As of right now, I was able to get the script to a point where it can gather the URLs for all pages. This means that if a URL endpoint returned 8 pages worth of results, the script is able to request exactly the number of pages needed to get all the issues and no more.
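To illustrate the approach described above, here is a rough sketch, assuming Node 18+ fetch: read the Link header of the first response to find the rel="last" page number, then build exactly the page URLs needed. The parsing is simplified for illustration.

    // Determine how many pages a paginated endpoint has and build the
    // URL for each page.
    async function getPageUrls(baseUrl, token) {
      const res = await fetch(`${baseUrl}?per_page=100&page=1`, {
        headers: { Authorization: `token ${token}` },
      });
      const link = res.headers.get('link'); // null when there is only one page
      let lastPage = 1;
      if (link) {
        const match = link.match(/[?&]page=(\d+)[^>]*>;\s*rel="last"/);
        if (match) lastPage = Number(match[1]);
      }
      const urls = [];
      for (let page = 1; page <= lastPage; page++) {
        urls.push(`${baseUrl}?per_page=100&page=${page}`);
      }
      return urls;
    }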

I have my own repo where this is being done here. However, the functionality I mentioned is not reflected on the repo yet. I was planning on pushing it when I reached a more concrete milestone than creating the needed URLs.

KianBadie commented 4 years ago

@leonelRos So I updated my repo to fetch every issue listed under a repository and store the "comments_url" for each issue in the github-data.json file.

However, I did just stumble upon this, which might be a quicker way to do what we are doing. From my impression, this endpoint just fetches all the comments in a repository, rather than doing what I have been trying to do, which is fetch all the issues and then get all the comments on those issues.

cnk commented 4 years ago

Yes, I think that comments URL is the one: https://developer.github.com/v3/issues/comments/#list-comments-in-a-repository. It looks like when you are in the repository data, that URL is given to you as issue_comment_url. For example, in https://api.github.com/repos/hackforla/website we have "issue_comment_url": "https://api.github.com/repos/hackforla/website/issues/comments{/number}", and we can just truncate that to "https://api.github.com/repos/hackforla/website/issues/comments"
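The truncation described above is a one-liner; a sketch (the variable name is just an example):

    // Strip the {/number} template suffix from the repository's
    // issue_comment_url to get the "all comments" endpoint.
    const repoData = {
      issue_comment_url: 'https://api.github.com/repos/hackforla/website/issues/comments{/number}',
    };
    const commentsUrl = repoData.issue_comment_url.replace('{/number}', '');
    // -> "https://api.github.com/repos/hackforla/website/issues/comments"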

leonelRos commented 4 years ago

@KianBadie @cnk Yeah, I think this definitely helps us fetch all the comments. Now I am going to see how we can implement this. I am currently playing around on my branch. I will post when I reach a new step.

KianBadie commented 4 years ago

@cnk @leonelRos Yeah, @cnk you are absolutely right. I implemented that endpoint and used it to fetch all the comments. It is available to see on my repo. I think the next task could be parsing all the comments data into something more meaningful. I imagine there are going to be repeat commenters, so maybe we would want to keep track of who commented how many times? And maybe we could put this data in its own field in the github-data.json file? Or would we want to just remove any redundancies?

@leonelRos wanted to help me out, and at this point, this would be the next task to do. @cnk should we parse this? And if so, how should we parse it?

And another thing for us to investigate: whether that new endpoint we just discovered really does get all the comments in a repository. Closed issues, open issues, pull requests, whatever.

cnk commented 4 years ago

Yes, I think you want to keep track of who commented how many times. I would suggest you parse that API data into a list of JSON objects like we have for the data section of contributors here: https://github.com/hackforla/website/blob/gh-pages/_data/github-data.json If you name it something else like "commentors" (at least for now), then we could easily swap back and forth between commentors and contributors while we are getting a feel for which gives us a better sense of the various projects.

And yes, I would love to figure out how comprehensive the comments are. If it is everything, this might be a great replacement for the commit-only info in the contributors endpoint.
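A sketch of the tally being suggested, with the entry fields guessed to mirror the contributors section of github-data.json (the exact field names in that file may differ):

    // Tally comments per user and emit a sorted "commentors" list.
    function buildCommentors(comments) {
      const byLogin = new Map();
      for (const { user } of comments) {
        const entry = byLogin.get(user.login) || {
          id: user.id,
          github_url: user.html_url, // assumed field name
          avatar_url: user.avatar_url,
          contributions: 0,
        };
        entry.contributions += 1;
        byLogin.set(user.login, entry);
      }
      // Most frequent commenters first.
      return [...byLogin.values()].sort((a, b) => b.contributions - a.contributions);
    }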

KianBadie commented 4 years ago

So I was able to compile a sorted list of commenters and how many times they commented. There is an issue that I realized, though. I found that someone made a pull request and had it merged, but did not show up on the commenters list. I'm assuming this is because she just made a pull request that got merged and didn't comment on any other issues. However, she also didn't show up in the contributors data either. I would have to learn more about why she didn't show up there, but it can be an issue for future reference!

KianBadie commented 4 years ago

So I made the script get a list of commenters per project that is sorted by how many comments they made. The changes are reflected on my repo. The only thing that is bugging me is that I am getting an inconsistent error. [Screenshot from 2020-04-25 17-25-40]

I'm pretty confident nothing changed from the first time I ran the code to the other times. The error would pop up, I would then try to console.log() the thing that was throwing the error, and then it wouldn't throw the error anymore. Really close to it being done, other than this bug. It's hard to replicate.

KianBadie commented 4 years ago
  1. Progress - Have a sorted list of people who have commented on repository issues.
  2. Blocks - I have gotten an inconsistent error a couple of times that prevents GitHub API data from being saved. If this were implemented and the site ran into the error, it would be bad for project home pages.
  3. Availability - Tuesday, Friday
  4. ETA - Sunday 5/03 I will try to have the script in a state where it is ready to be merged into the site.
KianBadie commented 4 years ago

I think I have started to figure out the cause of the error. In one instance where I got the error, I inspected the list of comments for a project and found something that looked like this.

comments:
   url: "..."
   data: [
      {...}, // Comment object
      {...}, // Comment object
      ...,
      "403 - {\"documentation_url\":\"https://developer.github.com/v3/#abuse-rate- 
     limits\",\"message\":\"You have triggered an abuse detection mechanism. Please wait a few 
      minutes before you try again.\"}"
   ]

So I believe the error is coming up because the loop I have parses the comments list and expects the whole list to be comment objects. Because I ran into a rate limit violation, not all of the entries were comment objects; one was a string. So the actions I was doing with comment objects broke when the loop ran into the string. I think the important question now is: did the rate limit abuse error come from me running/testing my own code frequently? Or did it come up because the script itself makes a lot of requests (we are asking for a lot of information)? I will provide an update.
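One way to keep the loop from breaking, sketched under the assumption that error entries arrive as strings mixed into the data array (as in the dump above):

    // Separate real comment objects from stray error strings such as
    // the "403 - ..." abuse-rate-limit message, instead of crashing.
    function splitCommentsAndErrors(data) {
      const comments = data.filter(item => typeof item === 'object' && item !== null);
      const errors = data.filter(item => typeof item === 'string');
      if (errors.length > 0) {
        console.warn('Dropped non-comment entries:', errors);
      }
      return { comments, errors };
    }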

cnk commented 4 years ago

@KianBadie were you using a Personal Access Token while running your script? I ran into rate limits with the original API scraping and was able to solve it by passing one: https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line

But what you are doing might be more data intensive and so might run into the access limit even with a token.

Since this is intermittent, that probably won't be a problem once this is in production. But you will need to catch the error and then not overwrite the existing file if the nightly script fails.
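A sketch of that guard: only overwrite the data file when the whole fetch succeeded. Paths and function names here are examples, not the actual script.

    const fs = require('fs');

    // Keep yesterday's github-data.json if the nightly fetch throws
    // (e.g. on an abuse-rate-limit 403).
    async function updateDataFile(fetchAllData) {
      try {
        const data = await fetchAllData();
        fs.writeFileSync('_data/github-data.json', JSON.stringify(data, null, 2));
      } catch (err) {
        console.error('Fetch failed; keeping the existing file:', err);
        process.exitCode = 1; // surface the failure without clobbering the data
      }
    }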

KianBadie commented 4 years ago

@cnk I am using a personal access token, so unfortunately it seems I am maxing out the limits.

Yeah, that's exactly what I was thinking. Weirdly enough, I ran the script for the first time in a couple days and immediately ran into the error. I ran the script about a minute or 2 after and it worked fine.

Update

So I did some searching online and I am realizing that maybe I need to start thinking of the rate limit as an actual rate and not just a cap. Maybe I am making too many requests in a short amount of time? Here is an email I sent to support:

Hello, I had some questions about the rate limits for the Github API.

I am currently working on a script that is attempting to get data across multiple repositories under a single organization. The script is making multiple API requests to get things like repository languages, repository contributors, and lastly (this is the big one, I think) all issue comments on a repository. I am doing the last one to see which users are active in each of these repositories who might not be reflected in the contributors endpoint because they don't contribute code (maybe they are UI or UX people or something).

I have been running into this error: "403 - {\"documentation_url\":\"https://developer.github.com/v3/#abuse-rate-limits\",\"message\":\"You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.\"}". The script makes 220 requests and I do not think I hit the rate limits, as this was the result of checking using the API:

    {
      resources: {
        core: { limit: 5000, remaining: 4164, reset: 1588374570 },
        search: { limit: 30, remaining: 29, reset: 1588372484 },
        graphql: { limit: 5000, remaining: 5000, reset: 1588376027 },
        integration_manifest: { limit: 5000, remaining: 5000, reset: 1588376027 },
        source_import: { limit: 100, remaining: 100, reset: 1588372487 }
      },
      rate: { limit: 5000, remaining: 4164, reset: 1588374570 }
    }

From some searching, I think I am starting to realize that I was thinking of the rate limit as a cap rather than an actual rate. I wanted to ask and confirm whether the GitHub API returns this error when too many requests are made in a short amount of time even though the cap on requests is not hit? And if so, do you have any advice on how to achieve the goal of the script using the GitHub API? Thank you.

Hopefully we can hear back soon. I am a bit stumped at the moment on this one.

Also, I tried to run into the error again and ran the script about 10+ times in, what I would call, quick succession, and the error did not get thrown. Which makes it more inconsistent than what I was initially thinking, and makes me scratch my head more.

KianBadie commented 4 years ago

I was not able to figure out a way to make the script work with the rate limits. GitHub has not responded yet either. I reached out to Steven Ettinger for some help and he helped bring up some ideas. One is that maybe we could break up the GitHub Actions to run at different times in the day and split up the work among the actions. That way we can make half the requests at X time and the other half at Y time.

Another thing I wanted to bring up that I thought about: how are we going to decide which users to display from the comments data? At first I thought in order of most comments, but what if someone just joined and is consistently contributing, but because they are new they don't get recognized?

ExperimentsInHonesty commented 4 years ago

@KianBadie it's all-time: higher total contributions appear earlier. Every contributor will appear, but the more total contributions you make, the closer to the top of the list you are.

ExperimentsInHonesty commented 4 years ago

@cnk - thoughts about rate limiting? Also @KianBadie, will this be less of an issue when we are pulling daily? Or is it already rate limiting on daily pulls? And what are the limits?

KianBadie commented 4 years ago

New message I sent to GitHub support: Hello,

My name is Kian and I am with Hack for LA, which is the Los Angeles brigade of Code for America. I reached out recently on the same issue, but wanted to reach out again on behalf of Hack for LA this time.

On our site we display multiple GitHub projects that are under Hack for LA and we want to represent more of the people who are working/contributing to these projects using the GitHub API. For example, our 311 Data project (https://www.hackforla.org/projects/311-data) shows contributors using this GitHub API endpoint: https://api.github.com/repos/hackforla/311-data/contributors

However, we also wanted to show people who do not contribute code, like UI/UX people who are contributing on issues through their comments. Our solution to this was to collect all the issue comments and compile a list of everyone who commented. As of right now, we are doing this by going through each project under our organization (about 20+), calling the "List comments in a repository" endpoint (/repos/:owner/:repo/issues/comments), and then making subsequent calls to however many pages of results came about through pagination (https://api.github.com/repositories/:repo/issues/comments?page=:page).

The project with the most pages of results had about 31 pages. So already, that is 30 requests for one project. Other projects have under 10 or maybe 5 pages. The script overall makes about 220 requests to the GitHub API (1 of those was to search, so I know we are not hitting the 5000 per hour limit). However, the script is inconsistent in terms of working. Sometimes it will work fine, and sometimes it will have this error message in the results for our data: "403 - {\"documentation_url\":\"https://developer.github.com/v3/#abuse-rate-limits\",\"message\":\"You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.\"}." There have been times where I got the error multiple times in a row, and other instances where I ran the script multiple times without the error (which I ran multiple times just now with no error).

I'm assuming the error is happening because we are making those 220 requests (I'm confident the number of requests has stayed the same throughout testing for the error) in such a short amount of time, which is what is abusing the rate limit? But I am still not sure why the error would happen inconsistently. Would you be able to confirm my theory on the rate limit and clarify the inconsistency in the error? Would you have any advice on how we can better use the GitHub API to achieve our goal?

Additional context: we are going to run this script once a night using GitHub Actions. It is implemented already, just without the "Issue Comments" data fetching.

Thank you very much for your time, Kian Badie

ExperimentsInHonesty commented 4 years ago

Set up meeting with @cnk and @KianBadie for next week with the GitHub guy. Cnk says "I have meetings from 1-3 Monday and 10-11 Tuesday. Other than that, I am free"

[Screen Shot 2020-05-10 at 9:44:37 AM]

[Screen Shot 2020-05-10 at 9:45:33 AM]

ExperimentsInHonesty commented 4 years ago

Set up appointment for 3:20 Monday.

ExperimentsInHonesty commented 4 years ago

@cnk @KianBadie and @ExperimentsInHonesty had a meeting with the GitHub rep, and will explore the new "Discussions" feature more. It's been enabled in beta now on the website repo.
The rep also agreed to make sure that Kian's issue gets answered. @KianBadie is there a link you can provide to the open issue, so we can email her with that?

KianBadie commented 4 years ago

Update: Sent an email to the owner of the API ecosystem team, a connection we got through the person we talked to last Monday 5/11. It's pretty similar to the emails I have sent in the past, but with some updates. I also have another update below this email.

Hello,

My name is Kian Badie from Hack For LA, and on Monday 5/11 I spoke with Evangeline Liu about GitHub Discussions and the issues we have been having with the GitHub API. She told me to get in touch with you, which I am late in doing. I'm sorry if you were expecting this email sooner, if at all. Thank you for the time you and your team have given us. In summary, the main issue is that I have been inconsistently running into abuse-rate-limit errors while trying to use the GitHub API for our goals.

Some background: I am with Hack for LA, which is the Los Angeles brigade of Code for America. In our organization, many projects use GitHub to collaborate. On our website, each project has its own page and we use the GitHub API to display some information, like what languages they are using and who the project contributors are. However, we also wanted to show people who do not contribute code, like UI/UX people, who are contributing through their issue comments. The solution we tried was to collect all the issue comments and compile a list of everyone who commented. As of right now, we are doing this by going through each project under our organization (about 25+), calling the "List comments in a repository" endpoint (/repos/:owner/:repo/issues/comments), and then making subsequent calls to however many pages of results came about through pagination (https://api.github.com/repositories/:repo/issues/comments?page=:page). In total, the script makes about 230 requests to the GitHub API (1 of those was to the search API endpoint), so I know we are not hitting the 5000 per hour limit (checked the number of requests using this endpoint: https://api.github.com/rate_limit). However, when I was testing the script a couple weeks ago, it was inconsistent in terms of working. Sometimes it would work fine, and other times it would have this error message in the results for our data: "403 - {\"documentation_url\":\"https://developer.github.com/v3/#abuse-rate-limits\",\"message\":\"You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.\"}." There have been times where I got the error multiple times in a row, and other instances where I ran the script multiple times without the error. At the time of writing this, I revisited the script and ran it multiple times in a row without any errors.

I'm assuming the error is happening because we are making those 230 requests in such a short amount of time, which is what is abusing the rate limit? Like maybe we are making too many requests per second? But I am still not sure why the error would happen inconsistently. Would you be able to confirm my theory on the rate limit and clarify the inconsistency in the error? Would you have any advice on how we can better use the GitHub API to achieve our goal? I think the last time I was testing the script a couple weeks ago I could not replicate the error either. But I do not think I changed much in the code (it's possible I didn't change anything) between getting the error and not getting the error.

Additional context: we are going to run this script once a night using GitHub Actions. It is implemented already, just without the "Issue Comments" data fetching.

Thank you very much for your time. Again, we at Hack For LA really appreciate the time you all have given us already. Kian Badie

I ran the script 10+ times tonight and did not run into the error once. I think the last time I was testing the script a couple weeks ago I could not replicate the error either. We still ran into the error in the past, though, and I don't know why we are not getting it now. As of right now, I don't know if that hints that the functionality is possible for some reason now. It also depends on how it would mix if the additional requests added in #458 get integrated.

KianBadie commented 4 years ago

Emailed the person we talked to about GitHub Discussions to see if they can check if the API Ecosystem team owner got our email.

KianBadie commented 4 years ago

Unfortunately, no response yet from either the API Ecosystem team owner or the person who originally connected us.

ExperimentsInHonesty commented 4 years ago

@KianBadie look for an endpoint where you can get issues updated by date. Use that to find all issues updated after X date, then use those issue numbers to look for comments on those issues after X date.

@harishlingam suggested a Stack Overflow search.

KianBadie commented 4 years ago

@ExperimentsInHonesty There is a parameter we can use to get issues that have been updated since a certain date. See the since parameter on this page.

However, this does include updates such as someone editing a comment. So no new comments could have been added, but if someone edits a comment, the issue will be marked as updated.

Edit: It turns out this can be applied to the "Repository issue comments" endpoint as well. So we can fetch all issue comments in order of when they were last updated, too.
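A sketch of using that parameter (Node 18+ fetch assumed; only the first page is shown, and since filters on updated_at, so edited comments are included):

    // Fetch issue comments updated since a given ISO 8601 timestamp.
    async function getRecentComments(owner, repo, sinceIso, token) {
      const res = await fetch(
        `https://api.github.com/repos/${owner}/${repo}/issues/comments?since=${sinceIso}&per_page=100`,
        { headers: { Authorization: `token ${token}` } }
      );
      return res.json();
    }

    // e.g. getRecentComments('hackforla', 'website', '2020-06-01T00:00:00Z', token);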

ExperimentsInHonesty commented 4 years ago

@KianBadie let me know when you have time to talk about the finding above. I think we might have a path forward on this.

harishlingam commented 4 years ago
  1. Progress
  2. Blocks - None
  3. Availability - This week
  4. ETA - Next Tuesday
KianBadie commented 4 years ago

I have an update:

So I modified the script that we have in my own little test repo, and it does seem to fetch comments made in the past 24 hrs for each repo. I also made a separate script that aggregates all the existing comments and builds a table that has each commenter and how many times they commented. So I have a work in progress of a way to get updated comments, plus an initializer that builds the "starting point" that the updated comments data will build upon.

I did have a question, though. If we are only using a single list of contributors, how are we going to compare contributions made from commits vs. comments? Our current list of contributors is based on commits. If somebody has made 100 comments, how would we integrate that into that list? Since comments can vary in impact, it doesn't seem entirely possible. Should we have 2 lists? One list for contributions made through code and one list for contributions made through discussion?

KianBadie commented 4 years ago

We will need to compare the current contributors data with a new contributors data list composed solely of commenter data. If there are drastic changes, then we will need to discuss how to resolve the 2 sets of data. But if the order of contributors from commit contributions is roughly the same, then maybe we can rely solely on the comments data.

KianBadie commented 4 years ago

@cnk I have a branch that compares the 2 different sets of data. The contributors data on this branch is replaced with the data gathered from getting all the issue comments, sorted so the top commenter is on top. Here is a preview of how it is different:

(Commenters data on left, original on right) [Screenshot from 2020-06-16 14-42-03]

Feel free to check out the branch and look at other projects. From a surface level observation, I have 2 notes that I think are important:

KianBadie commented 4 years ago

I spent some time working on both scripts (the initializer that builds all the comments data and the script we already have). The initializer now pulls comments data from all repos, even if a repo we keep track of is under a different org.

I am not finished with the other script, but I did do a good amount of refactoring, because implementing the comment fetching feature made more sense to me if I refactored it the way I did (kind of hard to explain in a comment, but I could try to explain if anyone would like me to). I also started refactoring because the script was growing bigger, was starting to have more moving parts, and was getting hard to keep track of. I thought organizing it while working on this would be a good idea. I think it might be pretty close to being done, though. You can check out the work on my repo in the branch I will link here. Once the scripts are done, it's a matter of making them work in the context of the website code, which shouldn't be too bad.

Screenshots

Here are some screenshots of what the contributors data looks like when using comments data vs. commit data. I didn't add a screenshot of our project page because it is posted above. If you would like to check out the data yourself, it is on my branch linked here.

Commenters on the left vs commits on the right

Public Tree Map [public-tree-map screenshot]

VRMS [vrms screenshot]

311 Data [311-data screenshot]

Engage [engage screenshot]

ExperimentsInHonesty commented 4 years ago

We made a decision to count all contributions (comments and commits in total, no weighting).
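Under that decision, merging the two data sets could look roughly like this sketch (the entry shape is assumed to match the contributors/commentors lists discussed above):

    // Sum commit-based and comment-based contribution counts per person,
    // with no weighting, and sort by the combined total.
    function mergeContributions(commitList, commentList) {
      const totals = new Map();
      for (const list of [commitList, commentList]) {
        for (const person of list) {
          const entry = totals.get(person.id) || { ...person, contributions: 0 };
          entry.contributions += person.contributions;
          totals.set(person.id, entry);
        }
      }
      return [...totals.values()].sort((a, b) => b.contributions - a.contributions);
    }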

KianBadie commented 4 years ago

GitHub support got back to me a week ago. They did confirm the abuse rate limit errors were happening because we were making so many requests in such a short amount of time. They recommended that "the best approach would be to slow down your script, e.g. by adding some sleep statements between API calls. Another approach would be to make requests as before, and then if/when you hit the limits -- look at the Retry-After response header and it will tell you when you can resume making requests."

They also linked this documentation page, with the relevant points being to make requests sequentially instead of concurrently, plus the notes about the "Retry-After" header mentioned before. We decided to implement this feature in a similar fashion to how we have implemented the past features of this GitHub script (concurrently, without the "Retry-After" header), and to possibly refactor the script in the future to reduce the chances of getting abuse rate limit errors.
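For reference, support's recommendation would look something like this sketch (not what the script currently does, per the paragraph above; Node 18+ fetch assumed):

    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

    // Make requests one at a time with spacing between calls, and honor
    // the Retry-After header if a 403 comes back.
    async function fetchSequentially(urls, token) {
      const results = [];
      for (const url of urls) {
        let res = await fetch(url, { headers: { Authorization: `token ${token}` } });
        const retryAfter = res.headers.get('retry-after');
        if (res.status === 403 && retryAfter) {
          await sleep(Number(retryAfter) * 1000);
          res = await fetch(url, { headers: { Authorization: `token ${token}` } });
        }
        results.push(await res.json());
        await sleep(1000); // simple pause between API calls
      }
      return results;
    }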

KianBadie commented 4 years ago
  1. Progress - Pretty much finished with the initializer script in my test environment. The data fetching script fetches recent GitHub issue comments for each repo, but not for repos that are part of organizations with multiple sibling repos under them.
  2. Blocks - None
  3. Availability - Tuesday evening, Thursday and Friday most of the day
  4. ETA - Sunday 6/28
KianBadie commented 4 years ago
  1. Progress - I made progress this week, but unfortunately did not finish the task. Where I'm at right now is trying to get the script that fetches recent comments to work with the existing comments data. I did refactor the code a bit, and made it so that contributions are combined from the commit and comments data.
  2. Blocks - None
  3. Availability - Thursday, Friday, Saturday
  4. ETA - The next step after the scripts are working is to get them working in the context of the website (because I am doing this in my test environment). The script involves reading from the github-data.json file, and I had trouble last time trying to figure out how to access repo files in GitHub Actions, as it isn't as straightforward as with a normal script. Because of this, I think I will have a pull request by 7/05. Sorry for the delay!
KianBadie commented 4 years ago
  1. Progress - Both scripts are pretty much finished. The script that initializes all the comments data and the script that fetches the commit and recent comment data both seem to work. I have a branch that has the refactored/new data fetching script that works and commits the new data.
  2. Blocks - Since there is a script that initializes all comment data up to date and a script that fetches comment data within the past 24 hrs, we need to figure out the timing of implementing both scripts. Also, not a block per se, but I've been questioning how to test the code, both in verifying that the data is correct and in making sure we won't run into rate limit errors in the future (the script hasn't run into any recently in my testing).
  3. Availability - Thursday, Friday, Saturday evening/day
  4. ETA - I can have my branch/data merged by Tuesday. With the new updates added, I will still try to get it done as soon as possible (Tuesday). But I will say Friday it will be done just to be safe and not let down expectations.

Update: From today's meeting 7/5, we decided to change the script that gets comments from the past 24 hrs to instead fetch comments since the last time the script was run (saved as an environment variable of sorts). Also, a way to test the script would be to create another script that only gathers the data for one project, and then compare it to the data for that project gathered through the main script.

KianBadie commented 4 years ago

I have created the testing script mentioned in the previous comment. After using it, I got the same results as my "initialization" script for a project with multiple repos and a project with a single repo. So I am assuming my initialization script is correct. Now it is time to test and touch up the github-api script.

However, there is an important note that I just learned. While testing, I noticed that the github-api script had recorded 1 more comment than my test results. I think this is because the github-api script fetches comments since a specific date, and the GitHub API will record an updated comment as matching this criterion. My testing script, however, just fetches all the comments since the beginning of the project. So even though a comment has been updated multiple times, the testing script will only record it as 1 comment. This isn't a problem at first, because I was able to easily figure out the updated comment since the count was only off by one. But if we were to test in the future, I think the github-api data will start to deviate greatly from the script that fetches the total number of comments, because it will aggregate comment updates over time while the test script just gets the total number of comments.

For now, I will just assume the small differences in count are due to comment updates. However, in the future, if the data we have is lost, I don't think we can rebuild it. This is because the GitHub API records only the last time a comment was updated. So if we had kept track that a comment was updated 7 times over the course of a week and we lost that data, we would only be able to see when that comment was made and when it was last updated.

A risk that comes to mind: if the script is implemented and there turns out to be an error in the way we fetch the data, we can't necessarily fix it and then rebuild, since the script that initializes all the comments works like the testing script.

This is all from my knowledge. For all I know, there could be some endpoint or way to see all past updates for a comment.

cnk commented 4 years ago

I am of two minds about which is the better count: counting comments, or counting comments plus their updates (as would be the case for the incremental counts). But either method is likely to produce the same order of contributors, so I am not worried about the counts changing if we lose the data and have to recreate it.

KianBadie commented 4 years ago

@cnk I think you're right. Since we aren't showing total contributions, the order is more important, and I think it is pretty likely the order will stay the same. Changing the code so that it doesn't recognize updates wouldn't be a big change, so I will code it that way for now just so it is easier to test and keep track of. If we do want to keep track of updates, I think I could easily remove it (it will probably be a simple if statement).

Edit: Funny situation. We actually need to count updates. I was originally omitting updates by checking whether the updated time was different from the creation time. This works in the situation where someone made a comment and edited it AFTER the nightly script already recorded the comment. However, if I make a comment on July 11th at 1pm and then edit it at July 11th 1:30pm (aka edit the comment BEFORE the nightly script caught it), the script will not recognize the comment at all because of the if statement I mentioned earlier, which is a step backwards. This makes testing harder, as I can't use a script to compare answers, because of the problems I mentioned in my earlier comment about counting updates.
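A sketch of the pitfall being described (created_at and updated_at are the real fields on comment objects; the check itself is the problematic one):

    // Too strict: this drops any comment that was edited BEFORE the
    // nightly script first saw it, not just already-counted updates.
    function isUneditedComment(comment) {
      return comment.created_at === comment.updated_at;
    }
    // A comment made at 1:00pm and edited at 1:30pm (before the nightly
    // run) fails this check and is never counted at all.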