sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Github Discussions with GraphQL API #324

Open crepesAlot opened 1 month ago

crepesAlot commented 1 month ago

Purpose


GitHub Discussions is a public forum that allows collaborative communication without needing to be tied to a specific issue or piece of code. It provides a more centralized space to hold discussions. The data that can be mined from discussions and their comments is of interest to anyone studying the relationship between users and a project's community. As such, we now need a way to retrieve comments from this new endpoint.

About Github Discussions: https://docs.github.com/en/discussions/quickstart

Process


To do this, we're using the GraphQL API. There is only a single endpoint: https://api.github.com/graphql. Instead of GET requests against many REST endpoints, GraphQL uses queries, and a query returns only the data it specifies. gh is a client Kaiaulu relies on to access GitHub's REST and GraphQL APIs; it is what will be used to access GraphQL's single endpoint.
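
For illustration only, a minimal sketch of how a query could be sent through gh's single GraphQL endpoint from R (the query string and token handling here are placeholders, not Kaiaulu's actual functions):

require(gh)

# Token is assumed to be a GitHub personal access token in an environment
# variable; adjust to however Kaiaulu loads credentials.
token <- Sys.getenv("GITHUB_TOKEN")

# The whole request is a single POST to /graphql whose body carries the query.
query_str <- '
query {
  repository(owner: "sailuh", name: "kaiaulu") {
    discussions(first: 1) {
      edges { node { title createdAt } }
    }
  }
}'

response <- gh::gh("POST /graphql", query = query_str, .token = token)
response[["data"]][["repository"]][["discussions"]][["edges"]]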

Limitations


As of now, there are several points of interest that cannot be retrieved with the GraphQL API endpoint.

Task List


crepesAlot commented 1 month ago

@carlosparadis I have some questions on creating a parser for the downloaded file. How should I determine what should and should not be saved? Using the GitHub Issue Events endpoint as an example, the parser looks like this: https://github.com/sailuh/kaiaulu/blob/810c183260e7f7f97ed7bc8d0f804647dba8c245/R/github.R#L150-L187 Compared to the example response documented for the issue events REST endpoint, quite a lot is left out: https://docs.github.com/en/rest/issues/events?apiVersion=2022-11-28#get-an-issue-event Is it up to my own discretion what is and isn't important?
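
For context on what such a field-selecting parser looks like in general, here is a toy sketch (the function name and field choices are illustrative only, not Kaiaulu's actual parser):

# Toy sketch: read the downloaded JSON and keep only a handful of fields.
parse_issue_events_sketch <- function(json_path) {
  events <- jsonlite::read_json(json_path)
  rows <- lapply(events, function(e) {
    data.frame(
      event = e[["event"]],
      actor = e[["actor"]][["login"]],
      created_at = e[["created_at"]],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)  # everything not selected above is simply dropped
}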

carlosparadis commented 1 month ago

There are refresher functions in R/github.R if you want to learn how that is done (I believe it is the search endpoint). There may be another notebook besides the comments one. However, to implement refresh you need an API endpoint that lets you select at least a starting date for the comments. Does this endpoint give you that?

Second, have you looked through GitHub's documentation to see if this is the only way to download Discussion comments? GitHub sometimes offers multiple API endpoints, so you want to be careful you don't end up using the wrong one.

Third, you may want to just try the request for the JSON in the browser (you can construct the request as a URL). Please don't paste the URL here with your API key, but do place the URL here as an example with a PLACEHOLDER, as @beydlern did.

What gets parsed depends on what we discuss here being relevant to the various analyses Kaiaulu does, so the easiest way would be for you to suggest something and for us to agree on it (I do need you to make sure you are considering all possible endpoints).

Also, the motivation in your issue specification (purpose section) sounds a bit strange to me. It reads more like GitHub's motivation than our own. Our own motivation ties closer to @daomcgill's work. Dao downloads mailing list data, and mailing lists can be about developer communication, user communication, or more. Back in the day, a lot of projects used mailing lists for both. This goes back to before issue trackers even existed, let alone GitHub.

Nowadays issues exist, so the "dev mailing list" in a lot of projects moved to the issue tracker. The equivalent of the user mailing list is Discussions on GitHub (though other projects may use something else). Therefore, the purpose of making this capability available in Kaiaulu is so we can mine user interaction with projects hosted on GitHub. Some research may be interested in understanding how projects interact with their users, for example for community health analysis (there are hundreds of studies that analyze StackOverflow questions!).

carlosparadis commented 1 month ago

I also suggest you take a look at the user side of Discussions so you understand the data you are getting (or not):

(Please don't create random questions, as it will pollute the kaiaulu repo, but you can always create a sandbox repo on your own account to play with it and delete your sandbox repo later):

https://github.com/sailuh/kaiaulu/discussions/new/choose

Notice how there are 5 types of categories. It is easier for us to discuss what data makes sense if you explain here what you find, beyond what I can already see.

carlosparadis commented 1 month ago

One last note: Before you spend too much time on code and API, you should make sure the endpoint is the correct one: I just noticed the API asks for a Team Slug. I have no idea what that is. If you go to the "Discussions" tab on Kaiaulu, you will notice there is no notion of Teams. It is just plain and simple discussions. So the URL I gave may be for another type of Discussions.

https://docs.github.com/en/search?query=discussions

You should check the GitHub Docs and Google to see if it is even possible to obtain the data in the first place!

crepesAlot commented 1 month ago

I'll likely have to rename this issue as well as rewrite much of the process section. After looking into it, the REST endpoint isn't actually for Discussions but for GitHub Team Discussions, a completely different thing, which was my mistake. I'm currently looking into the GraphQL API: https://docs.github.com/en/graphql/guides/using-the-graphql-api-for-discussions

carlosparadis commented 1 month ago

Sounds good!

crepesAlot commented 1 month ago

@RavenMarQ @carlosparadis Just mentioning this as I look into the GraphQL API. I'm still investigating, but it looks like it might be better to use the GraphQL API rather than the REST API for pull requests too: https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql This guide describes how multiple REST calls can be replaced with fewer GraphQL queries, which is applicable when attempting to retrieve pull requests, commits, non-review comments, and reviews.

crepesAlot commented 2 weeks ago

@carlosparadis I've become more familiar with how GraphQL queries work and have started creating the functions. While I work on them, I wanted to run past you the information I am retrieving with the query. The information I am getting from each discussion (see the example response below) is the title, body text, author, creation date, category, answer, and the comments.

Is there other information that is either missing or isn't needed from this query? My current plan is to create one function to retrieve all the information, then one parser for the discussion posts and another for the comments. However, I may change this as I work on it: the query only gets the first/last x discussions/comments, so I may need a workaround so that the function can retrieve all of the discussions and comments.


Here is an example response I got for the first discussion post listed in Kaiaulu's discussions:

{
  "data": {
    "repository": {
      "discussions": {
        "edges": [
          {
            "node": {
              "title": "Extracting features from a git repo with Kaiaulu",
              "bodyText": "Which of the following data points can be extracted from a git repo with Kaiaulu?  And for those that can be extracted, could you provide the instructions and/or a link to them?\n• total # of outstanding bugs\n• total # of outstanding non-bugs (typically feature requests)\nThe following are for a given period of time:\n• # of new bug issues\n• # of new non-bug issues\n• average bug-resolution time\n• average non-bug resolution time\n• # of active contributors\n• # of new contributors\n• # of bug-fixing commits\n• # of non-bug-fixing commits\n• # of LOC committed for bug resolution\n• # of LOC committed for non-bug resolution\n• # of emails on the project mailing list",
              "author": {
                "login": "BenjyNStrauss"
              },
              "createdAt": "2024-10-01T22:01:58Z",
              "category": {
                "name": "Q&A"
              },
              "answer": {
                "id": "DC_kwDOD0xXC84ApQNr"
              },
              "comments": {
                "edges": [
                  {
                    "node": {
                      "bodyText": "Hi Beni, @rnkazman\nThe way to go about Kaiaulu is asking yourself first \"where is the data coming from?\"\nFor bug data, that means you need to collect issue tracker data first. Then the question is, from which issue tracker? Kaiaulu can get you data from JIRA, GItHub, and Bugzilla.\nOnce you decide on that, you can go to the respective menu on the \"Reference\" page for any of these:\nhttp://itm0.shidler.hawaii.edu/kaiaulu/reference/index.html#-jira-\nand subsequently, see the Notebook to obtain the data. From the table obtained, you can then calculate any metrics you wish.\nSince you are interested in calculating a Metric out of the data above, then you can click on the \"Metric\" menu on the right:\nhttp://itm0.shidler.hawaii.edu/kaiaulu/reference/index.html#-metrics-\nYou will find there is a Bug Count Notebook there.\n\nThe same process can be used for your other metrics. For example, if you want to calculate contributor metrics, you can again ask yourself \"What is the data that I need to obtain this metric?\"\nThat would be \"Git\". Again, you can go to the docs page above and click \"Git\" and see the associated Git Log table. Same with Mail, etc. Same with \"Mail\".\n\nAs for how to link them: Depends on what you want to link them on, and what granularity. If you can give me something more specific, I can give you pointers.\nThe bottomline is: Kaiaulu will give you tables, and for most of them you will be making inner joins out of them to link. If you plan to connect people, see the \"Identity\" section and the associated Notebook.\nIf you are looking for a one button press solution to create the metrics above, we do not have that. But it should be relatively simple (group by, subset, inner joins) to get to them from the tables Kaiaulu gives you.",
                      "author": {
                        "login": "carlosparadis"
                      },
                      "id": "DC_kwDOD0xXC84ApQNr",
                      "createdAt": "2024-10-01T22:56:50Z"
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}
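
For reference, a query shaped roughly like the following would produce a response of this form (a sketch reconstructed from the fields in the response above; the first: values are placeholders):

discussion_query <- '
query {
  repository(owner: "sailuh", name: "kaiaulu") {
    discussions(first: 1) {
      edges {
        node {
          title
          bodyText
          author { login }
          createdAt
          category { name }
          answer { id }
          comments(first: 10) {
            edges {
              node {
                bodyText
                author { login }
                id
                createdAt
              }
            }
          }
        }
      }
    }
  }
}'

# Sent the same way as any other GraphQL query through gh, e.g.:
# gh::gh("POST /graphql", query = discussion_query, .token = token)
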
carlosparadis commented 2 weeks ago

@crepesAlot thank you for the update! I guess one question here is: when you create a discussion it can be Q&A but also other categories. How does this affect the data format?

Could you create a fork (not on Kaiaulu) and, on your fork, experiment with the discussion categories to see what you get out of the API? For instance, Poll, Q&A, and the others look like their JSON would be different.

(Screenshot: discussion category options, 2024-11-05)

crepesAlot commented 2 weeks ago

@carlosparadis I actually found that the format doesn't change at all. It still retrieves the title, body, and any comments under the discussion without any issues, regardless of its category. The answer field simply returns null. The only exception is polls: the query doesn't get the poll's question and answer options. But it looks as though polls are just too new and none of the APIs support them yet.

I'm also hopeful that the refresher function will be relatively easy to create: not only can I get the time a discussion was created, I can also filter the discussions more easily.

The signature of the discussions connection from the documentation:

discussions(
  after: String,
  before: String,
  first: Int,
  last: Int,
  categoryId: ID = null,
  answered: Boolean = null,
  orderBy: DiscussionOrder = {field: UPDATED_AT, direction: DESC}
) : DiscussionConnection

They also list all the information you can pull from GitHub Discussions here: https://docs.github.com/en/graphql/guides/using-the-graphql-api-for-discussions#discussion
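
For example, a refresher could combine orderBy with cursor pagination: order by creation date descending, page with after, and stop once createdAt falls before the last saved date. A rough sketch under those assumptions (the function name is illustrative, not an existing Kaiaulu function):

refresh_discussions_sketch <- function(owner, repo, since, token) {
  cursor <- NULL
  results <- list()
  repeat {
    after_arg <- if (is.null(cursor)) "" else sprintf(', after: "%s"', cursor)
    query_str <- sprintf('
    query {
      repository(owner: "%s", name: "%s") {
        discussions(first: 100%s, orderBy: {field: CREATED_AT, direction: DESC}) {
          pageInfo { hasNextPage endCursor }
          edges { node { title createdAt } }
        }
      }
    }', owner, repo, after_arg)
    page <- gh::gh("POST /graphql", query = query_str, .token = token)
    conn <- page[["data"]][["repository"]][["discussions"]]
    nodes <- lapply(conn[["edges"]], `[[`, "node")
    # Keep only discussions newer than the last saved date (ISO 8601 strings
    # in UTC compare correctly as plain strings).
    new_nodes <- Filter(function(n) n[["createdAt"]] > since, nodes)
    results <- c(results, new_nodes)
    if (length(new_nodes) < length(nodes) || !conn[["pageInfo"]][["hasNextPage"]]) break
    cursor <- conn[["pageInfo"]][["endCursor"]]
  }
  results
}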

carlosparadis commented 2 weeks ago

You can go ahead and proceed with the code for this! One open question, if this is the same output for all categories: what about upvoted answers? Can we not obtain the number of upvotes?

crepesAlot commented 2 weeks ago

Unfortunately, there doesn't seem to be a way to get the number of upvotes. The closest thing would be getting the reactions to comments, such as a thumbs up, but those are separate from upvotes.

crepesAlot commented 2 weeks ago

@carlosparadis I'm having some difficulties with github_api_showcase.Rmd. I'm trying to use the code in the notebook to download commits so I can see what the end result of the parse functions should look like for my own parser function, but I've been getting some errors.

> github_api_iterate_pages(token,gh_response,save_path_commit,prefix="commit")
Warning: cannot open file '../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json': No such file or directory
Error in file(con, "w") : cannot open the connection

I haven't been able to figure out what the problem is. I ran the following lines to try to find it.

> file.exists("../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json")
[1] TRUE
> writeLines("test", con= "../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json")
> readLines("../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json")
[1] "test"

Shouldn't this mean that the file exists and that I have write permissions? Or have I made some major misunderstanding of how the functions work?

carlosparadis commented 2 weeks ago

Did you try opening the function definition and running it one line at a time? I think the file path constructed inside, or relative to where you are running from, may just be incorrect.
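
For example, something along these lines (the path is copied from the error above) can help confirm whether the directory the function writes into actually exists relative to the notebook's working directory:

getwd()  # where the notebook chunk is actually running from
save_dir <- "../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu"
dir.exists(save_dir)  # file(con, "w") cannot create missing parent directories
# If FALSE, create the full directory tree before calling github_api_iterate_pages():
dir.create(save_dir, recursive = TRUE, showWarnings = FALSE)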

crepesAlot commented 1 week ago

@carlosparadis Thankfully, it looks like we solved the problems with the notebooks and the gh tool. After re-creating the file directory and including require(gh) in the notebook, the functions were able to run and download data perfectly fine. The problem didn't lie in the version of the gh tool after all.