turbot / steampipe-plugin-github

Use SQL to instantly query repositories, users, gists and more from GitHub. Open source CLI. No DB required.
https://hub.steampipe.io/plugins/turbot/github
Apache License 2.0
72 stars 28 forks source link

Add table github_outside_collaborators #404

Closed olafz closed 5 months ago

olafz commented 6 months ago

Steampipe already supports querying regular members of an organization through the github_organization_member table. There is another way people can access an arganization's repositories: as outside collaborator.

An outside collaborator is a person who is not a member of an organization, but has access to one or more of the organization's repositories

It would be very nice if we could query these as well using Steampipe.

References The API documentation to query outside collaborators for an organization is here: https://docs.github.com/en/rest/orgs/outside-collaborators?apiVersion=2022-11-28

ParthaI commented 6 months ago

Thank you, @olafz, for bringing this matter to our attention 👍. We'll investigate it further.

ParthaI commented 6 months ago

Hello, @olafz, upon reviewing our current plugin code, I noticed we already have a github_repository_collaborator table. This should help us gather information on collaborators (OUTSIDE, DIRECT, and ALL) for a repository. You can specify the type of collaborators you're interested in by using the affiliation option in the where clause, with ALL being the default setting. For example: select * from github_repository_collaborator where repository_full_name = 'turbot/steampipe-plugin-aws' and affiliation = 'outside'. When querying github_repository_collaborator, providing the repository_full_name is mandatory, while specifying the affiliation parameter is optional (with possible values: direct, outside, and all).

Please try this approach and let me know if it's beneficial.

Thank you!

olafz commented 6 months ago

Hi @ParthaI,

Thanks for your feedback. Your suggestion of using the github_repository_collaborator does work, but I have an use case which makes using this table extremely inefficient.

I would like to create an overview of all external collaborators in a GitHub organization. If I were to use the github_repository_collaborator table, I would have to create a query that iterates (join) over all repositories in the GitHub organization (since repository_full_name is a mandatory parameter).

In our case - because of the size of the organization - that would mean going over thousands and thousands of repositories, where most of them don't even have an outside collaborator. It would be much more efficient if I were able to do it the other way around: first query for all outside collaborators and then go from there.

Thanks again for looking into this issue and I hope this creates some clarity. Also: if there is another way to achieve this efficiently, let me know :)

ParthaI commented 6 months ago

Hello, @olafz, I was trying to achieve the goal as per the comments provided in the in the issue description.

However, I am facing a few issues while achieving the goal. I found some articles and community discussion related to my issue. I will give it a try once more.

I constructed a query to get outside collaborators at the organization level. Could you please give it a try in Github GraphQL Explorer and share your feedback with us if the following GraphQL query works as expected?

    query {
        organization(login: "<YOUR ORG NAME>") {
            url
            login

            repositories(first: 100) {
                pageInfo {
                    endCursor
                    hasNextPage
                }

                nodes {
                    name
                    collaborators(affiliation: OUTSIDE) {
                        edges {
                            permission
                        }
                        nodes {
                            url
                            login
                            name
                            email
                            company
                        }
                    }
                }
            }
        }
    }

Thank You!

olafz commented 6 months ago

Hello @ParthaI,

Thanks for the follow-up. I tested the query you provided and it seems to do the job; it looks pretty good.

See below the example from the first response for one repository with external collaborators:

{
  "data": {
    "organization": {
      "url": "https://github.com/<YOUR ORG NAME>",
      "login": "<YOUR ORG NAME>",
      "repositories": {
        "pageInfo": {
          "endCursor": "...",
          "hasNextPage": true
        },
        "nodes": [
          {
            "name": "<REPOSITORY NAME>",
            "collaborators": {
              "edges": [
                {
                  "permission": "READ"
                },
                {
                  "permission": "WRITE"
                }
              ],
              "nodes": [
                {
                  "url": "https://github.com/<SOME OUTSIDE COLLABORATOR>",
                  "login": "<SOME OUTSIDE COLLABORATOR>",
                  "name": "...",
                  "email": "...",
                  "company": "..."
                },
                {
                  "url": "https://github.com/<SOME OTHER OUTSIDE COLLABORATOR>",
                  "login": "<SOME OTHER OUTSIDE COLLABORATOR>",
                  "name": "...",
                  "email": "...",
                  "company": null
                }
              ]
            }
          },
          ...
ParthaI commented 6 months ago

Thank you, @olafz, for verifying that the GraphQL query works well in the GitHub GraphQL Explorer. I'll proceed with the required adjustments on the plugin side and inform you as soon as the modifications are complete.

I appreciate your cooperation!

ParthaI commented 6 months ago

Hello @olafz, I've introduced a new table called github_organization_collaborator in the issue-404 branch.

Table Usage:

Unfortunately, I was unable to test the changes due to missing organization-level permissions. Your assistance in testing these changes on the PR branch and providing feedback would be greatly appreciated.

Testing Steps in the PR Branch:

  1. Clone the repository: git clone https://github.com/turbot/steampipe-plugin-github.git
  2. Navigate to the directory: cd steampipe-plugin-github
  3. Switch to the issue-404 branch: git checkout issue-404
  4. Execute make
  5. Run your query.

Thank you!

olafz commented 6 months ago

Hi @ParthaI,

Awesome; thanks! I'm going to give it a try somewhere in the next couple of days.

ParthaI commented 5 months ago

Hello, @olafz, have you had the opportunity to try it out yet?

olafz commented 5 months ago

Hello @ParthaI,

Thank you for your patience. I've been testing this branch, but unfortunately it did not work. The following query:

> SELECT * FROM github_organization_collaborator WHERE organization = '...';

Consistently results in

Error: github: non-200 OK status code: 502 Bad Gateway body:
{
    "data": null,
    "errors":
    [
        {
            "message":"Something went wrong while executing your query. This may be the result of a timeout, or it could be a GitHub bug. Please include `86E8:...:...:...:...` when reporting this issue."
        }
    ]
}

The GraphQL results still work, so I suspect Steampipe to do something inefficient? I've contacted GitHub support; maybe they can help figuring out what the cause is.

olafz commented 5 months ago

I've gotten a response from GitHub:

Our logs indicate that you're hitting our timeout threshold when fetching the data. All API requests, both for the REST API and GraphQL API, have a 10 second limit on execution time. If that limit is reached for a request, the request is terminated and you get back that error.

One of the things I notice, is that in other GitHub related Steampipe queries, I see a Loading results quickly after the query has been issued, increasing in number:

image

This goes on until there is an output. I don't see anything like that in the github_organization_collaborator query. In an attempt to further debug the issue, I ran the earlier query, but this time with a limit:

> SELECT * FROM github_organization_collaborator WHERE organization = '...' LIMIT 1;

That gave a clue:

Error: github: unable to parse hydrate item {0xc0006529f0 ADMIN {{992 MDQ6VXNlcjk5Mg== Olaf van Zandwijk} 
olafz  2008-02-26 21:22:23 +0000 UTC 2024-02-05 15:43:50 +0000 UTC https://github.com/olafz}} 
as an User (SQLSTATE HV000)
ParthaI commented 5 months ago

Hi @olafz, apologies for the late reply, and I greatly appreciate your patience.

I've submitted a PR to address this issue. The table should now return the results correctly.

Thanks!

olafz commented 5 months ago

Hi @ParthaI,

The new version works better, but still not good :) The query now returns results as long as I limit the results to a relatively low amount:

> SELECT * FROM github_organization_collaborator WHERE organization = '...' LIMIT 50;

Without a limit, or with a larger limit (> 100), I still see the same behavior as before:

I still observe the progress for this query differently compared to other GitHub queries. They show a progress indicator:

Other queries: image

This query just shows Executing query... for a while and then fails:

> SELECT * FROM github_organization_collaborator WHERE organization = '...' LIMIT 100;
image
Error: github: non-200 OK status code: 502 Bad Gateway body: "..." (SQLSTATE HV000)
ParthaI commented 5 months ago

Hello @olafz, apologies for the snag!!

Could you please raise a separate issue as per the issue comment? I will take a look.

Please include the following details in the new issue body:

Are you aware of the rough number of collaborators within your organization? The issue might stem from the extensive datasets in the org. I'll attempt to replicate it locally by tweaking the plugin code.

Could you describe how the query SELECT * FROM github_organization_collaborator WHERE organization = '...' and affiliation = 'OUTSIDE' behaves, both with and without the limit clause?

Thanks!

olafz commented 5 months ago

I've opened a new issue here: https://github.com/turbot/steampipe-plugin-github/issues/413

Thanks for the assistance!