pingcap / ossinsight

Analysis, Comparison, Trends, Rankings of Open Source Software, you can also get insight from more than 7 billion with natural language (powered by OpenAI). Follow us on Twitter: https://twitter.com/ossinsight
https://ossinsight.io/
Apache License 2.0
1.79k stars 333 forks source link

Blank commit_ids #939

Open llk23r opened 1 year ago

llk23r commented 1 year ago

⚠️ Issue: It looks like commit_id is not stored properly.

Query:

select commit_id from github_events where repo_id = 1181927 order by event_day desc limit 10;

Response:

{
    "requestedAt": "2022-11-05T09:20:25.860+00:00",
    "finishedAt": "2022-11-05T09:20:28.086+00:00",
    "spent": 2.226,
    "sql": "SELECT `commit_id` FROM `github_events` WHERE `repo_id` = 1181927 ORDER BY `event_day` DESC LIMIT 10",
    "fields": [
        {
            "name": "commit_id"
        }
    ],
    "data": [
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        },
        {
            "commit_id": ""
        }
    ]
}

Blank UI:

Screenshot 2022-11-05 at 3 09 28 PM

Another example:

Query:

SELECT
  DISTINCT commit_id
FROM
  github_events
WHERE
  repo_id = 100401665;

Response:

{
    "requestedAt": "2022-11-05T09:54:45.242+00:00",
    "finishedAt": "2022-11-05T09:54:45.286+00:00",
    "spent": 0.044,
    "sql": "SELECT DISTINCT `commit_id` FROM `github_events` WHERE `repo_id` = 100401665 LIMIT 100",
    "fields": [
        {
            "name": "commit_id"
        }
    ],
    "data": [
        {
            "commit_id": ""
        }
    ]
}

Possible cause:

Example Github event:

{
    "id": "25044688494",
    "type": "PushEvent",
    "actor": {
        "id": 5184102,
        "login": "CocoaPodsBot",
        "display_login": "CocoaPodsBot",
        "gravatar_id": "",
        "url": "https://api.github.com/users/CocoaPodsBot",
        "avatar_url": "https://avatars.githubusercontent.com/u/5184102?"
    },
    "repo": {
        "id": 2365549,
        "name": "CocoaPods/Specs",
        "url": "https://api.github.com/repos/CocoaPods/Specs"
    },
    "payload": {
        "push_id": 11561212689,
        "size": 1,
        "distinct_size": 1,
        "ref": "refs/heads/master",
        "head": "dad5d1730d00b686775e038cc7428692867fd849",
        "before": "62a80dcdc175f423b5cb6d4f6a2e9c5dd6030304",
        "commits": [
            {
                "sha": "dad5d1730d00b686775e038cc7428692867fd849",
                "author": {
                    "email": "riveraladinojorgeluis@gmail.com",
                    "name": "Jorge Luis Rivera Ladino"
                },
                "message": "[Add] CommonUIKitComponents 1.2.3",
                "distinct": true,
                "url": "https://api.github.com/repos/CocoaPods/Specs/commits/dad5d1730d00b686775e038cc7428692867fd849"
            }
        ]
    },
    "public": true,
    "created_at": "2022-11-05T04:41:23Z",
    "org": {
        "id": 1189714,
        "login": "CocoaPods",
        "gravatar_id": "",
        "url": "https://api.github.com/orgs/CocoaPods",
        "avatar_url": "https://avatars.githubusercontent.com/u/1189714?"
    }
}

I'd assume that the commit_id be picked from payload.commits[][sha] but looks like it instead gets picked from

event.dig("payload", "comment", "commit_id") || ''

at: https://github.com/pingcap/ossinsight/blob/22356832e00106162d9516a435db398cb9cbcbb7/etl/config/initializers/fetch_event.rb#L64?

hooopo commented 1 year ago

For the CommitCommentEvent, the commit_id is the commented commit, and this data is correct.

For PushEvent, we really should extract commit_ids or commit_shas, this field will be added later, thank you for your reminder 👍

id                     | 22127996623
type                   | CommitCommentEvent
created_at             | 2022-06-02 15:23:40
repo_id                | 498029471
repo_name              | luis-unlam/Trabajo-React-GM2-
actor_id               | 35613825
actor_login            | vercel[bot]
language               |
additions              | 0
deletions              | 0
action                 |
number                 | 0
commit_id              | b48fee47ee26207a358186d89a2955618c60fb2c
comment_id             | 75183688
org_login              |
org_id                 | 0
state                  |
closed_at              | 1970-01-01 00:00:00
comments               | 0
pr_merged_at           | 1970-01-01 00:00:00
pr_merged              | 0
pr_changed_files       | 0
pr_review_comments     | 0
pr_or_issue_id         | 0
event_day              | 2022-06-02
event_month            | 2022-06-01
event_year             | 2022
push_size              | 0
push_distinct_size     | 0
creator_user_login     | vercel[bot]
creator_user_id        | 35613825
pr_or_issue_created_at | 1970-01-01 00:00:00
llk23r commented 1 year ago

this field will be added later

Can the existing field not be used? Why add a new field altogether and not reuse the existing field?

Assuming that CommitCommentEvent and PushEvent are mutually exclusive, using the same commit_id field should be possible, assuming the rest of the fields are redundant, i.e. multiple rows with same id but different commit_id. Looking at the schema, the id is not-unique.