igrigorik / gharchive.org

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
https://www.gharchive.org
MIT License
2.68k stars 206 forks source link

Labels per Issue #184

Closed cptanalatriste closed 8 months ago

cptanalatriste commented 6 years ago

I'm trying to run some analysis tasks over the labels an issue gets attached. I started running this query on BigQuery:

SELECT repo.name, JSON_EXTRACT_SCALAR(payload, '$.issue.labels') labels
FROM [githubarchive:day.20150101] 
WHERE type IN ('IssuesEvent')
AND JSON_EXTRACT_SCALAR(payload, '$.action') = 'opened'
AND JSON_EXTRACT_SCALAR(payload, '$.issue.labels') is not null
LIMIT 50

And I obtained no results, even if I remove the action filter. Does this information get mined in the dataset?

igrigorik commented 6 years ago

Carlos, have you verified that issue labels are present in the data? The Events API does not (unfortunately) surface all the issue metadata (see related discussion in https://github.com/igrigorik/gharchive.org/issues/183) so I wouldn't be surprised if this data is missing.

/cc @annafil

annafil commented 6 years ago

@cptanalatriste wanted to follow up here and echo what @igrigorik said in case you are still working on this issue -- I believe issue labels are not included in the /events API feed coming from GitHub (just webhooks), so I would not expect to see them in the GHArchive data set.

raghavsehtia commented 6 years ago

Hey! I believe the information is there (at least some of it), however since it's nested, you must index into the field in order to get the actual values. I ran the following query SELECT repo.name, JSON_EXTRACT_SCALAR(payload, '$.issue.labels[0].name') label1, JSON_EXTRACT_SCALAR(payload, '$.issue.labels[1].name') label2, JSON_EXTRACT_SCALAR(payload, '$.issue.labels[2].name') labels3 FROM [githubarchive:day.20150101] WHERE type IN ('IssuesEvent') AND JSON_EXTRACT_SCALAR(payload, '$.action') = 'opened' LIMIT 50 and my result is attached.

bg-final

Please let me know if this is incorrect or if I may have misunderstood the problem!

nkorojoseph commented 6 years ago

Sorry this reply is coming late.. I will check it out tomorrow, please

On Tue, 5/29/18, raghavsehtia notifications@github.com wrote:

Subject: Re: [igrigorik/gharchive.org] Labels per Issue (#184) To: "igrigorik/gharchive.org" gharchive.org@noreply.github.com Cc: "Subscribed" subscribed@noreply.github.com Date: Tuesday, May 29, 2018, 5:56 PM

Hey!

I believe the information is there (at least some of it), however since it's nested, you must index into the field in order to get the actual values. I ran the following query

SELECT repo.name, JSON_EXTRACT_SCALAR(payload, '$.issue.labels[0].name') label1, JSON_EXTRACT_SCALAR(payload, '$.issue.labels[1].name') label2, JSON_EXTRACT_SCALAR(payload, '$.issue.labels[2].name') labels3 FROM [githubarchive:day.20150101] WHERE type IN ('IssuesEvent') AND JSON_EXTRACT_SCALAR(payload, '$.action') = 'opened' LIMIT 50

and my result is attached.

Please let me know if this is incorrect or if I may have misunderstood the problem!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.