sotorrent / db-scripts

SQL and Bash scripts to import the offical Stack Overflow data dump and the SOTorrent data set, to retrieve Stack Overflow references from the BigQuery GitHub data set, and to retrieve data from the SOTorrent dataset for analysis.
Apache License 2.0
14 stars 7 forks source link

Values of PostTypeId Column in PostReferenceGH #15

Closed Alfusainey closed 5 years ago

Alfusainey commented 5 years ago

@sbaltes: In the 2018_12_09 dataset, I saw 99 as the value of PostTypeId in the PostReferenceGH table, however, the PostType table only contain values between 1-8. What post type is 99?

Alfusainey commented 5 years ago

oh, I was able to see from the PostType table that 99 is a comment and the SO link confirms that too.

I just have a follow up question: If the link points to a comment, can we distinguish if it points to a code block or a text block (in the comment itself), similar to what we have for Posts?

sbaltes commented 5 years ago

I added the PostTypeId 99 as a workaround to represent links to comments in table PostReferenceGH. It's not part of the official Stack Overflow data dump.

As far as I know, comments cannot contain code blocks, only inline code. You could use something like this to split the comments into the code and non-code parts:

SELECT
  c.Id AS CommentId,
  Offset+1 AS LocalId,
  Block
FROM (SELECT DISTINCT CommentId FROM `sotorrent-org.2018_12_09.PostReferenceGH`) ref
JOIN `sotorrent-org.2018_12_09.Comments` c
ON ref.CommentId = c.Id
JOIN UNNEST(SPLIT(Text, '`')) AS Block
WITH OFFSET AS Offset
ORDER BY CommentId, LocalId;
Alfusainey commented 5 years ago

waaw, amazing! I ran the above query on bigquery, save it into a new table and queried the comment with id 30165888 and the results:

screen shot 2019-02-11 at 16 53 23

the file that reference the comment directly copied that snippet :-)

Thank you!

sbaltes commented 5 years ago

You're welcome :-)

BTW: An empty block with LocalId 1 indicates that the first block is a text block (empty split before the first backtick), an empty last block indicates that the last block is a code block (empty split after the last backtick).

Similar to the post blocks, there may be other variants of inline code in the comments (in addition to the standard backtick approach), see these patterns for examples from the posts.

sbaltes commented 5 years ago

If you are only interested in the code, you could also use this approach:

SELECT
  c.Id AS CommentId,
  REGEXP_EXTRACT_ALL(Text, r"`([^`]+)`") as Code
FROM (SELECT DISTINCT CommentId FROM `sotorrent-org.2018_12_09.PostReferenceGH`) ref
JOIN `sotorrent-org.2018_12_09.Comments` c
ON ref.CommentId = c.Id;

2019-02-11 19_18_41-window

Alfusainey commented 5 years ago

thanks @sbaltes, thats really helpful! yes, i am mostly interested in the code blocks

BTW: other than the official bigquery docs, do you have good pointers for learning BQ?

sbaltes commented 5 years ago

Except for the official documentation, I'm aware of the following resources:

Alfusainey commented 5 years ago

thanks! one last question (next time I will open a new issue): what does the column Copies in the PostReferenceGH table means?