Closed Alfusainey closed 5 years ago
oh, I was able to see from the PostType
table that 99 is a comment and the SO link confirms that too.
I just have a follow up question: If the link points to a comment, can we distinguish if it points to a code block or a text block (in the comment itself), similar to what we have for Posts?
I added the PostTypeId
99 as a workaround to represent links to comments in table PostReferenceGH
. It's not part of the official Stack Overflow data dump.
As far as I know, comments cannot contain code blocks, only inline code. You could use something like this to split the comments into the code and non-code parts:
SELECT
c.Id AS CommentId,
Offset+1 AS LocalId,
Block
FROM (SELECT DISTINCT CommentId FROM `sotorrent-org.2018_12_09.PostReferenceGH`) ref
JOIN `sotorrent-org.2018_12_09.Comments` c
ON ref.CommentId = c.Id
JOIN UNNEST(SPLIT(Text, '`')) AS Block
WITH OFFSET AS Offset
ORDER BY CommentId, LocalId;
waaw, amazing! I ran the above query on bigquery, save it into a new table and queried the comment with id 30165888
and the results:
the file that reference the comment directly copied that snippet :-)
Thank you!
You're welcome :-)
BTW: An empty block with LocalId
1 indicates that the first block is a text block (empty split before the first backtick), an empty last block indicates that the last block is a code block (empty split after the last backtick).
Similar to the post blocks, there may be other variants of inline code in the comments (in addition to the standard backtick approach), see these patterns for examples from the posts.
If you are only interested in the code, you could also use this approach:
SELECT
c.Id AS CommentId,
REGEXP_EXTRACT_ALL(Text, r"`([^`]+)`") as Code
FROM (SELECT DISTINCT CommentId FROM `sotorrent-org.2018_12_09.PostReferenceGH`) ref
JOIN `sotorrent-org.2018_12_09.Comments` c
ON ref.CommentId = c.Id;
thanks @sbaltes, thats really helpful! yes, i am mostly interested in the code blocks
BTW: other than the official bigquery docs, do you have good pointers for learning BQ?
Except for the official documentation, I'm aware of the following resources:
thanks! one last question (next time I will open a new issue): what does the column Copies
in the PostReferenceGH
table means?
@sbaltes: In the 2018_12_09 dataset, I saw 99 as the value of
PostTypeId
in thePostReferenceGH
table, however, thePostType
table only contain values between 1-8. What post type is 99?