Closed CosmosAtlas closed 5 years ago
Just for better reviewability, attached the problematic GHUrl to the misclassified row.
https://raw.githubusercontent.com/Demonware/cog/trunk/cog/util.py
Thanks for reporting! I'm currently investigating the issue and will get back to you soon.
The PostTypeId
in table PostReferenceGH
is based on the link found in the corresponding file. In the file you linked, the user simply posted the wrong link (using a
instead of q
). However, Stack Overflow directs such requests to the correct thread.
I thought about using data from table Posts
to set the PostTypeId
correctly, but than that column would be inconsistent with the matched content. Therefore, if you don't have further thoughts on this, I'll to keep the current behavior, but add a comment to the SOTorrent project page, mentioning that column PostTypeId
in table PostReferenceGH
is based on the extracted link, not on the data from Stack Overflow.
My take on this is that PostTypeId
and PostId
have a one to one relation. If remaining the current way, there contain errors in the table.
Since the SOUrl
is often not the actual matched Url anyway (actual matched is in GHMatches.csv
, why not have the correct SOUrl
and PostTypeId
in PostReferenceGH.csv
?
My take on this is that PostTypeId and PostId have a one to one relation. If remaining the current way, there contain errors in the table.
Yes, but those inconsistencies are caused by GitHub users posting wrong Stack Overflow links. The question is whether those issues should be fixed by the SOTorrent extraction scripts. For example, I also don't check whether the PostId
s are valid.
Since the SOUrl is often not the actual matched Url anyway (actual matched is in GHMatches.csv, why not have the correct SOUrl and PostTypeId in PostReferenceGH.csv?
The SOUrl
s are normalized to sharing links.
There is no one-to-one connection between the individual extracted URLs in PostReferenceGH
and the matches in GHMatches
. Those tables are only connected via column FileId
. One line may contain several links pointing to Stack Overflow posts.
I will consider fixing broken PostTypeId
s or excluding invalidPostId
s in an upcoming release. Until then, feel free to ignore column PostTypeId
. You can always join PostReferenceGH
and Posts
based in PostId
to retrieve the correct PostTypeId
.
Also, we use PostTypeId
99 to mark links to comments, which does not exist in the official data dump.
To avoid confusion, it may be better to drop column PostTypeId
for table PostReferenceGH
. Comments can also identified by column CommentId
not being NULL
and for posts, the PostTypeId
can be retrieved from table Posts
.
Yes, but those inconsistencies are caused by GitHub users posting wrong Stack Overflow links. The question is whether those issues should be fixed by the SOTorrent extraction scripts. For example, I also don't check whether the
PostId
s are valid.
I can confirm that it's true, there are PostId
from PostReferenceGH.csv
that have no entry in Post.XML
I guess it's a really hard problem to format the data then. Maybe better documentation on the subject could help people have a more thourough understanding of the dataset.
Thanks for the fast update on the problem!
No worries and thanks for reporting this!
I'll try to figure out how many PostId
s and CommentId
s are invalid, maybe I can filter them out even in the current release. The upload to Zenodo is not finished yet and I can always modify the BigQuery version. And you are right: I should update the documentation afterwards.
About 0.36% of the links in table PostReferenceGH
have either an invalid PostId
or an invalid CommentId
. I used the distinct Id
s in table Comment
and the distinct PostId
s in table PostHistory
to check this. The latter implies that there could still be some PostId
s that are not in table Posts
(e.g., deleted posts with history in PostHistory
).
I decided to remove column PostTypeId
and also the custom id 99 for comments. To retrieve the PostTypeId
of links pointing to questions or answers, one could use:
SELECT links.*, PostTypeId
FROM `sotorrent-org.2019_03_17.PostReferenceGH` links
JOIN `sotorrent-org.2019_03_17.Posts` posts
ON links.PostId = posts.Id
WHERE CommentId IS NULL;
I further added a column PostIds
to table GHMatches
, containing a space-separated list of post ids found in the matched line.
In the table
PostReferenceGH.csv
, some entries are self conflicting byPostId
andPostTypeId
. Example given above, the last row with samePostId
identified with differentPostTypeId
. This issue exists in more than 1 row.