sotorrent / db-scripts

SQL and Bash scripts to import the offical Stack Overflow data dump and the SOTorrent data set, to retrieve Stack Overflow references from the BigQuery GitHub data set, and to retrieve data from the SOTorrent dataset for analysis.
Apache License 2.0
14 stars 7 forks source link

Inconsistent identification of PostTypeId in PostReferenceGH #16

Closed CosmosAtlas closed 5 years ago

CosmosAtlas commented 5 years ago

temp

In the table PostReferenceGH.csv, some entries are self conflicting by PostId and PostTypeId. Example given above, the last row with same PostId identified with different PostTypeId. This issue exists in more than 1 row.

CosmosAtlas commented 5 years ago

Just for better reviewability, attached the problematic GHUrl to the misclassified row.

https://raw.githubusercontent.com/Demonware/cog/trunk/cog/util.py

sbaltes commented 5 years ago

Thanks for reporting! I'm currently investigating the issue and will get back to you soon.

sbaltes commented 5 years ago

The PostTypeId in table PostReferenceGH is based on the link found in the corresponding file. In the file you linked, the user simply posted the wrong link (using a instead of q). However, Stack Overflow directs such requests to the correct thread.

I thought about using data from table Posts to set the PostTypeId correctly, but than that column would be inconsistent with the matched content. Therefore, if you don't have further thoughts on this, I'll to keep the current behavior, but add a comment to the SOTorrent project page, mentioning that column PostTypeId in table PostReferenceGH is based on the extracted link, not on the data from Stack Overflow.

CosmosAtlas commented 5 years ago

My take on this is that PostTypeId and PostId have a one to one relation. If remaining the current way, there contain errors in the table.

Since the SOUrl is often not the actual matched Url anyway (actual matched is in GHMatches.csv, why not have the correct SOUrl and PostTypeId in PostReferenceGH.csv?

sbaltes commented 5 years ago

My take on this is that PostTypeId and PostId have a one to one relation. If remaining the current way, there contain errors in the table.

Yes, but those inconsistencies are caused by GitHub users posting wrong Stack Overflow links. The question is whether those issues should be fixed by the SOTorrent extraction scripts. For example, I also don't check whether the PostIds are valid.

Since the SOUrl is often not the actual matched Url anyway (actual matched is in GHMatches.csv, why not have the correct SOUrl and PostTypeId in PostReferenceGH.csv?

The SOUrls are normalized to sharing links. There is no one-to-one connection between the individual extracted URLs in PostReferenceGH and the matches in GHMatches. Those tables are only connected via column FileId. One line may contain several links pointing to Stack Overflow posts.

I will consider fixing broken PostTypeIds or excluding invalidPostIds in an upcoming release. Until then, feel free to ignore column PostTypeId. You can always join PostReferenceGH and Posts based in PostId to retrieve the correct PostTypeId.

sbaltes commented 5 years ago

Also, we use PostTypeId 99 to mark links to comments, which does not exist in the official data dump. To avoid confusion, it may be better to drop column PostTypeId for table PostReferenceGH. Comments can also identified by column CommentId not being NULL and for posts, the PostTypeId can be retrieved from table Posts.

CosmosAtlas commented 5 years ago

Yes, but those inconsistencies are caused by GitHub users posting wrong Stack Overflow links. The question is whether those issues should be fixed by the SOTorrent extraction scripts. For example, I also don't check whether the PostIds are valid.

I can confirm that it's true, there are PostId from PostReferenceGH.csv that have no entry in Post.XML

I guess it's a really hard problem to format the data then. Maybe better documentation on the subject could help people have a more thourough understanding of the dataset.

Thanks for the fast update on the problem!

sbaltes commented 5 years ago

No worries and thanks for reporting this! I'll try to figure out how many PostIds and CommentIds are invalid, maybe I can filter them out even in the current release. The upload to Zenodo is not finished yet and I can always modify the BigQuery version. And you are right: I should update the documentation afterwards.

sbaltes commented 5 years ago

About 0.36% of the links in table PostReferenceGH have either an invalid PostId or an invalid CommentId. I used the distinct Ids in table Comment and the distinct PostIds in table PostHistory to check this. The latter implies that there could still be some PostIds that are not in table Posts (e.g., deleted posts with history in PostHistory).

I decided to remove column PostTypeId and also the custom id 99 for comments. To retrieve the PostTypeId of links pointing to questions or answers, one could use:

SELECT links.*, PostTypeId
FROM `sotorrent-org.2019_03_17.PostReferenceGH` links
JOIN `sotorrent-org.2019_03_17.Posts` posts
ON links.PostId = posts.Id
WHERE CommentId IS NULL;

I further added a column PostIds to table GHMatches, containing a space-separated list of post ids found in the matched line.