sotorrent / db-scripts

SQL and Bash scripts to import the official Stack Overflow data dump and the SOTorrent dataset, to retrieve Stack Overflow references from the BigQuery GitHub dataset, and to retrieve data from the SOTorrent dataset for analysis.
Apache License 2.0

help about PostReferenceGH #6

Closed (mashalahmad closed 6 years ago)

mashalahmad commented 6 years ago

Hey, could you please tell me what the Copies attribute means in the PostReferenceGH table?

And how can I find out whether a GitHub class has many clones from Stack Overflow?

I read your paper "Attribution Required: Stack Overflow Code Snippets in GitHub Projects", where you use CPD to detect the clones. Is it suitable? Or is there any way to get this from the PostReferenceGH table?

sbaltes commented 6 years ago

could you please tell me what the Copies attribute means in the PostReferenceGH table?

Sure, Copies indicates how often that exact file appears in the dataset. For a given FileId, it equals the result of:

SELECT COUNT(*)
FROM `sotorrent-org.2018_09_23.PostReferenceGH`
WHERE FileId="<FILE_ID>";
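To look at many files at once rather than one FileId at a time, the same count can be computed per file with a GROUP BY. The following is only a minimal sketch in Python, assuming an in-memory SQLite table as a stand-in for the BigQuery table: the sample rows, the `rows_per_file` helper, and the `NumRows` alias are invented for illustration, and only the table name and the FileId column come from this thread.

```python
import sqlite3

# Toy stand-in for the PostReferenceGH table. The rows are invented for
# illustration; only the FileId column mentioned in this thread is modeled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PostReferenceGH (FileId TEXT)")
sample_rows = [
    ("file_a",), ("file_a",), ("file_a",),
    ("file_b",),
    ("file_c",), ("file_c",),
]
conn.executemany("INSERT INTO PostReferenceGH (FileId) VALUES (?)", sample_rows)


def rows_per_file(connection):
    """Return (FileId, row count) pairs, highest count first."""
    query = """
        SELECT FileId, COUNT(*) AS NumRows
        FROM PostReferenceGH
        GROUP BY FileId
        ORDER BY NumRows DESC, FileId ASC
    """
    return connection.execute(query).fetchall()


counts = rows_per_file(conn)
for file_id, num_rows in counts:
    print(file_id, num_rows)
```

The SELECT statement inside `rows_per_file` is the part that would carry over to BigQuery, with the table name replaced by the fully qualified one used in the query above.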

And how can I find out whether a GitHub class has many clones from Stack Overflow? I read your paper "Attribution Required: Stack Overflow Code Snippets in GitHub Projects", where you use CPD to detect the clones. Is it suitable? Or is there any way to get this from the PostReferenceGH table?

Unfortunately, I can only provide support for the dataset here. It's up to you to find a suitable approach to detect the code clones. You could use CPD, but most likely only on a sample of projects and snippets. However, there are many other code clone detectors available. You could start with these papers:

The corresponding full paper for the ICSE extended abstract you mentioned is now also available: