karacolada closed 1 year ago
Distribution of GitHub links that are on the first 10 pages:
Of those that are on page 0, 10 were not research paper repos and 22 were.
Of those that are on page 1, 32 were research paper repos and 24 were not. The papers with correct repo links on page 1 either had a title page with no content or a longer introduction that stretched into the second page.
Currently blocked: the analysis runs so far removed duplicate GitHub links within a single paper. Rerunning now.
Found only 10 repository links that were duplicated within the same publication, of which only one was mentioned on the first 2 pages. However, all of these repositories were real RSE repositories created for the publication.
Unsure how to continue with this. The duplication analysis suggests that there are quite a few links mentioned on later pages, too, which we are missing with our current filtering method.
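For reference, a minimal sketch of how the within-paper duplication check could be reproduced. The record layout `(paper_id, url, page)`, the sample data, and the helper name `duplicated_links` are assumptions made for illustration, not the format the actual analysis uses.

```python
# Hypothetical extraction output: one (paper_id, url, page) record per mention.
# The sample records below are made up and only illustrate the shape of the data.
mentions = [
    ("paper-a", "https://github.com/example/repo-a", 0),
    ("paper-a", "https://github.com/example/repo-a", 7),
    ("paper-b", "https://github.com/example/repo-b", 1),
]


def duplicated_links(mentions):
    """Map (paper_id, url) to its sorted pages, for links mentioned more than once in one paper."""
    pages = {}
    for paper_id, url, page in mentions:
        pages.setdefault((paper_id, url), []).append(page)
    return {key: sorted(found) for key, found in pages.items() if len(found) > 1}


dups = duplicated_links(mentions)
for (paper_id, url), found in dups.items():
    print(paper_id, url, found)
```

Keeping the page of every mention, rather than deduplicating first, makes it possible to see whether a repeated link also shows up on the first two pages.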
Manually looking at the publications, I have made the following observations:
Of those on page 0, 22 were research paper repos and 10 were not. After rerunning the analysis on the newer dataset, 14 out of 57 were false positives.
Of those on page 1, 32 were research paper repos and 24 were not. The papers with correct repo links on page 1 either had a title page with no content or a longer introduction that stretched onto the second page. After rerunning the analysis on the newer dataset, 29 out of 78 were false positives.
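To make the rerun figures easier to compare, here is a small sketch that turns the counts into false-positive shares. It assumes that the 57 links belong to page 0 and the 78 links to page 1, as the order of the notes suggests; the function name is hypothetical.

```python
def false_positive_share(false_positives, total):
    """Fraction of extracted links that were not research paper repos."""
    if total <= 0:
        raise ValueError("total must be positive")
    return false_positives / total


# Counts from the rerun on the newer dataset (page assignment is an assumption).
page0_share = false_positive_share(14, 57)
page1_share = false_positive_share(29, 78)

print(f"page 0: {page0_share:.1%}")  # page 0: 24.6%
print(f"page 1: {page1_share:.1%}")  # page 1: 37.2%
```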
Check the distribution of the page numbers on which the extracted links were found. Decide how to proceed depending on the outcome.
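A rough sketch of the page-number check, assuming zero-based page numbers (as in "page 0" above) and a flat list with one entry per extracted link. The list `link_pages` holds made-up values, and both helper names are hypothetical.

```python
from collections import Counter

# Made-up example: the page number on which each extracted link was found.
link_pages = [0, 0, 1, 1, 1, 4, 9, 12]


def page_distribution(pages):
    """Count links per page, ordered by page number."""
    return dict(sorted(Counter(pages).items()))


def share_within(pages, max_page):
    """Fraction of links found on pages 0..max_page (inclusive)."""
    if not pages:
        return 0.0
    return sum(1 for p in pages if p <= max_page) / len(pages)


distribution = page_distribution(link_pages)
first_two = share_within(link_pages, 1)
print(distribution)
print(f"share on first two pages: {first_two:.1%}")
```

The cumulative share gives a quick estimate of how many links a filter restricted to the first two pages would miss.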