softwaresaved / rse-repo-analysis

Study of research software in repositories. Contact: @karacolada
BSD 3-Clause "New" or "Revised" License

Page numbers #24

Closed by karacolada 1 year ago

karacolada commented 1 year ago

Check distribution of page numbers the extracted links were found on. Decide how to proceed depending on the outcome.
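A minimal sketch of how such a distribution could be tallied, assuming the extracted links are available as `(publication, link, page)` records with zero-based page numbers. The record layout, the sample data, and the name `page_distribution` are hypothetical and are not taken from the repository's code.

```python
from collections import Counter

# Hypothetical records: (publication id, extracted link, zero-based page number).
mentions = [
    ("paper-a", "https://github.com/example/one", 0),
    ("paper-a", "https://github.com/example/one", 7),
    ("paper-b", "https://github.com/example/two", 1),
    ("paper-c", "https://github.com/example/three", 0),
    ("paper-d", "https://github.com/example/four", 12),
]


def page_distribution(records, max_page=10):
    """Count link mentions per page, keeping only pages below max_page."""
    counts = Counter(page for _, _, page in records if page < max_page)
    # Sort by page number so the result reads like a histogram axis.
    return dict(sorted(counts.items()))


distribution = page_distribution(mentions)
for page, count in distribution.items():
    print(f"page {page}: {count} link(s)")
```

The `max_page` cut-off mirrors the "first 10 pages" view; raising it would show how many links sit on later pages.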

karacolada commented 1 year ago

Distribution of GitHub links found on the first 10 pages:

[image]

karacolada commented 1 year ago

Of those that are on page 0, 10 were not research paper repos and 22 were.

karacolada commented 1 year ago

Of those that are on page 1, 32 were research paper repos and 24 were not. The papers with correct repo links on page 1 either had a title page with no content or a longer introduction that stretched into the second page.

karacolada commented 1 year ago

Currently blocked: the analysis runs so far removed duplicate GitHub links within a single paper. Rerunning now.

karacolada commented 1 year ago

Found only 10 repository links that were duplicated within the same publication, of which only one was mentioned on the first two pages. However, all of these repositories were real RSE repositories created for the publication.
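For illustration, a check like this could be sketched as follows, again assuming hypothetical `(publication, link, page)` records with zero-based page numbers in which duplicates have not been removed. The function name `duplicated_links` and the sample data are made up for this example.

```python
from collections import defaultdict

# Hypothetical records with duplicates kept: (publication id, link, zero-based page).
records = [
    ("paper-a", "https://github.com/example/one", 0),
    ("paper-a", "https://github.com/example/one", 9),
    ("paper-b", "https://github.com/example/two", 4),
    ("paper-b", "https://github.com/example/two", 11),
    ("paper-c", "https://github.com/example/three", 1),
]


def duplicated_links(rows, first_pages=2):
    """Find links mentioned more than once within the same publication.

    Returns a dict keyed by (publication, link) with the sorted pages and a
    flag saying whether any mention falls on the first `first_pages` pages.
    """
    pages_by_key = defaultdict(list)
    for publication, link, page in rows:
        pages_by_key[(publication, link)].append(page)

    result = {}
    for key, pages in pages_by_key.items():
        if len(pages) > 1:
            result[key] = {
                "pages": sorted(pages),
                "on_first_pages": any(p < first_pages for p in pages),
            }
    return result


dupes = duplicated_links(records)
early = [key for key, info in dupes.items() if info["on_first_pages"]]
print(f"{len(dupes)} duplicated link(s), {len(early)} mentioned on the first 2 pages")
```

Duplicated links that never appear on the first pages are exactly the ones a page-based filter would drop.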

karacolada commented 1 year ago

Unsure how to continue with this. The duplication analysis suggests that there are quite a few links mentioned on later pages, too, which we are missing with our current filtering method.

Manually looking at the publications, I have made the following observations:

karacolada commented 1 year ago

> Of those that are on page 0, 10 were not research paper repos and 22 were.

Reran the analysis on the newer dataset. Out of 57, 14 were false positives.

karacolada commented 1 year ago

> Of those that are on page 1, 32 were research paper repos and 24 were not. The papers with correct repo links on page 1 either had a title page with no content or a longer introduction that stretched into the second page.

Reran the analysis on the newer dataset. Out of 78, 29 were false positives.
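To compare the two runs, the counts reported in this thread can be turned into shares of false positives. This is only a worked calculation on the numbers above, assuming each rerun figure refers to the same page as the comment it quotes. The "share" here is false positives divided by all checked links (one minus precision), not a false positive rate in the strict statistical sense.

```python
def false_positive_share(false_positives, total):
    """Fraction of checked links that were not research paper repos."""
    return false_positives / total


# (false positives, total links checked), taken from the comments in this thread.
counts = {
    "page 0, first run": (10, 10 + 22),
    "page 0, newer dataset": (14, 57),
    "page 1, first run": (24, 32 + 24),
    "page 1, newer dataset": (29, 78),
}

shares = {label: false_positive_share(fp, total) for label, (fp, total) in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.3f}")
```

On both datasets, page 1 has a higher share of false positives than page 0.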