softwaresaved / rse-repo-analysis

Study of research software in repositories. Contact: @karacolada

BSD 3-Clause "New" or "Revised" License

Page numbers #24

Closed karacolada closed 1 year ago

karacolada commented 1 year ago

Check distribution of page numbers the extracted links were found on. Decide how to proceed depending on the outcome.
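
A minimal sketch of such a distribution check, assuming the extraction step yields `(paper_id, page_index, url)` tuples with 0-based page indices. The tuple layout, the example URLs and the helper name `page_distribution` are illustrative assumptions, not the project's actual data format:

```python
from collections import Counter

# Hypothetical extraction output: (paper_id, page_index, url) tuples.
# Page indices are assumed to be 0-based.
extracted_links = [
    ("paper-a", 0, "https://github.com/example/tool-a"),
    ("paper-a", 7, "https://github.com/example/dependency"),
    ("paper-b", 1, "https://github.com/example/tool-b"),
    ("paper-c", 0, "https://github.com/example/tool-c"),
    ("paper-c", 12, "https://github.com/example/other"),
]


def page_distribution(links, max_pages=10):
    """Count links per page index, keeping only the first `max_pages` pages."""
    counts = Counter(page for _, page, _ in links if page < max_pages)
    return dict(sorted(counts.items()))


distribution = page_distribution(extracted_links)
for page, count in distribution.items():
    print(f"page {page}: {count} link(s)")
```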

karacolada commented 1 year ago

Distribution of GitHub links found on the first 10 pages: [figure not preserved]

karacolada commented 1 year ago

Of those that are on page 0, 10 were not research paper repos and 22 were.

karacolada commented 1 year ago

Of those that are on page 1, 32 were research paper repos and 24 were not. The papers with correct repo links on page 1 either had a title page with no content or a longer introduction that stretched into the second page.

karacolada commented 1 year ago

- [x] Check for duplicate links that are mentioned both at the start and at the end of a paper; those might be referenced repos to filter out?

karacolada commented 1 year ago

Currently blocked: the analysis runs so far removed duplicates of GitHub links within one paper. Rerunning now.

karacolada commented 1 year ago

Found only 10 repository links that were duplicated within the same publication, of which only one was mentioned on the first 2 pages. However, all of these repositories were real RSE repositories created for the publication.
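
For illustration, a duplicate check of this kind could look roughly like the sketch below. It assumes the same hypothetical `(paper_id, page_index, url)` tuples, keeps every mention instead of deduplicating, and normalises URLs by lower-casing and stripping a trailing slash. That normalisation is an assumption and not necessarily what the analysis does:

```python
from collections import defaultdict

# Hypothetical mentions: (paper_id, page_index, url), duplicates NOT removed.
mentions = [
    ("paper-a", 0, "https://github.com/example/tool-a"),
    ("paper-a", 9, "https://github.com/example/tool-a/"),
    ("paper-b", 3, "https://github.com/example/tool-b"),
    ("paper-b", 11, "https://github.com/example/tool-b"),
    ("paper-c", 1, "https://github.com/example/tool-c"),
]


def duplicated_links(mentions):
    """Return {(paper_id, url): sorted pages} for links mentioned more than once."""
    pages_by_link = defaultdict(list)
    for paper_id, page, url in mentions:
        key = (paper_id, url.rstrip("/").lower())  # assumed normalisation
        pages_by_link[key].append(page)
    return {key: sorted(pages) for key, pages in pages_by_link.items() if len(pages) > 1}


dups = duplicated_links(mentions)
# Duplicated links whose earliest mention is on the first 2 pages (index 0 or 1).
early = {key: pages for key, pages in dups.items() if pages[0] < 2}
print(len(dups), "duplicated;", len(early), "first mentioned on pages 0-1")
```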

karacolada commented 1 year ago

Unsure how to continue with this. The duplication analysis suggests that there are quite a few links mentioned on later pages, too, which we are missing with our current filtering method.

Manually looking at the publications, I have made the following observations:

- usually, links are found either in a footnote or in full text
  - footnotes can have links to used software as well, though I think the first (upper) one is usually the "real" repo, but only if they actually have a repo for their work, so... tricky.
  - full text should be easier to recognise from context: it will have "available", "open source", "this project" etc. somewhere around it
- references might be in early pages, but sometimes also happen somewhere in the middle (e.g. as footnote to methods section) or towards the end (e.g. as endnote or availability section)
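
The full-text observation could be turned into a simple heuristic along these lines. This is only a sketch: the cue list comes from the phrases above, while the regular expression, the 60-character window and the function name `links_with_context_cues` are assumptions chosen for illustration:

```python
import re

# Cue phrases taken from the observation above; "open-source" is an added variant.
CUES = ("available", "open source", "open-source", "this project")
# Rough pattern for github.com/<owner>/<repo>; an assumption, not a full URL grammar.
GITHUB_URL = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def links_with_context_cues(text, window=60):
    """Return (url, cues found within `window` characters around the url) pairs."""
    results = []
    for match in GITHUB_URL.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window].lower()
        found = [cue for cue in CUES if cue in context]
        # Drop a sentence-final period that the pattern may have swallowed.
        results.append((match.group(0).rstrip("."), found))
    return results


sample = (
    "The source code for this project is available at "
    "https://github.com/example/tool-a under a BSD license. "
    "Later sections report extensive experiments and comparisons against a strong baseline, "
    "namely https://github.com/example/baseline, on three datasets."
)
results = links_with_context_cues(sample)
for url, cues in results:
    print(url, "->", cues or "no cue nearby")
```

Links without any nearby cue are not necessarily referenced software, so this would only serve as one signal next to the page number.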

karacolada commented 1 year ago

> Of those that are on page 0, 10 were not research paper repos and 22 were.

Reran analysis on newer dataset. Out of 57, 14 were false positives.

karacolada commented 1 year ago

> Of those that are on page 1, 32 were research paper repos and 24 were not. The papers with correct repo links on page 1 either had a title page with no content or a longer introduction that stretched into the second page.

Reran analysis on newer dataset. Out of 78, 29 were false positives.
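
To compare the runs, the counts reported in this thread can be converted to false-positive rates. The sketch below assumes that "not research paper repos" in the first run means the same as "false positives" in the rerun:

```python
def false_positive_rate(false_positives, total):
    """Fraction of extracted links that were not research paper repos."""
    if total <= 0:
        raise ValueError("total must be positive")
    return false_positives / total


# (false positives, total links) per page and run, taken from the comments above.
counts = {
    "page 0, first run": (10, 32),      # 10 not paper repos, 22 paper repos
    "page 1, first run": (24, 56),      # 24 not paper repos, 32 paper repos
    "page 0, newer dataset": (14, 57),
    "page 1, newer dataset": (29, 78),
}

for label, (fp, total) in counts.items():
    print(f"{label}: {fp}/{total} = {false_positive_rate(fp, total):.3f}")
```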