softwaresaved / rse-repo-analysis

Study of research software in repositories. Contact: @karacolada
BSD 3-Clause "New" or "Revised" License

Investigate cutoff in README files #37

Closed karacolada closed 1 year ago

karacolada commented 1 year ago

The current cutoffs for the README size histogram are based on statistical bins. Instead, investigate where the cutoffs fall between an automatically generated README, one with a single small section, and a large, full README.

karacolada commented 1 year ago

Settled on the following bins:

| Bytes | Bin meaning | Examples (low, high) |
| --- | --- | --- |
| 0 - 1 | No README | eghbal11/Eghbal, supercollider-quarks/Republic |
| 1 - 300 | Ultra-short (title and description from repo creation) | epsilonlabs/emf-cbp, brunomozza/IoTSecurityOntology |
| 300 - 1500 | Short description | lphowell/Geothermal-Modelling, oreindt/routes-rumours-ml3 |
| 1500 - 10000 | Informative README | sanket0707/GNN-Mixer, ok1zjf/lbae |
| 10000 - | Highly detailed | uos/mesh_navigation, stuartemiddleton/glosat_table_dataset |

They aren't perfect, but they give a rough structure. Note that the listed examples are from the edges of the bins, not the middle, so examples from neighbouring bins can look similar.
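For reference, a minimal sketch of how a README size could be mapped to these bins. The function name `readme_bin` and the edge handling (lower edge inclusive, upper edge exclusive) are assumptions for illustration; the analysis code in the repo may treat the edges differently.

```python
import bisect

# Bin edges in bytes, taken from the bins listed above.
# Assumption: each bin includes its lower edge and excludes its upper edge.
BIN_EDGES = [1, 300, 1500, 10000]
BIN_LABELS = [
    "No README",
    "Ultra-short",
    "Short description",
    "Informative README",
    "Highly detailed",
]


def readme_bin(size_bytes: int) -> str:
    """Return the bin label for a README of the given size in bytes."""
    if size_bytes < 0:
        raise ValueError("README size cannot be negative")
    # bisect_right counts how many edges are <= size_bytes,
    # which is exactly the index of the matching label.
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, size_bytes)]
```

With these assumptions, `readme_bin(0)` falls into "No README" and anything of 10000 bytes or more into "Highly detailed".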

karacolada commented 1 year ago

High-interest repos have more informative READMEs. Repos with short description-type READMEs do not generate high interest. On the other hand, an informative README does not guarantee high interest: the proportion of informative READMEs is comparable across all interest categories.
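A rough sketch of how the per-category proportions behind this comparison could be computed. The helper `bin_proportions` and the shape of its input (pairs of interest category and README bin label) are hypothetical; how interest categories are actually defined and stored in the analysis is not covered in this thread.

```python
from collections import Counter, defaultdict


def bin_proportions(records):
    """Compute the share of each README bin within each interest category.

    records: iterable of (interest_category, readme_bin_label) pairs.
    Returns {category: {bin_label: proportion}}, where the proportions
    within each category sum to 1.
    """
    counts = defaultdict(Counter)
    for category, label in records:
        counts[category][label] += 1

    proportions = {}
    for category, label_counts in counts.items():
        total = sum(label_counts.values())
        proportions[category] = {
            label: n / total for label, n in label_counts.items()
        }
    return proportions
```

Comparing the "Informative README" share across the returned categories is one way to check whether it stays roughly constant.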