Closed by lucyb 3 months ago
From Job Server ReleaseFiles, categorise by extension:
extension_categories = {
    "html": "report",
    "png": "image",
    "csv": "tabular",
    "json": "tabular",
    "jpeg": "image",
    "txt": "tabular",
    "svg": "image",
    "jpg": "image",
    "gz": "other",
    "ods": "tabular",
    "log": "other",
    "ipynb": "report",
    "pdf": "report",
    "db": "other",
    "eps": "image",
    "tiff": "image",
    "csv#": "tabular",
    "xls": "tabular",
}
Then group by project ID and file category, and count the number of files. For each project, calculate the "finished" (image and report) files as a proportion of total files released (excluding "other"):
count 92.000000
mean 0.328568
std 0.322190
min 0.000000
25% 0.000000
50% 0.247652
75% 0.542579
max 1.000000
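A minimal sketch of the calculation described above, using only the Python standard library. The `(project_id, filename)` rows are made-up sample data, not real ReleaseFiles records, and the mapping is an abbreviated copy of `extension_categories`. Treating unknown extensions as "other" and dropping projects that only released "other" files are assumptions, not something confirmed in this thread.

```python
from collections import Counter, defaultdict
from statistics import median

# Abbreviated copy of the extension mapping, for illustration only.
EXTENSION_CATEGORIES = {
    "csv": "tabular", "txt": "tabular", "png": "image",
    "svg": "image", "html": "report", "log": "other",
}
FINISHED = {"image", "report"}


def categorise(filename):
    """Map a filename to a category by its final extension.

    Assumption: files with no extension or an unknown one count as "other".
    """
    _, dot, ext = filename.rpartition(".")
    return EXTENSION_CATEGORIES.get(ext.lower(), "other") if dot else "other"


def finished_proportions(release_files):
    """Return {project_id: finished / (finished + tabular)}, ignoring "other"."""
    counts = defaultdict(Counter)
    for project_id, filename in release_files:
        category = categorise(filename)
        if category != "other":
            counts[project_id][category] += 1
    # Projects with only "other" files never enter `counts`, so no division by zero.
    return {
        project_id: sum(c[cat] for cat in FINISHED) / sum(c.values())
        for project_id, c in counts.items()
    }


# Hypothetical (project_id, filename) rows standing in for ReleaseFiles.
sample = [
    (1, "table.csv"), (1, "plot.png"), (1, "report.html"), (1, "run.log"),
    (2, "counts.csv"), (2, "notes.txt"),
    (3, "figure.svg"),
]
proportions = finished_proportions(sample)
print(proportions)
print("median:", median(proportions.values()))
```

In the sample, project 2 has only tabular files, so its proportion is 0; that is the same situation as the projects that have only ever released tabular data.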
So you could say that for the median project, about 25% of the released outputs are those that could have been produced post-release (the mean is higher, at about 33%). Now, this is clearly a bit wrong because tabular data itself is an output that may be used in a manuscript, and there will be instances where both a tabular and a graphical representation of the same data are released.
27 (of 92) projects have only ever released tabular data.
Anyone got any more ideas for how this could be analysed?
I think breaking down over time could be helpful.
I would expect more images to have been released in early OS life, and fewer as time has gone on (based on the assumption that users have realised they don't need to, and it's a PITA to do so).
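One way the over-time breakdown could be sketched: bucket released files by calendar month and compute the image/report share within each bucket. This assumes each file has a release date and an already-assigned category; the sample rows and the function name `monthly_finished_proportion` are hypothetical. Note that this is a per-file proportion within each month, not the per-project proportion summarised earlier.

```python
from collections import defaultdict
from datetime import date

FINISHED = {"image", "report"}


def monthly_finished_proportion(rows):
    """rows: iterable of (release_date, category).

    Returns {(year, month): finished / total}, in chronological order,
    with "other" files excluded from both numerator and denominator.
    """
    totals = defaultdict(lambda: [0, 0])  # [finished count, total count]
    for released_on, category in rows:
        if category == "other":
            continue
        bucket = totals[(released_on.year, released_on.month)]
        bucket[1] += 1
        if category in FINISHED:
            bucket[0] += 1
    return {
        month: finished / total
        for month, (finished, total) in sorted(totals.items())
    }


# Hypothetical rows; real dates would come from the release records.
sample = [
    (date(2021, 3, 2), "image"),
    (date(2021, 3, 9), "tabular"),
    (date(2021, 3, 20), "other"),
    (date(2022, 6, 1), "tabular"),
    (date(2022, 6, 15), "tabular"),
    (date(2022, 6, 30), "report"),
]
by_month = monthly_finished_proportion(sample)
for (year, month), share in by_month.items():
    print(f"{year}-{month:02d}: {share:.2f}")
```

Months with few releases will give noisy proportions, so it is probably worth keeping the per-month file count alongside the share when plotting.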
What's the dotted line?
I'm also sceptical that we can conclude much from the downward trend, because we've been so much quieter for the last 6 months?
I couldn't be bothered fighting with matplotlib to get it in the same legend, sorry! It's the proportion
Yep, I think that we shouldn't conclude there's been a significant downward trend in the proportion of image/report outputs in the past 12 months or so; it's pretty noisy and n is small. With major caveats, I'd say the median value of about 25% isn't too far wrong.
We know that some people only use the OpenSAFELY pipeline to generate CSVs and then generate the graphs locally, so they can have a faster feedback loop and use newer libraries. I expect that working with released (not published) files will not be possible in Codespaces, so they will need to continue to use local development environments for this.
It would be good to find out how many researchers this affects. We could do this by counting the number of workspaces that have released CSVs compared to workspaces that have released images.
Timebox: 2 hours
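A rough sketch of the workspace comparison suggested above: for each workspace, record which file categories it has ever released, then count workspaces that released tabular files only, images only, both, or neither. The workspace names and rows are invented for illustration, and the categories are assumed to have been assigned from file extensions beforehand.

```python
from collections import Counter, defaultdict


def workspace_release_profile(rows):
    """rows: iterable of (workspace, category).

    Returns a Counter with keys "tabular_only", "image_only", "both"
    and "neither", counting workspaces rather than files.
    """
    released = defaultdict(set)
    for workspace, category in rows:
        released[workspace].add(category)

    profile = Counter()
    for categories in released.values():
        has_tabular = "tabular" in categories
        has_image = "image" in categories
        if has_tabular and has_image:
            profile["both"] += 1
        elif has_tabular:
            profile["tabular_only"] += 1
        elif has_image:
            profile["image_only"] += 1
        else:
            profile["neither"] += 1
    return profile


# Hypothetical (workspace, category) rows.
sample = [
    ("ws-a", "tabular"),
    ("ws-b", "tabular"), ("ws-b", "image"),
    ("ws-c", "image"),
    ("ws-d", "tabular"), ("ws-d", "other"),
    ("ws-e", "report"),
]
profile = workspace_release_profile(sample)
print(dict(profile))
```

The "tabular_only" count is the closest proxy here for researchers who may be generating graphs locally, though a workspace that only releases CSVs might equally be using the tables directly.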