opensafely-core / codespaces-initiative

Improving the use of OpenSAFELY in Codespaces
MIT License

How often do researchers generate graphs outside of the OpenSAFELY pipeline? #34

Closed lucyb closed 3 months ago

lucyb commented 3 months ago

We know that some people only use the OpenSAFELY pipeline to generate CSVs and then generate the graphs locally, so they can have a faster feedback loop and use newer libraries. I expect that working with released (not published) files will not be possible in Codespaces, so they will need to continue to use local development environments for this.

It would be good to find out how many researchers this affects. We could do this by counting the number of workspaces that have released CSVs compared to workspaces that have released images.

Timebox: 2 hours

Jongmassey commented 3 months ago

From Job Server ReleaseFiles, categorise by extension:

extension_categories = {
    "html": "report",
    "png": "image",
    "csv": "tabular",
    "json": "tabular",
    "jpeg": "image",
    "txt": "tabular",
    "svg": "image",
    "jpg": "image",
    "gz": "other",
    "ods": "tabular",
    "log": "other",
    "ipynb": "report",
    "pdf": "report",
    "db": "other",
    "eps": "image",
    "tiff": "image",
    "csv#": "tabular",
    "xls": "tabular",
}
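As a rough illustration of how a file name could be mapped to one of these categories, here is a minimal sketch. The helper name `categorise`, the choice to look only at the final extension, and the fallback to "other" for unknown extensions are all assumptions for illustration, not necessarily how the analysis above was done. Only a subset of the mapping is repeated so the snippet runs on its own.

```python
from pathlib import PurePosixPath

# Subset of the mapping above, repeated so the sketch is self-contained.
EXTENSION_CATEGORIES = {
    "html": "report",
    "png": "image",
    "csv": "tabular",
    "gz": "other",
}


def categorise(filename: str, default: str = "other") -> str:
    """Return the category for a file name based on its final extension.

    Hypothetical helper: unknown or missing extensions fall back to `default`.
    """
    # PurePosixPath.suffix is the last extension including the dot,
    # e.g. ".gz" for "table.csv.gz"; strip the dot and lower-case it.
    extension = PurePosixPath(filename).suffix.lstrip(".").lower()
    return EXTENSION_CATEGORIES.get(extension, default)
```

Note that under this approach a compressed table such as `table.csv.gz` would be counted as "other" rather than "tabular", because only the final extension is considered.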

Then group by project ID and file category, and count the number of files. For each project, calculate the "finished" (image and report) files as a proportion of total files released (excluding "other"):

count    92.000000
mean      0.328568
std       0.322190
min       0.000000
25%       0.000000
50%       0.247652
75%       0.542579
max       1.000000

[image]

So you could say that for the median project, about 25% of released outputs (about 33% on average) are ones that could have been produced post-release. Now, this is clearly a bit wrong, because tabular data is itself an output that may be used in a manuscript, and there will be instances where both a tabular and a graphical representation of the same data are released.

27 (of 92) projects have only ever released tabular data.
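The grouping and proportion calculation described above can be sketched as follows, using only the standard library. The toy data and the names `released`, `FINISHED`, `proportions` and `tabular_only` are hypothetical; the real input would be the categorised ReleaseFiles from Job Server, whose actual field names are not shown in this thread.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical toy data: (project_id, category) for each released file.
released = [
    (1, "tabular"), (1, "image"), (1, "report"), (1, "other"),
    (2, "tabular"), (2, "tabular"),
    (3, "image"), (3, "tabular"),
]

# Categories treated as "finished" outputs.
FINISHED = {"image", "report"}

# Count files per project and category, ignoring "other".
counts = defaultdict(Counter)
for project_id, category in released:
    if category != "other":
        counts[project_id][category] += 1

# Proportion of "finished" files among each project's counted files.
proportions = {
    project_id: sum(n for cat, n in by_cat.items() if cat in FINISHED)
    / sum(by_cat.values())
    for project_id, by_cat in counts.items()
}

# With "other" excluded, a proportion of zero means tabular files only.
tabular_only = [p for p, share in proportions.items() if share == 0]

print(f"mean={mean(proportions.values()):.3f}")
print(f"median={median(proportions.values()):.3f}")
print(f"tabular-only projects: {len(tabular_only)} of {len(proportions)}")
```

One consequence of this sketch is that a project which has released only "other" files never appears in `proportions`, so it does not contribute to the summary statistics.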

Anyone got any more ideas for how this could be analysed?

bloodearnest commented 3 months ago

I think breaking down over time could be helpful.

I would expect more images to have been released early in OpenSAFELY's life, and fewer as time has gone on (based on the assumption that users have realised they don't need to, and it's a PITA to do so).
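A breakdown over time could look something like the sketch below, which computes the proportion of image and report files per calendar year. The toy data, the yearly granularity, and the names `released` and `proportion_by_year` are assumptions for illustration; the real release timestamps would come from Job Server.

```python
from collections import Counter
from datetime import date

# Hypothetical toy data: (release date, category) for each released file.
released = [
    (date(2021, 3, 1), "image"),
    (date(2021, 9, 15), "tabular"),
    (date(2022, 2, 2), "tabular"),
    (date(2022, 6, 30), "report"),
    (date(2022, 11, 5), "tabular"),
    (date(2022, 12, 1), "other"),
]

# Categories treated as "finished" outputs.
FINISHED = {"image", "report"}

totals = Counter()    # files per year, excluding "other"
finished = Counter()  # image and report files per year
for released_on, category in released:
    if category == "other":
        continue
    totals[released_on.year] += 1
    if category in FINISHED:
        finished[released_on.year] += 1

# Proportion of "finished" files for each year that has any counted files.
proportion_by_year = {
    year: finished[year] / totals[year] for year in sorted(totals)
}

for year, share in proportion_by_year.items():
    print(year, f"{share:.2f}", f"(n={totals[year]})")
```

Printing the file count next to each proportion makes it easier to see when a period has too few releases for its proportion to mean much.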

Jongmassey commented 3 months ago

[image]

bloodearnest commented 3 months ago

What's the dotted line?

I'm also doubtful we can conclude much from the downward trend, given we've been so much quieter for the last 6 months.

Jongmassey commented 3 months ago

I couldn't be bothered fighting with matplotlib to get it in the same legend, sorry! It's the proportion

Jongmassey commented 3 months ago

Yep, I think that we shouldn't conclude there's been a significant downward trend in the proportion of image/report outputs in the past 12 months or so; it's pretty noisy and n is small. With major caveats, I'd say the median value of about 25% isn't too far wrong.