evansd opened this issue 3 years ago
> I think this is, in one sense, unarguably the correct behaviour
Yes. There's one valid use-case where you regenerate the cohort with a tiny change to debug something; i.e. it's a temporary change. This can be handled by requiring people to make a new workspace.
The only problem is that we've not done a great job at inducting people into having a sensible mental model of what GitHub branches are for; we need to add some pointers in the job runner and documentation saying that people should default to treating `main` as the default workspace and only having other branches for this kind of workflow.
I'm not sure it's that destructive. I can't think of any use-case other than the one above. I think we can get away with just doing it, as long as we also add some kind of warning in the job server.
I agree: I think this is the correct behaviour. As the Cookiecutter Data Science project says, analysis is a DAG. What it doesn't say, unfortunately, is that analysis should also have a reactive dataflow (which is why I like Observable notebooks). Your proposal encourages us to think about analysis as both.
I haven't mentioned Vega this week 😄, but that also encourages us to think about analysis as both. As well as statistical transformations, it extends this to include visual transformations.
I'm in favour of this, but I can think of more scenarios where this change would be disruptive. Mainly for development work, but not necessarily.
**1.** I want to change the script, but this makes no difference to the outputs, e.g. refactoring code for readability or efficiency, or adding comments. This means the action is different internally but the outputs are (expected to be) identical. In this case there's no need to re-run the action or any downstream actions, except to check that the changes don't unexpectedly change the outputs. But if doing this check on dummy data deletes all downstream outputs, this would be annoying.
Bypassing the invalidation/deletion rule if all of the action's outputs are identical to the previous run seems sensible here. This behaviour would also be useful if I re-extracted data but there were no changes (unlikely, but possible for small cohorts). But it may be tricky to implement -- e.g. it makes sense to use diffs between commits to do this check rather than comparing new outputs with whatever old outputs are on disk at the time, in case I accidentally do change the output and overwrite what's on disk (so the old "correct" version is lost for the comparison).
**2.** I want to change some of the output files for an action, but those particular files aren't used by any other actions (though other output files from the action are). For example, if I want to output an extra column in a table or print additional info to the log file. In this case I'd want to re-run the action, and (some of) the action's outputs are different, but I don't need to re-run any downstream actions and so I don't want their outputs to be deleted.
If there was a way to specify file dependencies for actions (rather than upstream action dependencies as now) then you could use the same checking principle as in the previous example by comparing new outputs with old outputs, but only for files that are used by downstream actions. Allowing users to specify file dependencies has been discussed before, but I don't know whether it's actually being considered (e.g. it would also be useful for dealing with slow copying of files when each action is run, but I know there are other upcoming solutions to this).
**3.** I want to change an action and the outputs are expected to be different, but not in any way that affects downstream actions. For example, if I want to add column C to an output CSV for inspection, but only columns A and B are needed by a downstream action.
Not sure how to deal with this, because in order to automate it you'd have to run the downstream actions anyway to check that they're unchanged (in the absence of a super-smart parser). Perhaps by using dummy data to do the check, without having to do it on real data?
Another example is if I wanted to reduce the size of a model object by removing some large, unnecessary components, but this had no effect on the object's use by downstream actions. Whilst there's no point in making these changes unless I want to improve my workflow -- and therefore I will expect to be re-running all actions again at some point anyway -- that doesn't mean I want to delete all downstream outputs, as they may still be needed until then.
Thanks @wjchulme, these are helpful points. A few responses:
**Different script, same outputs**
I think we can solve this one via the proposal in #181, which includes recording the MD5 hash of each output file as we create it. This would enable us to tell when output files are identical.
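For illustration, here's a minimal sketch of what that check might look like (the `manifest.json` name and format are my own invention for this sketch, not what #181 actually specifies):

```python
import hashlib
import json
from pathlib import Path

def md5_of(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def outputs_unchanged(output_paths, manifest_path="manifest.json"):
    """True if every fresh output matches the hash recorded for the
    previous run, i.e. nothing downstream needs invalidating."""
    recorded = json.loads(Path(manifest_path).read_text())
    return all(md5_of(p) == recorded.get(p) for p in output_paths)
```

If `outputs_unchanged` returned True, the invalidation/deletion rule could be skipped, which covers @wjchulme's first scenario.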
**Different output files, but they're not used in dependent actions**
@bloodearnest and I have just been discussing a way to list the specific files used as inputs for each action, rather than naming dependent actions with `needs`, which (among other advantages) would also solve this particular problem.
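To sketch the idea (the data structure below is illustrative only, not a real `project.yaml` schema; the file names are borrowed from the example elsewhere in this thread):

```python
# Hypothetical declarations: each action lists the files it reads,
# rather than the actions it needs.
INPUTS = {
    "prepare_dataset": ["output/input.csv"],
    "run_model": ["output/patient_data.dta"],
}

def actions_affected_by(changed_files):
    """Return the actions whose declared input files have changed."""
    changed = set(changed_files)
    return {
        action for action, files in INPUTS.items() if changed & set(files)
    }

# A changed log file invalidates nothing; a changed input.csv
# invalidates prepare_dataset but not run_model.
print(actions_affected_by(["output/input.csv"]))  # {'prepare_dataset'}
```

With declarations like these, a change to a file that no downstream action reads (e.g. a log file) would invalidate nothing.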
**Different output files, but changes don't affect dependent actions**
I think this one is fundamentally unsolvable in that there can be no reliable, automated way of telling which changes are significant and which aren't.
I think there's scope for quite different default behaviour between development and production here, to support the different kinds of workflow. It's production where I'm most keen to avoid people accidentally running against stale data. How about something like this:
- The `opensafely` tool gives a warning when encountering stale files and asks whether to regenerate them or continue using them.
- The `osrelease` tool gives a warning when attempting to publish stale files.

Would something like that work?
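For concreteness, "stale" here could be something as simple as an output being older than the inputs it was built from; a rough sketch (the function and paths are illustrative, not actual job-runner code):

```python
import os

def is_stale(output_path, input_paths):
    """An output is stale if any file it was built from has been
    modified more recently than the output itself."""
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > output_mtime for p in input_paths)

# e.g. before using a patient_data.dta built from an older input.csv:
if is_stale("output/patient_data.dta", ["output/input.csv"]):
    print("Warning: output/patient_data.dta is stale; regenerate or continue?")
```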
I just half-wrote and then deleted three scenarios, which it turns out are all dealt with nicely by those suggestions. So, yes, I think that would work well.
A real-world example for discussion (along the lines of @wjchulme's third category):
I added a single new field to the study definition and created a separate new notebook to analyse the data including this column. I'm now rerunning the cohort extraction so I can run the new notebook. However, I'm not rerunning the other analyses because (a) I don't need them as inputs for the new notebook and (b) it's running on the same cut of data in the backend anyway so the outputs should be identical to before.
Thoughts arising:
I wrote the following to clarify my understanding. There are a couple of questions there, though 🙂
(Item 1 from @wjchulme.)
If we're thinking about creating a hash of the output of the action (e.g. `output/input.csv`), then a change to the action shouldn't trigger a cascade to downstream actions. From my understanding, though, it would when running in development, because the outputs of actions are randomly generated. It shouldn't when running in production, although are we confident that each backend will return the same results from the same query run at different times?
(Item 2 from @wjchulme.)
This is the approach that DVC takes: the `run` command adds a stage (an action) with dependencies (`-d`) and outputs (`-o` and `-O`), accepting paths (to either files or directories) rather than stages. However, it can be verbose and -- speaking from experience -- confusing. DVC has the `dag` command to visualise the pipeline, which helps.
(Item 3 from @wjchulme and from @HelenCEBM.)
Might it help if we conceptually separate changes to the schema and changes to the values? Here, the schema has changed but this change has no bearing on the downstream actions: in the example, a new column has been created, but the existing columns have not been updated. Notice, then, that our dependencies are neither outputs nor actions; they're columns.
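To make that concrete, a pandas sketch (the column names follow the example above; this is just one possible realisation, and as noted below it's unclear it generalises across languages):

```python
import hashlib
import pandas as pd

def hash_columns(csv_path, columns):
    """Fingerprint only the named columns, so unrelated schema changes
    (e.g. a new column C) don't register as a change."""
    frame = pd.read_csv(csv_path, usecols=columns)
    data = frame[sorted(columns)].to_csv(index=False).encode()
    return hashlib.md5(data).hexdigest()

# A downstream action that reads only A and B keeps the same
# fingerprint when column C is added upstream.
fingerprint = hash_columns("output/input.csv", ["A", "B"])
```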
> outputs of actions are randomly generated
They are random but they use the same seed each time, so this shouldn't be an issue. See https://github.com/opensafely-core/cohort-extractor/issues/507
> Notice, then, that our dependencies are neither outputs nor actions; they're columns.
Ah yes, this is a very helpful way to conceptualise it. But as @evansd says I'm not sure it's feasible to try to establish a general framework for realising that concept in a reliable and automated way, across multiple languages.
> people should default to treating `main` as the default workspace (@sebbacon)

Hopefully the above PR is a nudge in this direction.
Could I just check my understanding, @wjchulme? From reading opensafely-core/cohort-extractor#507, it's my understanding that the intention is to use the same seed. However, this feature hasn't been implemented. I'm getting different dummy data when rerunning cohortextractor, you see.
As @remlapmot highlighted, adding the following to the study definition ensures that everything but `patient_id` is the same.

```python
import numpy
numpy.random.seed(42)
```
> this feature hasn't been implemented
ah yes!
For example, suppose `prepare_dataset` takes `input.csv` from `generate_cohort` and produces a `patient_data.dta` file. We run `prepare_dataset`, which automatically invokes `generate_cohort`, and now we have `input.csv` and `patient_data.dta` on disk.

If we update the study definition and re-run `generate_cohort` we will have a new `input.csv` file on disk. But we will also still have the old `patient_data.dta` file, produced from an older cohort. If we add further actions which depend on `prepare_dataset`, they will use this old file rather than re-running `prepare_dataset`.

My proposal is that by running `generate_cohort`, all files produced by "downstream" actions should be removed.

I think this is, in one sense, unarguably the correct behaviour; but it's also quite a drastic and potentially destructive change, so it could do with some thinking through first.
(Inspired by thinking about the issues raised in https://github.com/opensafely/documentation/pull/217)
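For concreteness, a minimal sketch of the proposed invalidation (the pipeline structure and the `run_model`/`model.txt` names are made up for illustration; this is not the job-runner's actual implementation):

```python
from pathlib import Path

# Hypothetical pipeline: action -> (actions it needs, files it produces).
PIPELINE = {
    "generate_cohort": ([], ["output/input.csv"]),
    "prepare_dataset": (["generate_cohort"], ["output/patient_data.dta"]),
    "run_model": (["prepare_dataset"], ["output/model.txt"]),
}

def downstream_of(action):
    """All actions that depend, directly or transitively, on `action`."""
    dependants = {
        name for name, (needs, _) in PIPELINE.items() if action in needs
    }
    for name in list(dependants):
        dependants |= downstream_of(name)
    return dependants

def invalidate_downstream(action):
    """Delete every file produced by an action downstream of `action`."""
    for name in downstream_of(action):
        for output in PIPELINE[name][1]:
            # missing_ok requires Python 3.8+
            Path(output).unlink(missing_ok=True)

# Re-running generate_cohort would remove patient_data.dta and model.txt.
invalidate_downstream("generate_cohort")
```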