evansd opened this issue 3 years ago
> I think this is, in one sense, unarguably the correct behaviour
Yes. There's one valid use-case where you regenerate the cohort with a tiny change to debug something; i.e. it's a temporary change. This can be handled by requiring people to make a new workspace.
The only problem is that we've not done a great job at inducting people into having a sensible mental model of what GitHub branches are for; we need to add some pointers in the job runner and documentation saying that people should default to treating `main` as the default workspace and only having other branches for this kind of workflow.
I'm not sure it's that destructive. I can't think of any use-case other than the one above. I think we can get away with just doing it, as long as we also add some kind of warning in the job server.
I agree: I think this is the correct behaviour. As the Cookiecutter Data Science project says, analysis is a DAG. What it doesn't say, unfortunately, is that analysis should also have a reactive dataflow (which is why I like Observable notebooks). Your proposal encourages us to think about analysis as both.
I haven't mentioned Vega this week 😄, but that also encourages us to think about analysis as both. As well as statistical transformations, it extends this to include visual transformations.
I'm in favour of this, but I can think of more scenarios where this change would be disruptive. Mainly for development work, but not necessarily.
**1.** I want to change the script, but this makes no difference to the outputs, e.g. refactoring code for readability or efficiency, or adding comments. This means the action is different internally but the outputs are (expected to be) identical. In this case there's no need to re-run the action or any downstream actions, except to check that the changes don't unexpectedly change the outputs. But if doing this check on dummy data deletes all downstream outputs, this would be annoying.
Bypassing the invalidation/deletion rule if all of the action's outputs are identical to the previous run seems sensible here. This behaviour would also be useful if I re-extracted data but there were no changes (unlikely, but possible for small cohorts). But it may be tricky to implement -- e.g. it makes sense to use diffs between commits to do this check rather than comparing new outputs with whatever old outputs are on disk at the time, in case I accidentally do change the output and overwrite what's on disk (so the old "correct" version is lost for the comparison).
**2.** I want to change some of the output files for an action, but those particular files aren't used by any other actions (though other output files from the action are). For example, if I want to output an extra column in a table or print additional info to the log file. In this case I'd want to re-run the action, and (some of) the action's outputs are different, but I don't need to re-run any downstream actions and so I don't want their outputs to be deleted.
If there was a way to specify file dependencies for actions (rather than upstream action dependencies as now) then you could use the same checking principle as in the previous example by comparing new outputs with old outputs, but only for files that are used by downstream actions. Allowing users to specify file dependencies has been discussed before, but I don't know whether it's actually being considered (e.g. it would also be useful for dealing with slow copying of files when each action is run, but I know there are other upcoming solutions to this).
**3.** I want to change an action and the outputs are expected to be different, but not in any way that affects downstream actions. For example, if I want to add column C to an output CSV for inspection, but only columns A and B are needed by a downstream action.
Not sure how to deal with this, because in order to automate it you'd have to run the downstream actions anyway to check that they're unchanged (in the absence of a super-smart parser). Perhaps by using dummy data to do the check, without having to do it on real data?
Another example is if I wanted to reduce the size of a model object by removing some large, unnecessary components, but this had no effect on the object's use by downstream actions. Whilst there's no point in making these changes unless I want to improve my workflow -- and therefore I will expect to be re-running all actions again at some point anyway -- that doesn't mean I want to delete all downstream outputs, as they may still be needed until then.
Thanks @wjchulme, these are helpful points. A few responses:
**Different script, same outputs**
I think we can solve this one via the proposal in #181, which includes recording the MD5 hash of each output file as we create it. This would enable us to tell when output files are identical.
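For illustration, here's a minimal sketch of what that check might look like (the `manifest.json` name and format are my own invention for this sketch, not what #181 actually specifies):

```python
import hashlib
import json
from pathlib import Path

def md5_of(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def outputs_unchanged(output_paths, manifest_path="manifest.json"):
    """True if every fresh output matches the hash recorded for the
    previous run, i.e. nothing downstream needs invalidating."""
    recorded = json.loads(Path(manifest_path).read_text())
    return all(md5_of(p) == recorded.get(p) for p in output_paths)
```

If `outputs_unchanged` returned True, the invalidation/deletion rule could be skipped, which covers @wjchulme's first scenario.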
**Different output files, but they're not used in dependent actions**
@bloodearnest and I have just been discussing a way to list the specific files used as inputs for each action, rather than naming dependent actions with `needs`, which (among other advantages) would also solve this particular problem.
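To sketch the idea (the data structure below is illustrative only, not a real `project.yaml` schema; the file names are borrowed from the example elsewhere in this thread):

```python
# Hypothetical declarations: each action lists the files it reads,
# rather than the actions it needs.
INPUTS = {
    "prepare_dataset": ["output/input.csv"],
    "run_model": ["output/patient_data.dta"],
}

def actions_affected_by(changed_files):
    """Return the actions whose declared input files have changed."""
    changed = set(changed_files)
    return {
        action for action, files in INPUTS.items() if changed & set(files)
    }

# A changed log file invalidates nothing; a changed input.csv
# invalidates prepare_dataset but not run_model.
print(actions_affected_by(["output/input.csv"]))  # {'prepare_dataset'}
```

With declarations like these, a change to a file that no downstream action reads (e.g. a log file) would invalidate nothing.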
**Different output files, but changes don't affect dependent actions**
I think this one is fundamentally unsolvable in that there can be no reliable, automated way of telling which changes are significant and which aren't.
I think there's scope for quite different default behaviour between development and production here, to support the different kinds of workflow. It's production where I'm most keen to avoid people accidentally running against stale data. How about something like this:
- The `opensafely` tool gives a warning when encountering stale files and asks whether to regenerate them or continue using them.
- The `osrelease` tool gives a warning when attempting to publish stale files.

Would something like that work?
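For concreteness, "stale" here could be something as simple as an output being older than the inputs it was built from; a rough sketch (the function and paths are illustrative, not actual job-runner code):

```python
import os

def is_stale(output_path, input_paths):
    """An output is stale if any file it was built from has been
    modified more recently than the output itself."""
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > output_mtime for p in input_paths)

# e.g. before using a patient_data.dta built from an older input.csv:
if is_stale("output/patient_data.dta", ["output/input.csv"]):
    print("Warning: output/patient_data.dta is stale; regenerate or continue?")
```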
I just half-wrote and then deleted three scenarios, which it turns out are all dealt with nicely by those suggestions. So, yes, I think that would work well.
A real-world example for discussion (along the lines of @wjchulme's third category):
I added a single new field to the study definition and created a separate new notebook to analyse the data including this column. I'm now rerunning the cohort extraction so I can run the new notebook. However, I'm not rerunning the other analyses because (a) I don't need them as inputs for the new notebook and (b) it's running on the same cut of data in the backend anyway so the outputs should be identical to before.
Thoughts arising:
I wrote the following to clarify my understanding. There are a couple of questions there, though 🙂
(Item 1 from @wjchulme.)
If we're thinking about creating a hash of the output of the action (e.g. `output/input.csv`), then a change to the action shouldn't trigger a cascade to downstream actions. From my understanding, though, it would when running in development, because the outputs of actions are randomly generated. It shouldn't when running in production, although are we confident that each backend will return the same results from the same query run at different times?
(Item 2 from @wjchulme.)
This is the approach that DVC takes: the `run` command adds a stage (an action) with dependencies (`-d`) and outputs (`-o` and `-O`), accepting paths (to either files or directories) rather than stages. However, it can be verbose and -- speaking from experience -- confusing. DVC has the `dag` command to visualise the pipeline, which helps.
(Item 3 from @wjchulme and from @HelenCEBM.)
Might it help if we conceptually separate changes to the schema and changes to the values? Here, the schema has changed but this change has no bearing on the downstream actions: in the example, a new column has been created, but the existing columns have not been updated. Notice, then, that our dependencies are neither outputs nor actions; they're columns.
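To make that concrete, a pandas sketch (the column names follow the example above; this is just one possible realisation, and as noted below it's unclear it generalises across languages):

```python
import hashlib
import pandas as pd

def hash_columns(csv_path, columns):
    """Fingerprint only the named columns, so unrelated schema changes
    (e.g. a new column C) don't register as a change."""
    frame = pd.read_csv(csv_path, usecols=columns)
    data = frame[sorted(columns)].to_csv(index=False).encode()
    return hashlib.md5(data).hexdigest()

# A downstream action that reads only A and B keeps the same
# fingerprint when column C is added upstream.
fingerprint = hash_columns("output/input.csv", ["A", "B"])
```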
> outputs of actions are randomly generated
They are random but they use the same seed each time, so this shouldn't be an issue. See https://github.com/opensafely-core/cohort-extractor/issues/507
> Notice, then, that our dependencies are neither outputs nor actions; they're columns.
Ah yes, this is a very helpful way to conceptualise it. But as @evansd says I'm not sure it's feasible to try to establish a general framework for realising that concept in a reliable and automated way, across multiple languages.
> people should default to treating `main` as the default workspace (@sebbacon)

Hopefully the above PR is a nudge in this direction.
Could I just check my understanding, @wjchulme? From reading opensafely-core/cohort-extractor#507, it's my understanding that the intention is to use the same seed. However, this feature hasn't been implemented. I'm getting different dummy data when rerunning cohortextractor, you see.
As @remlapmot highlighted, adding the following to the study definition ensures that everything but `patient_id` is the same.

```python
import numpy
numpy.random.seed(42)
```
> this feature hasn't been implemented
ah yes!
For example, suppose `prepare_dataset` takes `input.csv` from `generate_cohort` and produces a `patient_data.dta` file. We run `prepare_dataset`, which automatically invokes `generate_cohort`, and now we have `input.csv` and `patient_data.dta` on disk.

If we update the study definition and re-run `generate_cohort` we will have a new `input.csv` file on disk. But we will also still have the old `patient_data.dta` file, produced from an older cohort. If we add further actions which depend on `prepare_dataset`, they will use this old file rather than re-running `prepare_dataset`.

My proposal is that by running `generate_cohort`, all files produced by "downstream" actions should be removed.

I think this is, in one sense, unarguably the correct behaviour; but it's also quite a drastic and potentially destructive change, so it could do with some thinking through first.
(Inspired by thinking about the issues raised in https://github.com/opensafely/documentation/pull/217)
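For concreteness, a minimal sketch of the proposed invalidation (the pipeline structure and the `run_model`/`model.txt` names are made up for illustration; this is not the job-runner's actual implementation):

```python
from pathlib import Path

# Hypothetical pipeline: action -> (actions it needs, files it produces).
PIPELINE = {
    "generate_cohort": ([], ["output/input.csv"]),
    "prepare_dataset": (["generate_cohort"], ["output/patient_data.dta"]),
    "run_model": (["prepare_dataset"], ["output/model.txt"]),
}

def downstream_of(action):
    """All actions that depend, directly or transitively, on `action`."""
    dependants = {
        name for name, (needs, _) in PIPELINE.items() if action in needs
    }
    for name in list(dependants):
        dependants |= downstream_of(name)
    return dependants

def invalidate_downstream(action):
    """Delete every file produced by an action downstream of `action`."""
    for name in downstream_of(action):
        for output in PIPELINE[name][1]:
            # missing_ok requires Python 3.8+
            Path(output).unlink(missing_ok=True)

# Re-running generate_cohort would remove patient_data.dta and model.txt.
invalidate_downstream("generate_cohort")
```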