opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases

Improve expectations framework #221

Open wjchulme opened 4 years ago

wjchulme commented 4 years ago

The Expectations framework currently allows you to specify the expected distribution of each variable independently. Here's a list of features that would be useful to see.

Inter-variable dependencies

For example, to specify that Metformin is only prescribed to people with diabetes. Additive Bayesian Networks (sounds complicated but isn't) or simulations from DAGs offer a convenient framework, and could maybe be incorporated into the existing expectations framework like this:


```
StudyDefinition(

    # ...

    age=patients.age_as_of(
        "2020-02-01",
        return_expectations={
            "rate": "universal",
            "int": {"distribution": "population_ages"},
        },
    ),
    sex=patients.sex(
        return_expectations={
            "rate": "universal",
            "category": {"ratios": {"M": 0.49, "F": 0.51}},
        }
    ),
    chronic_cardiac_disease=patients.with_these_clinical_events(
        chronic_cardiac_disease_codes,
        returning="binary_flag",
        return_expectations={
            # hypothetical syntax: incidence driven by a GLM on age and sex
            "incidence": glm(-4 + 0.05*age - 0.1*sex:F, link='logistic').sample(),
        },
    ),
    metformin=patients.with_these_medications(
        metformin_codes,
        returning="binary_flag",
        return_expectations={
            # hypothetical syntax: proportions conditional on diabetes_type2
            "incidence": {condition=[diabetes_type2=="yes", diabetes_type2=="no"], categories=["yes", "no"], prop=[[0.9, 0.1], [0, 1]]},
        },
    ),

    # ...

)
```

This just says we want age and sex not to depend on any other variable, but we want chronic_cardiac_disease to depend on age and sex, such that the log-odds of chronic cardiac disease increase by 0.05 for every year of age, decrease by 0.1 if you're female, and -4 is the baseline log-odds when age=0 and sex:F=0. Basically, chronic_cardiac_disease depends on age and sex via a Generalised Linear Model. This is how Additive Bayesian Networks work. It's not difficult to extend to, e.g., survival data.
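To make the mechanics concrete, here's a minimal sketch (numpy, with the made-up coefficients from the example above) of how a logistic GLM could drive the sampling of a dependent binary variable; the `glm(...)` call in the study definition above is purely hypothetical syntax:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

age = rng.integers(0, 100, size=n)     # stand-in for a realistic population age distribution
sex_f = rng.binomial(1, 0.51, size=n)  # 1 = female

# log-odds of chronic cardiac disease as a linear function of age and sex
log_odds = -4 + 0.05 * age - 0.1 * sex_f
prob = 1 / (1 + np.exp(-log_odds))     # inverse-logit link
chronic_cardiac_disease = rng.binomial(1, prob)
```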

For purely categorical dependencies, it may be easier to specify expected proportions explicitly rather than through a GLM, as done with metformin above.
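A minimal sketch of what that conditional-proportions sampling could look like under the hood (the variable names just mirror the metformin example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# proportions of metformin == ["yes", "no"], conditional on diabetes_type2
conditional_props = {
    "yes": [0.9, 0.1],  # most people with type 2 diabetes are prescribed metformin
    "no":  [0.0, 1.0],  # nobody without it is
}

def sample_metformin(diabetes_type2):
    return rng.choice(["yes", "no"], p=conditional_props[diabetes_type2])
```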

A bonus for this is that if you can capture all the return_expectations arguments and put them in a list, you can use it to create a DAG that describes the dependencies you've specified. This makes it easier to check for silliness. So for the above it would show a graph like "sex --> ccd <-- age". In theory, you could create the DAG first in a different framework and then import it to the Study Definition.
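A sketch of how the harvested dependencies could be turned into a checkable graph (networkx; the dependency dict here is hypothetical and would be derived from the return_expectations arguments):

```python
import networkx as nx

# hypothetical: parents harvested from the return_expectations arguments
dependencies = {
    "age": [],
    "sex": [],
    "chronic_cardiac_disease": ["age", "sex"],
    "metformin": ["diabetes_type2"],
}

dag = nx.DiGraph()
for child, parents in dependencies.items():
    dag.add_node(child)
    dag.add_edges_from((parent, child) for parent in parents)

# catch circular definitions, then draw/export the graph for eyeballing
assert nx.is_directed_acyclic_graph(dag)
```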

It may also be useful to generate latent / hidden variables that are used to derive variables, but are then dropped in the final dataset.

Inter-patient dependencies

For example, household size should match the number of patients with the same household ID (or at least be close), and patients within the same household should also reside in the same region. It'll get tricky depending on the complexity of the hierarchies you want to impose. The obvious way is to simulate IDs level-by-level starting with the highest (eg region) to the lowest (eg patient) where the child level inherits the properties/variables of its parent(s). This could be done either with nested study definitions, or with expectations that are able to count existing ID variables or sample (without replacement) from existing ID variables.
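A minimal sketch of the level-by-level idea (pandas/numpy, with made-up sizes), where households inherit a region and patients inherit a household, so household size matches the number of patients sharing a household ID by construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# highest level: regions
regions = pd.DataFrame({"region_id": range(5)})

# households inherit a region
households = pd.DataFrame({
    "household_id": range(1000),
    "region_id": rng.choice(regions["region_id"].to_numpy(), size=1000),
    "household_size": rng.integers(1, 7, size=1000),
})

# patients inherit household_id and region_id; repeating each household row
# household_size times guarantees the size matches the patient count
patients = households.loc[
    households.index.repeat(households["household_size"])
].reset_index(drop=True)
patients["patient_id"] = patients.index
```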

More flexible random variables

Currently only a few distributions are supported. Is there a comprehensive random variable library that's easy to learn, well documented, and has consistent syntax that could be used instead? numpy.random? random? So rather than "float": {"distribution": "normal", "mean": 35, "stddev": 10}, we would specify the underlying random variable function much more directly, e.g. normal(mean=35, sigma=10) (or dist="normal", mean=35, sigma=10, or whatever syntax works best). The advantage is that we don't have to manually add back-end support for new distributions; you just look up the library documentation. We'd obviously keep special distributions such as an age distribution that matches the UK population.
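As an illustration of the pass-through idea, a sketch that forwards a distribution name and parameters straight to numpy.random (note numpy's own parameter names are loc/scale rather than mean/sigma):

```python
import numpy as np

rng = np.random.default_rng()

def sample(distribution, size, **params):
    # any distribution numpy.random.Generator supports ("normal", "poisson",
    # "gamma", ...) works without bespoke back-end code
    return getattr(rng, distribution)(size=size, **params)

ages = sample("normal", size=1000, loc=35, scale=10)
counts = sample("poisson", size=1000, lam=2)
```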

Tweaking dummy datasets after they've been generated

In some cases it may be much easier to alter the dummy data to conform to requirements after it's been generated, rather than try to define the requirements within expectations itself. For example, by adding if var1=="a" and var2==NaN then var3=0 after the StudyDefinition call. This would run when creating dummy data but not when running real data.

This is quite an informal approach but would be helpful because
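As a rough sketch of what such a post-generation tweak could look like (pandas; the file path and variable names are purely illustrative), run only when dummy data is being created:

```python
import pandas as pd

dummy = pd.read_csv("output/input.csv")  # illustrative path to the generated dummy data

# if var1 == "a" and var2 is missing, then var3 = 0
mask = (dummy["var1"] == "a") & dummy["var2"].isna()
dummy.loc[mask, "var3"] = 0

dummy.to_csv("output/input.csv", index=False)
```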

Automatically assessing agreement against real data

Significance tests for checking agreement between expected and observed distributions (e.g. the Kolmogorov–Smirnov test) are not likely to be helpful, as they will almost certainly be highly significant (curse of large numbers) without actually telling you anything useful about how the distributions differ. Plotting overlaid histograms for each variable would be more informative. If that's potentially disclosive, you could report quantile (proportion) differences for continuous (categorical) variables, and the "earth mover's distance" (L1-norm of expected-observed differences), which at least gives you a single-value metric of discrepancy in the unit of interest. This could be incorporated into the generate_report command. It would require simultaneous access to a dummy dataset and the real dataset.
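A minimal sketch of those non-disclosive summaries (numpy/scipy; the function names are my own):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def continuous_discrepancy(expected, observed, q=(0.1, 0.25, 0.5, 0.75, 0.9)):
    # per-quantile differences plus the earth mover's distance as a one-number summary
    return {
        "quantile_differences": dict(zip(q, np.quantile(observed, q) - np.quantile(expected, q))),
        "earth_movers_distance": wasserstein_distance(expected, observed),
    }

def categorical_discrepancy(expected_props, observed_props):
    # per-category proportion differences plus their L1 norm
    diffs = {c: observed_props.get(c, 0) - p for c, p in expected_props.items()}
    return {"proportion_differences": diffs, "l1_distance": sum(abs(d) for d in diffs.values())}
```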

sebbacon commented 4 years ago

This is great, thanks. I think it's useful to start by imagining how a user would specify expectations, so I found your example study definition really helpful. To that end, I'm wondering what these would look like:

Your idea about tweaking data after generation is interesting. I think you're suggesting it would still be a restricted grammar of some kind? If so, I'm not clear whether it's logically different from the current approach, or just stylistically different in that it might be a bit easier to read as a set of conditions that comes after the variable definitions?

As a general point, I'm particularly interested in discussing the limits of this approach. At some point it becomes more sensible to generate representative synthetic data, because eventually the complexity of the expectations would become unreadable. On the other hand, what I like about expectations (as opposed to synthetic data) is that they document your key assumptions.

My hypothesis is that there is a sweet spot for the expectations approach, and defining its limits will help us conceptualise its grammar, if that makes sense.

sebbacon commented 4 years ago

Some notes from recent Google presentation

sebbacon commented 3 years ago

Here is a correlation matrix approach using Stata which @SJWEvans pointed out:

https://ageconsearch.umn.edu/record/271702/files/sjart_st0382.pdf

Using this I simulated 4 variables; 20 million rows in about 20 secs on my Dell PC:

```
. rbinary x1 x2 x3 x4, means(.05,.10,.15,.20) ///
    corr(1,.3,.2,.1\.3,1,.3,.2\.2,.3,1,.3\.1,.2,.3,1) ///
    n(20000000) seed(12345)

. corr
(obs=20,000,000)

             |       x1       x2       x3       x4
-------------+------------------------------------
          x1 |   1.0000
          x2 |   0.2965   1.0000
          x3 |   0.1996   0.3024   1.0000
          x4 |   0.1008   0.2006   0.3010   1.0000

. su

    Variable |        Obs        Mean   Std. Dev.       Min        Max
-------------+----------------------------------------------------------
          x1 | 20,000,000    .0500423   .2180324          0          1
          x2 | 20,000,000    .1000606   .3000808          0          1
          x3 | 20,000,000    .149971    .357043           0          1
          x4 | 20,000,000    .1999458   .3999593          0          1
```
The code there begins with a multivariate Normal and converts it to binary. I’m sure it could be amended to have some continuous and some binary.
I haven’t checked how long it takes as a function of the number of variables, but will do so.

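For reference, a sketch of the same idea in Python (numpy/scipy): draw correlated multivariate-normal latents and threshold them so each column has the target marginal probability. Note the correlations on the binary scale come out attenuated relative to the latent ones; rbinary adjusts for that, this sketch doesn't:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12345)

means = np.array([0.05, 0.10, 0.15, 0.20])
corr = np.array([
    [1.0, 0.3, 0.2, 0.1],
    [0.3, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
])

# correlated latent normals, thresholded so each column has the right prevalence
latent = rng.multivariate_normal(np.zeros(4), corr, size=1_000_000)
binary = (latent < norm.ppf(means)).astype(int)
```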
StatsFizz commented 3 years ago

Will's point about dependencies is spot on. The approach of specifying an equation is fine for one or two variables to depend on others, but (i) it will get complicated very quickly, e.g. if you have 25 variables that you want to all be correlated, and (ii) really what you want is for the equation to mimic the real one in the data, which by definition you don't know (i.e., really you want the computer to pick those coefficients up from the real data). But you could input the form of how you want variables to be associated, e.g. "(lin regress) Y = age + sex" or "(logistic regress) coviddead = sex*cardiac + respiratory", and then Stata fits the appropriate model in the real data and extracts the coefficients. But that would make the process quite a bit slower, presumably. Or you could feed in the relationships you want to see in the dummy data in the form of a DAG, as Will suggested, which would be similar. A DAG would tell the computer which subset of variables should be used to generate the next. Again, basing the magnitude of relationships on the real data would be useful. It would be incredibly hard to specify realistic parameter estimates without seeing the data.

Along similar lines to the coefficient thing above, this is often how I simulate datasets: in the real data, I try to transform the variables (as far as is possible) to multivariate normal (MVN), extract the covariance matrix, simulate from the MVN, and then back-transform and convert to the appropriate format (binary, cat, cts). This might be the most generalisable and simple approach.
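A hedged sketch of that transform / simulate / back-transform recipe for numeric columns (pandas/numpy/scipy; it would have to run where the real data lives, and binary/categorical columns would need converting as a separate step):

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, rankdata

def simulate_like(real: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    # transform each column to approximately standard normal via its ranks
    z = np.column_stack(
        [norm.ppf((rankdata(real[col]) - 0.5) / len(real)) for col in real.columns]
    )
    # simulate from an MVN with the observed covariance structure
    sim = rng.multivariate_normal(np.zeros(z.shape[1]), np.cov(z, rowvar=False), size=n)
    # back-transform each column through the empirical quantiles of the real data
    return pd.DataFrame({
        col: np.quantile(real[col], norm.cdf(sim[:, i]))
        for i, col in enumerate(real.columns)
    })
```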

There is added complexity for some analyses, e.g. clustering within household. But I don't think we should aim to solve every possible problem first-up.

SJWEvans commented 3 years ago

I could be quite wrong, but I wonder if, with a DAG you could at least put the direction of the relationship between each variable where you know something, and then where you don’t know just put in a random correlation between say 0.1 and -0.1? I’m just trying to find a suggestion that would speed up code testing.

Another approach might be to obtain a correlation matrix on a 10% sample of the real data, then use that correlation matrix to create simulated data with the full size of the database. My impression was that quite a number of failures were from the size of the problem making the servers fall over, but I'm quite prepared to agree that all my ideas won't help to make faster progress with getting full analyses done. That was all I was trying to do.

All the best

Stephen

wjchulme commented 3 years ago

A prototype of multivariate / DAG-informed dummy data generation code in R is here, with a notebook explaining how it works: https://github.com/opensafely/mv-dummy-data

wjchulme commented 3 years ago

from @SJWEvans above

> I wonder if, with a DAG you could at least put the direction of the relationship between each variable where you know something, and then where you don’t know just put in a random correlation between say 0.1 and -0.1

The problem with only specifying correlations is that it doesn't extend naturally to different variable types. How does a correlation coefficient capture that diabetes (a y/n variable) will on average result in earlier death (a censored date variable)? Or that hospital admission dates must precede death dates? It only really works (and is commonly implemented) for MVN.

from @StatsFizz:

> It would be incredibly hard to specify realistic parameter estimates without seeing the data

This is very true, and a bit of a worry with the DAG approach. It would result in a fair amount of trial and error to make sure you have, say, a sensible distribution of death probabilities at the end of a long chain of dependencies. There are a few things that can help with this:

SJWEvans commented 3 years ago

Hi Will

I don’t understand the problems with binary variables. You can specify a correlation matrix for them. You can also allow for a time to event that is censored: you allow it to have a distribution (possibly log-normal as a crude approximation), but you have to add the censoring based on the time you generate, though that could be done once the data are outside the server. Similarly with hospital admission and death dates. You can generate implausible values, but as a step before any use is made of the data you run an “implausibility checker” to edit the data to get plausible values.

Perhaps all this is nonsense, but happy to discuss

Stephen

sebbacon commented 3 years ago

Lots to think about here, thanks all. The main thing that sticks out is this:

> It would be incredibly hard to specify realistic parameter estimates without seeing the data
>
> This is very true... There are a few things that can help with this:
>
> Encourage simplicity. The primary purpose of the dummy data is to help write analysis code without being blind to the data

I think this is absolutely central to the whole thing. The primary purpose of dummy data is so you can write any code at all. In my mind, it is not for the equation to mimic the real one in the data. At least in the first instance.

It's actually quite interesting how often even the current approach works for this. For example the households modelling -- my worry was that completely unrealistic data would fail to converge every time, which would result in absurdly long test runs. But with a little tweaking it fails to converge very quickly!

It might be useful to step back and identify where the naive approach has not worked, and what the minimum would be to fix those specific cases?


There is a secondary point I'd like to highlight, too. The reason I called it "expectations" was the (perhaps wild and silly) idea that it would give an analyst a useful opportunity to test their assumptions about the data systematically. Rather than waiting for the results of a big regression and saying "does that look right?", you'd look at the distributions of each variable much earlier, and ask the same sort of thing. My idea was that by asking the analyst to say "I think this variable would be normally distributed around 10", we could (semi-)automate a response: "hey, that variable is actually very right-skewed, is that surprising to you?" It would be a kind of assumption-debugging step early on in the process. Is this a good idea, or not a good idea?!
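A toy sketch of what a semi-automated check like that might look like (scipy; the thresholds and wording are entirely made up):

```python
import numpy as np
from scipy.stats import skew

def check_expectation(observed, expected_mean, skew_threshold=1.0):
    # flag variables whose observed shape departs obviously from the declared
    # "roughly normal around expected_mean" expectation
    observed = np.asarray(observed)
    messages = []
    if abs(skew(observed)) > skew_threshold:
        messages.append("this variable is strongly skewed -- is that surprising to you?")
    if abs(observed.mean() - expected_mean) > observed.std():
        messages.append(f"the observed mean is a long way from the declared {expected_mean}")
    return messages
```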

SJWEvans commented 3 years ago

I agree with your second bit: I still think you can do something better than just realistic univariable distributions.

Stephen

evansd commented 3 years ago

Just to add here, I've been thinking about a way for people to supply custom dummy-data-generating commands in their project pipelines. Then, instead of generating its own dummy data, the system would run the custom command and do some minimal validation of the output to check that its columns and types match those of the real study definition.

This would allow us to keep the default path simple while providing an escape hatch for the occasions when something more sophisticated is required.
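A rough sketch of what that minimal validation could look like (pandas; the function and the expected_dtypes mapping are hypothetical, and the real check would be derived from the study definition itself):

```python
import pandas as pd

def validate_dummy_data(dummy_csv, expected_dtypes):
    # expected_dtypes would come from the study definition,
    # e.g. {"patient_id": "int64", "age": "int64", "sex": "object"} (hypothetical)
    dummy = pd.read_csv(dummy_csv)
    missing = set(expected_dtypes) - set(dummy.columns)
    extra = set(dummy.columns) - set(expected_dtypes)
    mismatched = {
        col: (str(dummy[col].dtype), dtype)
        for col, dtype in expected_dtypes.items()
        if col in dummy.columns and str(dummy[col].dtype) != dtype
    }
    if missing or extra or mismatched:
        raise ValueError(
            f"dummy data does not match the study definition: "
            f"missing={missing}, extra={extra}, wrong types={mismatched}"
        )
```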

StatsFizz commented 3 years ago

It’s a good point. I don’t know if it’s helpful, but here are some of the things I’ve had problems with:

  1. Cluster analysis – I needed a certain number of cases/non-cases within groups defined by age, sex, and ethnicity. The code kept failing due to insufficient observations to cluster within those groups.
  2. MI looping over regions – kept failing because some regions didn’t have enough data to work.
  3. Taking case-cohort samples – when you sample 0.02/100 of males under 50 you get zero sample sizes and then the rest of the code fails. This could be sorted by a bigger initial sample and a reasonable proportion of events.
  4. All my A&E stuff failed on the server because the real data had a zero on 1/mar/2020 but the dummy data obviously didn’t.
  5. Programs to pick up non-constant coefficients from regression models didn’t work because there weren’t any (no correlation in data)
  6. Getting a sense of how long different options in lasso take. The time taken is entirely dependent on the structure of the data. I wrote a bunch of code which took ~ 3 mins in dummy data of the same overall size and months later we’re still waiting…
  7. Comparing full data with sampling approaches – had to run on the server to get a sense of whether you gain anything time-wise. Again, you need the correlation structure for this sort of thing.
  8. Debugging new methods – if you don’t know the convergence properties of a method then it’s very hard to predict which characteristics of the data are important (it’s not realistic to get dummy data to achieve this)

Some of these things are not worth striving for, e.g. things like 7 we can’t anticipate so there’s no point in trying. Things like 6 would be useful, but it’s only possible if you can replicate the correlation structure, which is complex and may not be worth the effort.

Broad things I think would be useful:

I haven’t really thought through what we’re trying to achieve. I just know that I often make my own dummy data (or remake the given dummy data) to test my code on. It’s often a size thing – I don’t want to generate an initial dummy dataset of 17 million rows, because then the early testing would take days; but then the later bits don’t have sufficient sample sizes…

wjchulme commented 3 years ago

@evansd taking the dummy data generating process outside of a study definition is something I'm really keen to see. If it can support a completely external method (e.g. you could use the method I wrote in R), then this is a useful fallback if internal options aren't quite doing the business. Checking that variable types are matched between dummy and real data is also a must, I think.

@StatsFizz This is really useful, thank you. I hadn't considered some of those issues before. A recurring theme is low counts/cases for some categories or combinations of characteristics -- it's certainly possible to guarantee sufficient counts with the right pDAG, but it might not be easy to actually identify that pDAG without a lot of trial and error (unless you use a very simple model for the variable/strata of interest, but that might not always be possible). One option is to make it possible to impose a lower bound on the probabilities that generate a particular value of a categorical variable. I'll think about this a bit more!
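A sketch of that lower-bound idea (numpy; the floor value is arbitrary): clip the category probabilities and renormalise before sampling, so rare strata still turn up in the dummy data:

```python
import numpy as np

def with_floor(probabilities, floor=0.01):
    # guarantee every category at least `floor` probability, then renormalise
    p = np.maximum(np.asarray(probabilities, dtype=float), floor)
    return p / p.sum()

rng = np.random.default_rng(2)
# e.g. a stratum that would otherwise produce near-zero counts
category = rng.choice(["A", "B", "C"], size=10_000, p=with_floor([0.0002, 0.2, 0.7998]))
```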

A note about "replicating the correlation structure". This could mean a few different things and I think it's worth being explicit about what these are to avoid confusion (this is for my benefit as much as anyone else):

sebbacon commented 3 years ago

Proposal for incremental ratcheting improvements:

  1. Support a "dummy data" mode which allows a user to write their own pipeline action to generate test data; this action is only run when generating dummy data
  2. The output of "dummy data" actions must be validated against study definitions: that is, their column types (and whether they are nullable) must match
  3. Move expectations definitions to their own file. Generating dummy data is then just a different implementation of the "dummy data" mode above (i.e. running opensafely generate_dummy or similar)
  4. This then makes it possible to support several flavours of dummy data / expectations, in parallel. We could keep the univariate implementation, for example, and also provide an Additive Bayesian Network implementation, and a correlation matrix, and something that provides random samples from parametric estimates of underlying distributions in real data

FWIW, as someone who doesn't really know what they are talking about, I find the Additive Bayesian Network the most parsimonious / intuitive way to express most of the things discussed here. Not knowing what I'm talking about may be a feature or a bug in this context...

If we took that approach, I'm finding it hard to find reasons not to use Stan for it.

sebbacon commented 3 years ago

@wjchulme I suggest we have a call about this and then make a final decision together, book me in!

sebbacon commented 2 years ago

Just to note, Plasmode Simulations came up earlier:

paper, R package

wjchulme commented 2 years ago

Thanks, Seb. If we want to create dummy extract data, or dummy database data, we need to simulate from a "data model", not a "statistical model". So I don't think this method is general enough to be useful for our needs, as it requires a specification of an underlying statistical model of interest. Usually, for any given study, we'd be interested in lots of different models of this type (for different outcomes, different exposures, etc.). And it doesn't naturally accommodate variables upstream of the model, e.g. raw dates, variables used as inputs for derived variables, or for defining study eligibility, or even just variables used to describe the cohort that aren't used in a statistical model itself.