pzivich / zEpid

Epidemiology analysis package
http://zepid.readthedocs.org
MIT License

generalize branch #80

Closed pzivich closed 5 years ago

pzivich commented 5 years ago

In the future, I think a zepid.causal.generalize branch would be a useful addition. This branch would contain some generalizability and transportability tools. Specifically, the g-transport formula, inverse probability of sampling weights, inverse odds of sampling weights, and doubly robust generalizability.

Generally, I think I can repurpose a fair bit of the existing code. I need to consider how best to handle the distinction between generalizability (the sample is drawn from the target population) and transportability (the sample is not from the target population). I imagine the user would pass in two data sets: one to estimate the model on, and another to generalize to.

As far as I know, these tools are largely lacking in all other commonly used epi software. Plus, this is becoming an increasingly hot topic in epi (and I think it will catch on more widely once people recognize you can go from a biased sample to a full population under an additional set of assumptions).

Resources:

Notes: some record of Issa's presentation at CIRG. This concerns the more difficult doubly robust estimator, for the case where only a subset of the cohort has a measure needed for transportability. Rather than throwing out the individuals who don't have X2 measured, you can use the process in the arXiv paper. For the nested-nested population, the robust estimator has three parts. b_a(X_1, S) is harder to estimate, but you can use the following algorithm:

  1. model Y as a function of X={X1, X2} among S=1 and A=a

  2. Predict \hat{Y} among those with D=1

  3. Model \hat{Y} as X1, S in D=1

  4. Predict \hat{Y*} in D={0, 1}

It is also hard to estimate Pr(S=1|X) because X2 is only observed for a subset. An m-estimator can be used to obtain it, via a weighted regression with weight 1 for D=1 & S=1 and weight 1/C(X1, S) for D=1 & S=0. This is a little less obvious to me but seems doable.

pzivich commented 5 years ago

Marking to look at later: https://tech.wayfair.com/2018/10/pylift-a-fast-python-package-for-uplift-modeling/

Does some similar stuff, but finds the ideal customer to target. Might be a useful approach to look into. Slightly different from the generalizability/transportability problem, but closely related.

pzivich commented 5 years ago

How g-transport should work:

1) Fit the g-formula to the sample with observed exposure/outcome
2) Use the fitted model to predict over the full sample
3) Take the mean to generate the marginal outcome

pzivich commented 5 years ago

As an interesting aside, I think IPSW is going to be broadly more useful. The reason is that you don't need the confounders measured in the target sample (you only need the modifiers). For the g-transport formula or the doubly robust estimator, you need both.

This only applies to observational data where you would need to account for confounding. However, that's still a major point in favor of IPSW

pzivich commented 5 years ago

As a reminder for myself, the g-transport formula and IPSW give somewhat different estimates in sample data. These differences are just a result of the sample taken. I performed some simulations to check that both are approximately unbiased. Below is a quick simulation to demonstrate that.

import numpy as np
import pandas as pd
from scipy.stats import logistic
from zepid.causal.generalize import IPSW, GTransportFormula

g_gen = []
i_gen = []

for i in range(200):
    n = 1000
    df = pd.DataFrame()
    df['L'] = np.random.binomial(1, p=0.35, size=n)
    df['A'] = np.random.binomial(1, p=0.5, size=n)
    df['Y1'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 1 + 0.3 * df['L'] - 1.1 * 1 * df['L']), size=n)
    df['Y0'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 0 + 0.3 * df['L'] - 1.1 * 0 * df['L']), size=n)
    df['Y'] = df['Y1'] * df['A'] + df['Y0'] * (1 - df['A'])
    df['S'] = np.random.binomial(1, p=0.67 - 0.4 * df['L'], size=n)

    ipsw = IPSW(df, exposure='A', outcome='Y', selection='S')
    ipsw.regression_models('L', print_results=False)
    ipsw.fit()
    i_gen.append(ipsw.risk_difference - 0.09)  # true RD in the target population is ~0.09

    gtransport = GTransportFormula(df, exposure='A', outcome='Y', selection='S', generalize=True)
    gtransport.outcome_model('A + L + L:A', print_results=False)
    gtransport.fit()
    g_gen.append(gtransport.risk_difference - 0.09)

print('Bias - IPSW:', np.mean(i_gen))
print('Bias - G:', np.mean(g_gen))

As I would expect (knowing the data-generating mechanism), IPSW does slightly better (bias approaches zero at a faster rate). That makes sense, since IPSW has less to model than the g-transport formula.

Results (2019/3/13)

Bias - IPSW: -0.002591761051227205
Bias - G: -0.0026239656430772397
pzivich commented 5 years ago

I am having trouble getting AIPSW to be a consistent estimator in the above example when g-transport is misspecified (i.e. Y ~ A + L). It does better than the g-transport but AIPSW doesn't get closer to zero when I bump the sample size from 10,000 to 1,000,000

Maybe an edge case since L is a strong modifier? Either way, I need to dig deeper into this

UPDATE: based on some simulations g-transport seems to dominate between the two measures. I suppose that makes sense based on the formula? I think the above is an edge case. At least it seems like that when I adapt the above code to:

from zepid.causal.generalize import AIPSW  # also needed below

i_gen, g_gen, a_gen = [], [], []  # reset the bias accumulators

for i in range(200):
    n = 1000
    df = pd.DataFrame()
    df['C'] = np.random.binomial(1, p=0.4, size=n)
    df['L'] = np.random.binomial(1, p=0.8*df['C'] + 0.2*(1-df['C']), size=n)
    df['A'] = np.random.binomial(1, p=0.5, size=n)
    df['Y1'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 1 + 0.3 * df['L'] - 1.1 * 1 * df['L']), size=n)
    df['Y0'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 0 + 0.3 * df['L'] - 1.1 * 0 * df['L']), size=n)
    df['Y'] = df['Y1'] * df['A'] + df['Y0'] * (1 - df['A'])
    df['S'] = np.random.binomial(1, p=0.67 - 0.4 * df['L'] + df['C']*0.07, size=n)

    true = 0.03817  # true risk difference in the target population for this DGP

    ipsw = IPSW(df, exposure='A', outcome='Y', selection='S', generalize=False)
    ipsw.regression_models('L', print_results=False)
    ipsw.fit()
    i_gen.append(ipsw.risk_difference - true)

    gtransport = GTransportFormula(df, exposure='A', outcome='Y', selection='S', generalize=False)
    gtransport.outcome_model('A + C + C:A', print_results=False)
    gtransport.fit()
    g_gen.append(gtransport.risk_difference - true)

    aipw = AIPSW(df, exposure='A', outcome='Y', selection='S', generalize=False)
    aipw.weight_model('L', print_results=False)
    aipw.outcome_model('A + C + C:A', print_results=False)
    aipw.fit()
    a_gen.append(aipw.risk_difference - true)

Gives me

Bias - IPSW: -0.017543562941460363
Bias - G: 0.0503045192696528
Bias - AIPSW: 0.017478819696215078

which is more of what I would have expected. Additionally, bumping up the sample to 1 million for a single run has AIPSW looking like a consistent estimator. I think I just may have found an example that "breaks" AIPSW, merely because there is so little variation in estimation. Unlikely to ever happen in practice, since there is often a multitude of modifiers in all studies. Furthermore, it only "breaks" in that it didn't approach 0 bias that quickly (it still did so faster than the misspecified g-transport, so it is still the better estimator).

pzivich commented 5 years ago

The generalizability estimators released in v0.6.0 are unbiased in the simulations I have run.

Issue will remain open. Still need to implement the 3-part doubly robust estimator. It will be on the back-burner for me, since the utility of that estimator is not as straightforward as the others.

pzivich commented 5 years ago

The 3-part estimator has a narrower use-case and is more complicated to implement. As a result, I will delay implementation. May re-open at some point (or if users want this implemented).