judithabk6 / med_bench


[WIP] Global refactoring #90

Open brash6 opened 4 months ago

brash6 commented 4 months ago

The goal of this issue is to describe the future global refactoring of the package. The aim of this refactoring is to:

brash6 commented 4 months ago

Today, the overall med_bench structure is as follows:

└── src
    ├── med_bench
    │   ├── get_estimation.py
    │   ├── get_simulated_data.py
    │   ├── mediation.py
    │   └── utils
    │       ├── constants.py
    │       ├── nuisances.py
    │       └── utils.py
    └── tests
        ├── estimation
        │   ├── generate_tests_results.py
        │   ├── test_exact_estimation.py
        │   ├── test_get_estimation.py
        │   └── tests_results.npy
        └── simulate_data
            └── test_get_simulated_data.py

Following the work started by @houssamzenati on modularizing the estimation functions, I suggest we refactor the package as follows:

└── src
    ├── med_bench
    │   ├── get_estimation.py
    │   ├── get_simulated_data.py
    │   ├── estimation
    │   │   ├── base.py
    │   │   ├── dml.py
    │   │   ├── g_computation.py
    │   │   ├── ipw.py
    │   │   ├── linear.py
    │   │   ├── mr.py
    │   │   └── tmle.py
    │   ├── nuisances
    │   │   ├── conditional_outcome.py
    │   │   ├── cross_conditional_outcome.py
    │   │   ├── density.py (to be removed?)
    │   │   ├── kme.py (to be removed?)
    │   │   ├── propensities.py
    │   │   └── utils.py
    │   └── utils
    │       ├── config.py
    │       ├── decorators.py
    │       ├── kernel.py (to be removed?)
    │       ├── loader.py
    │       ├── parse.py
    │       ├── plots.py
    │       ├── scores.py
    │       └── utils.py
    └── tests
        ├── estimation
        │   ├── generate_tests_results.py
        │   ├── base.py
        │   ├── test_dml.py
        │   ├── test_g_computation.py
        │   ├── test_ipw.py
        │   ├── test_linear.py
        │   ├── test_mr.py
        │   ├── test_tmle.py
        │   ├── test_utils.py
        │   └── tests_results.npy
        └── simulate_data
            └── test_get_simulated_data.py

With the following properties (to be discussed together in more detail):

Regarding the implementation, I suggest two options:

These overall suggestions are open for discussion, and I would be happy to have your input on this.
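
For concreteness, here is a minimal sketch of the shared interface that estimation/base.py could expose; the class and method names below are assumptions to be settled together, not a fixed API:

from abc import ABC, abstractmethod


class Estimator(ABC):
    """Hypothetical common interface for all mediation estimators."""

    @abstractmethod
    def fit(self, X, t, m, y):
        """Fit the nuisance models on the training data."""

    @abstractmethod
    def estimate(self, X, t, m, y):
        """Return the estimated total, direct and indirect effects."""

Each estimator module (dml.py, g_computation.py, ipw.py, linear.py, mr.py, tmle.py) would then subclass this base and only implement its own nuisance fitting and effect computation.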

FYI: @judithabk6 @bthirion @houssamzenati

bthirion commented 4 months ago

Thx for the suggestions. I think that this clarifies the design. A few comments:

houssamzenati commented 4 months ago

Thanks for the great push @brash6! Here are a few remarks on my side:

As for the specifics of the submodules:

Implementation remarks: I think your suggestions on the branches are great. Thanks for the propositions.

judithabk6 commented 4 months ago

We took some time last Thursday to discuss the signature of the estimators, which should be a starting point for thinking about the refactor. Here is a first draft; we did not reach a consensus, but we can restart the discussion from this and iterate.

We would have a wrapper that then calls the right function:

mediated_effects(method="tmle",
                 mediator_type="binary",
                 nuisance_mu=SklearnRegressor(),
                 nuisance_prop=SklearnClassifier(),
                 nuisance_prop_mediator=SklearnClassifier(),
                 nuisance_density_mediator=kme())

with the nuisances:

cross_conditional_mean_outcome: $E[Y(t, M(t'))] = E[E[Y|T=t, M=m, X=x]|T=t']$
propensity: $P(T=1|X=x)$
propensity_mediator: $P(T=1|X=x, M=m)$
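
A possible shape for this wrapper, just to make the dispatch concrete (the registry, function name, and import paths below are assumptions based on the proposed layout, not a settled API):

# Hypothetical dispatch from the method name to the corresponding estimator class,
# assuming the classes live in the proposed estimation submodule.
from med_bench.estimation.dml import DML
from med_bench.estimation.mr import MultiplyRobust
from med_bench.estimation.tmle import TMLE

_ESTIMATORS = {"dml": DML, "multiply_robust": MultiplyRobust, "tmle": TMLE}


def mediated_effects(method, **nuisance_kwargs):
    """Instantiate the estimator registered under `method` with the given nuisances."""
    if method not in _ESTIMATORS:
        raise ValueError(f"Unknown method {method!r}, expected one of {sorted(_ESTIMATORS)}")
    return _ESTIMATORS[method](**nuisance_kwargs)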

judithabk6 commented 4 months ago

List of wanted estimators:

| estimator | binary | continuous | multidim | comment |
| --- | --- | --- | --- | --- |
| CoefficientProduct | x | x | x | linear regression for each dimension of M and for Y (allow penalisation); maybe one regression object for M and one for Y |
| GComputation | x | x | x | binary/discrete: classifier for M (multiclass if discrete), regression for Y; continuous/multidim: implicit cross-conditioning like _estimate_cross_conditional_mean_outcome_nesting, 1 classifier and 1 regressor fitted separately for T=1 and T=0 |
| IPW | x | x | x | _estimate_treatment_probabilities to split in 2, for P(T=1\|X) and P(T=1\|X,M); 1 classifier, instantiated twice |
| MultiplyRobust | x | x | x | binary/discrete: classifier for M (multiclass if discrete), _estimate_cross_conditional_mean_outcome for Y; continuous/multidim: implicit cross-conditioning like _estimate_cross_conditional_mean_outcome_nesting, 1 classifier and 1 regressor fitted separately for T=1 and T=0 |
| DML | x | x | x | binary/discrete: classifier for M (multiclass if discrete), _estimate_cross_conditional_mean_outcome for Y; continuous/multidim: implicit cross-conditioning like _estimate_cross_conditional_mean_outcome_nesting, 1 classifier and 1 regressor fitted separately for T=1 and T=0 |
| TMLE | x | x | x | default ratio of propensity scores, treatment_probabilities, _estimate_cross_conditional_mean_outcome_nesting |
train_data, test_data = split_train_test(data)

MultiplyRobust(mediator_type="binary",
               nuisance_mu=SklearnRegressor(),
               nuisance_prop=SklearnClassifier(),
               nuisance_prop_mediator=SklearnClassifier())\
    .fit(train_data.X, train_data.t, train_data.m, train_data.y)\
    .estimate(test_data.X, test_data.t, test_data.m, test_data.y)

This does not work for DML with cross-fitting:

DML(...., cross_fitting=False)\
    .fit(train_data.X, train_data.t, train_data.m, train_data.y)\
    .estimate(test_data.X, test_data.t, test_data.m, test_data.y)

DML(...., cross_fitting=True).fit_estimate(data)

or a method to implement cross-fitting with other estimators:

result = []
cf_iterator = KFold(k=5)
for data_train, data_test in cf_iterator:
    result.append(DML(...., cross_fitting=False)
        .fit(data_train.X, data_train.t, data_train.m, data_train.y)
        .estimate(data_test.X, data_test.t, data_test.m, data_test.y))
np.mean(result)

data would be a namedtuple or a dictionary.
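
A minimal sketch of what such a container and the split_train_test helper used above could look like, assuming a namedtuple with the fields used in the snippets (names and split logic are illustrative only):

from collections import namedtuple

import numpy as np

# Hypothetical data container with the fields used in the snippets above.
MediationData = namedtuple("MediationData", ["X", "t", "m", "y"])


def split_train_test(data, test_fraction=0.5, seed=0):
    """Randomly split a MediationData instance into a train and a test part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data.y))
    cut = int(len(idx) * (1 - test_fraction))
    subset = lambda rows: MediationData(data.X[rows], data.t[rows], data.m[rows], data.y[rows])
    return subset(idx[:cut]), subset(idx[cut:])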

judithabk6 commented 1 week ago

@bthirion do you remember what we chose to do for cross-fitting and model selection?

For now, maybe let's go for the third solution, and we will add a fit_estimate method later if we need it? In that case, we should raise a NotImplementedError and explain that, to do something similar to cross-fitting, the user should do:

result = []
cf_iterator = KFold(k=5)
for data_train, data_test in cf_iterator:
    result.append(DML(...., cross_fitting=False)
        .fit(data_train.X, data_train.t, data_train.m, data_train.y)
        .estimate(data_test.X, data_test.t, data_test.m, data_test.y))
np.mean(result)
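
As an illustration of the proposed guard, here is a minimal sketch of how DML could refuse built-in cross-fitting for now (the attribute name and error message are assumptions):

class DML:
    """Hypothetical sketch; only the cross-fitting guard is shown."""

    def __init__(self, cross_fitting=False, **nuisances):
        self.cross_fitting = cross_fitting
        self.nuisances = nuisances

    def fit(self, X, t, m, y):
        if self.cross_fitting:
            raise NotImplementedError(
                "Built-in cross-fitting is not supported yet; loop over KFold "
                "splits, call fit/estimate on each, and average the estimates."
            )
        # ... fit the nuisance models on (X, t, m, y) ...
        return self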

@brash6

bthirion commented 1 week ago

To clarify: you mean that we do not support built-in cross-fitting for now, because there is not enough evidence that it is needed? That sounds reasonable. We need to adjust our ambitions to the time we have.

judithabk6 commented 1 week ago

Yes, and we should probably add it back rather soon, but right now I am struggling to get a global view of everything that is going on.