judithabk6 / med_bench

Merging DoubleMachineLearning into MultiplyRobust #96

Open brash6 opened 1 day ago

brash6 commented 1 day ago

@houssamzenati Inside the med_bench_prototype code, you implemented MultiplyRobust's fit function like this:

def fit(self, t, m, x, y):
    """Fits nuisance parameters to data."""
    # bucketize if needed
    t, m, x, y = self.resize(t, m, x, y)

    # fit nuisance functions
    self.fit_score_nuisances(t, m, x, y)

    if self._ratio == 'density':
        self.fit_treatment_propensity_x_nuisance(t, x)
        self.fit_mediator_nuisance(t, m, x)

    elif self._ratio == 'propensities':
        self.fit_treatment_propensity_x_nuisance(t, x)
        self.fit_treatment_propensity_xm_nuisance(t, m, x)

    else:
        raise NotImplementedError

    if self._procedure == 'nesting':
        self.fit_cross_conditional_mean_outcome_nuisance(t, m, x, y)

    elif self._procedure == 'discrete':
        if not is_array_integer(m):
            self._bucketizer = KMeans(n_clusters=10, random_state=self.rng,
                                      n_init="auto").fit(m)
            m = self._bucketizer.predict(m)
        self.fit_mediator_nuisance(t, m, x)
        self.fit_cross_conditional_mean_outcome_nuisance_discrete(t, m, x, y)

    else:
        raise NotImplementedError

    self._fitted = True

    if self.verbose:
        print("Nuisance models fitted")

We can see that there are two discriminating variables: ratio and procedure.
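To make the two switches easier to scan, here is a small self-contained summary of the dispatch above (plain Python written for this discussion, not part of the med_bench API):

# Descriptive summary of the dispatch in fit() above: which nuisance models
# each (ratio, procedure) combination fits. Derived from the snippet, not an API.
NUISANCES_BY_RATIO = {
    "density": ["treatment_propensity_x", "mediator"],
    "propensities": ["treatment_propensity_x", "treatment_propensity_xm"],
}
NUISANCES_BY_PROCEDURE = {
    "nesting": ["cross_conditional_mean_outcome"],
    # 'discrete' also bucketizes a continuous mediator with KMeans first
    "discrete": ["mediator", "cross_conditional_mean_outcome_discrete"],
}

def nuisances(ratio, procedure):
    """List the nuisance models fitted for a given (ratio, procedure) pair."""
    return NUISANCES_BY_RATIO[ratio] + NUISANCES_BY_PROCEDURE[procedure]

print(nuisances("propensities", "discrete"))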

From my understanding of our discussion today, with some questions in the middle:

fyi @judithabk6

bthirion commented 1 day ago

I may have missed some of the previous steps, but I would add the logic to infer such parameters automatically after the current refactoring. It's an optimization, not really a must-have imho.

judithabk6 commented 1 day ago

@bthirion one of the first steps included some legacy code from @houssamzenati, so we are rather reverting back to the current state.

Not a must-have, but also not so complicated, I think?

houssamzenati commented 1 day ago

@brash6 yes, I agree with what you wrote, it is indeed what we said yesterday. @bthirion we are indeed trying to implement automatic inference of such variants; the only parameter that will be left is ratio='propensities' or 'density' for the mediator m, and only when it is discrete. Would you like to also enforce a single option in that case? Judith and I thought that, in that case, it was beneficial to leave the choice.
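For what it's worth, a rough sketch of one possible reading of that rule, not the actual med_bench implementation; is_array_integer mirrors the helper used in the fit() snippet, and the fallback values chosen for a continuous mediator are illustrative assumptions only:

import numpy as np

def is_array_integer(m):
    """Hypothetical helper: True if the mediator array holds integer codes."""
    return np.issubdtype(np.asarray(m).dtype, np.integer)

def infer_variants(m, ratio="propensities"):
    """Infer (procedure, ratio); only `ratio` stays user-facing for a discrete m."""
    if is_array_integer(m):
        # discrete mediator: the user may still pick ratio='propensities'
        # or ratio='density'
        return "discrete", ratio
    # continuous mediator: assume the nested procedure with the
    # propensity-based ratio (illustrative default, not a project decision)
    return "nesting", "propensities"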

brash6 commented 1 day ago

@houssamzenati It's clear on my side, so:

judithabk6 commented 1 day ago

I agree with the 3 points

houssamzenati commented 3 hours ago

yes, you can remove DML in the PR once the points above are addressed, and also once cross-fitting is implemented (back, actually) in the code.
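Since cross-fitting is mentioned as coming back, here is a minimal, generic cross-fitting sketch using scikit-learn. It is not the med_bench implementation; the function name and the logistic-regression nuisance model are illustrative choices only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_fitted_propensity(t, x, n_splits=2, random_state=0):
    """Out-of-fold estimates of P(T=1 | X): each nuisance model is fitted on
    one fold and only used to score the samples it has never seen."""
    t = np.asarray(t).ravel()
    x = np.asarray(x)
    out = np.zeros(t.shape[0], dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(x):
        clf = LogisticRegression().fit(x[train_idx], t[train_idx])
        out[test_idx] = clf.predict_proba(x[test_idx])[:, 1]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 3))
    t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
    print(cross_fitted_propensity(t, x)[:5])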