DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
I am working on a project where I am comparing a number of different estimators on an observational dataset to estimate the ATE of a binary treatment on a continuous outcome. I am comparing a naive estimate of Y ~ Treatment, a linear regression Y ~ T + X1 + X2 + X3 + X4 + X5, and DML.
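For concreteness, here is a minimal synthetic sketch of the naive vs adjusted comparison (a single confounder and a true ATE of 1.0 are assumed purely for illustration; my real data has X1..X5):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative synthetic data with one confounder; true ATE = 1.0.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))                      # confounder
p = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))         # treatment propensity depends on X
T = rng.binomial(1, p)                           # binary treatment
Y = 1.0 * T + 3.0 * X[:, 0] + rng.normal(size=n) # continuous outcome

# Naive estimate Y ~ T: confounded, biased away from the true ATE.
naive = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]

# Adjusted estimate Y ~ T + X: the coefficient on T recovers the ATE.
adjusted = LinearRegression().fit(np.column_stack([T, X]), Y).coef_[0]

print(f"naive ATE estimate:    {naive:.2f}")
print(f"adjusted ATE estimate: {adjusted:.2f}")
```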
My question concerns the justification of model validation for these methods, and whether it is required. It seems that for causal inference, cross-validation is generally not applied as model validation because of the bias it can induce. However, DML does have a way of incorporating it into its own estimation here. I am using the full dataset to fit both the linear regression estimator and the DML estimator, which internally implements CV. Is my assumption correct that CV induces bias, and that this is the reason it is not recommended as model validation? Is it valid to compare the estimators' CIs and use this, along with the refuters, for validation?
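To make concrete what I mean by DML incorporating CV internally: its cross-fitting uses out-of-fold nuisance predictions so the residuals are not contaminated by nuisance-model overfitting, which is a different role from using CV to validate the causal estimate. A minimal partialling-out sketch on synthetic data (the linear/logistic nuisance models and all names are illustrative stand-ins, not DoWhy's or EconML's internals):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_predict

# Illustrative synthetic data with one confounder; true ATE = 1.0.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))
p = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
T = rng.binomial(1, p)
Y = 1.0 * T + 3.0 * X[:, 0] + rng.normal(size=n)

# Cross-fitting: each observation's nuisance prediction comes from a model
# trained on the other folds, so the residuals below are out-of-sample.
m_hat = cross_val_predict(LinearRegression(), X, Y, cv=5)          # ~ E[Y|X]
e_hat = cross_val_predict(LogisticRegression(), X, T, cv=5,
                          method="predict_proba")[:, 1]            # ~ E[T|X]

# Final stage: regress the outcome residual on the treatment residual.
theta = np.sum((T - e_hat) * (Y - m_hat)) / np.sum((T - e_hat) ** 2)
print(f"cross-fitted DML ATE estimate: {theta:.2f}")
```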
Additional context
Link to the test-train split used in model validation: here
Thanks for your time.