vincentarelbundock / pymarginaleffects

Test analytic #87

Closed LamAdr closed 7 months ago

LamAdr commented 7 months ago

Translated from https://github.com/vincentarelbundock/marginaleffects/blob/main/inst/tinytest/test-analytic.R

The R version contains an MIT license, an acknowledgment that the code is copied from margins, and a reference to Matt Golder's website. Should we carry those over here?

Also, the log test fails. This is because self.model.model.exog_names is ['Intercept', 'np.log(x)'] and self.modeldata.columns is ['y', 'x']. So this line empties the variable names. I commented out the test, but maybe you'd prefer to remove it altogether.
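
To illustrate the mismatch, here is a minimal sketch; the two lists are the ones quoted above, and the intersection step is my assumption about what that line effectively does:

# illustrative sketch, not the actual pymarginaleffects code
exog_names = ["Intercept", "np.log(x)"]  # self.model.model.exog_names
columns = ["y", "x"]                     # self.modeldata.columns

# keeping only exog names that appear verbatim among the data columns
# drops the transformed term, leaving no variable names at all
variables = [name for name in exog_names if name in columns]
print(variables)  # []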

Please let me know if you want anything changed!

vincentarelbundock commented 7 months ago

Translated from https://github.com/vincentarelbundock/marginaleffects/blob/main/inst/tinytest/test-analytic.R

Fantastic!

The R version contains an MIT license, an acknowledgment that the code is copied from margins, and a reference to Matt Golder's website. Should we carry those over here?

Yes please

Also, the log test fails. This is because self.model.model.exog_names is ['Intercept', 'np.log(x)'] and self.modeldata.columns is ['y', 'x']. So this line empties the variable names. I commented out the test, but maybe you'd prefer to remove it altogether.

Are the numerical results changed, or is this just a labelling issue?

LamAdr commented 7 months ago

Are the numerical results changed, or is this just a labelling issue?

I'm not sure what you mean by numerical results, since slopes() does not return at all. In this example, np.log(x) is not accompanied by any plain variable from the data columns, so we get ValueError: There is no valid column name in 'variables'.
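
A minimal reproduction of that failure is below; the formula and variable names are illustrative rather than the exact setup translated from test-analytic.R:

import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from marginaleffects import slopes

# a model whose only regressor is a transformed column
N = 1000
dat = pl.DataFrame({"x": np.random.uniform(1, 2, size=N)})
dat = dat.with_columns(y=np.log(pl.col("x")) + np.random.normal(size=N))
mod = smf.ols("y ~ np.log(x)", dat.to_pandas()).fit()

try:
    slopes(mod)
except ValueError as e:
    print(e)  # There is no valid column name in 'variables'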

When such a variable is added to the formula, predictions() does return and seems to take np.log() into account (the results match those of R):

# assumes numpy as np, statsmodels.formula.api as smf, the marginaleffects
# datagrid/predictions functions, and mtcars loaded as a polars DataFrame
mod = smf.ols("mpg ~ am + np.log(hp)", data=mtcars.to_pandas()).fit()
nd = datagrid(newdata=mtcars, hp=[100, 200])
p = predictions(mod, newdata=nd)

slopes(), however, does not seem to take the log into account:

import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from marginaleffects import datagrid, slopes

# simulate data where y depends on log(x) but not on a
f = "y ~ a + np.log(x)"
N = 10000
dat = pl.DataFrame({"x": np.random.uniform(size=N), "a": np.random.uniform(size=N)})
dat = dat.with_columns(y=np.log(pl.col("x")) + np.random.normal(size=N))

mod = smf.ols(f, dat.to_pandas()).fit()
nd = datagrid(newdata=dat, x=range(1, 5))
res = slopes(mod, newdata=nd, by="a")
print(res)

outputs

┌─────┬──────┬──────────┬──────────┬───┬─────────┬───────┬─────────┬────────┐
│ x   ┆ Term ┆ Contrast ┆ Estimate ┆ … ┆ P(>|z|) ┆ S     ┆ 2.5%    ┆ 97.5%  │
│ --- ┆ ---  ┆ ---      ┆ ---      ┆   ┆ ---     ┆ ---   ┆ ---     ┆ ---    │
│ str ┆ str  ┆ str      ┆ str      ┆   ┆ str     ┆ str   ┆ str     ┆ str    │
╞═════╪══════╪══════════╪══════════╪═══╪═════════╪═══════╪═════════╪════════╡
│ 1   ┆ a    ┆ dY/dX    ┆ 0.005    ┆ … ┆ 0.885   ┆ 0.176 ┆ -0.0626 ┆ 0.0726 │
│ 2   ┆ a    ┆ dY/dX    ┆ 0.005    ┆ … ┆ 0.885   ┆ 0.177 ┆ -0.0626 ┆ 0.0726 │
│ 3   ┆ a    ┆ dY/dX    ┆ 0.005    ┆ … ┆ 0.885   ┆ 0.177 ┆ -0.0626 ┆ 0.0726 │
│ 4   ┆ a    ┆ dY/dX    ┆ 0.005    ┆ … ┆ 0.885   ┆ 0.177 ┆ -0.0626 ┆ 0.0726 │
└─────┴──────┴──────────┴──────────┴───┴─────────┴───────┴─────────┴────────┘

Here, 0.005 is the slope with respect to a. I'd be happy to investigate further and write up a nicer issue if that's valuable.
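
For reference, with y = b0 + b1*a + b2*log(x), the analytic slope with respect to x is b2 / x, so it should vary across the grid rather than collapsing into a single row for a. A quick check against the fitted coefficients (the "np.log(x)" label is whatever patsy assigns, so treat it as an assumption):

# expected dY/dX at each grid value of x, computed by hand from the OLS fit
b2 = mod.params["np.log(x)"]
for x_value in range(1, 5):
    print(x_value, b2 / x_value)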

LamAdr commented 7 months ago

@vincentarelbundock ready for review

vincentarelbundock commented 7 months ago

Looks fantastic, thanks!

Yes, please open a separate issue for the problem you identified, with a detailed example (maybe mostly copying over what you have here already).

Feel free to suggest a solution if you find one.