matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Allow formatting the categorical encoded variables #158

Open hguturu opened 1 year ago

hguturu commented 1 year ago

Currently they get formatted as C({parameter})[T.{value}], or as {parameter}[T.{value}] if the column is already a string. E.g.,

BinGrp = [0, 0, 0, 1, 1, 1]
becomes
   C(BinGrp)[T.0]  C(BinGrp)[T.1]
0               1               0
1               1               0
2               1               0
3               0               1
4               0               1
5               0               1

It would be nice if we could pass in a format string to get simpler names. E.g. BinGrp0, BinGrp1 if we pass in a format string like "{parameter}{value}"
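
Until something like this exists, the names can be rewritten after the fact. The following is a minimal sketch, not part of formulaic: `simplify_name` and its regular expression are hypothetical helpers, and the pattern is inferred only from the default names shown above, so anything it does not recognise is returned unchanged.

```python
import re

# Hypothetical pattern inferred from the default names shown above, e.g.
# "C(BinGrp)[T.0]", "C(BinGrp, contr.treatment)[T.1]" or "BinGrp[T.0]".
# It is not formulaic's own parser.
_ENCODED = re.compile(
    r"^(?:C\((?P<wrapped>[^,)]+)[^)]*\)|(?P<bare>[^\[\]()]+))"
    r"\[T\.(?P<value>[^\]]+)\]$"
)


def simplify_name(column, fmt="{parameter}{value}"):
    """Rewrite one encoded column name with a user-supplied format string."""
    match = _ENCODED.match(column)
    if match is None:
        # Not a recognised categorical column (e.g. "Intercept"): leave it alone.
        return column
    parameter = (match.group("wrapped") or match.group("bare")).strip()
    return fmt.format(parameter=parameter, value=match.group("value"))


names = ["C(BinGrp)[T.0]", "C(BinGrp)[T.1]", "Intercept"]
renamed = [simplify_name(name) for name in names]
```

Because pandas' `DataFrame.rename(columns=...)` accepts a callable, passing `simplify_name` there should apply the same rewrite to a whole design matrix.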

Moved from https://github.com/matthewwardrop/formulaic/issues/46#issuecomment-1741422950

matthewwardrop commented 1 year ago

I think it would be possible to easily add a format argument to the C() function; and the resulting formulae would look something like:

C(A, format="{variable}:{value}")

But presently the "variable" argument would be the entire C(A, format="{variable}:{value}"), not A. You could potentially fix this, but in principle you could encode A differently multiple times in the same formula... so I'm not sure yet whether this approach is worth pursuing.

Can you suggest a syntax that would make sense for you so we can further evaluate this?

hguturu commented 1 year ago

Good point. I was coming more from the perspective of having easier to handle variable names.

e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment)"], all_phenotypes)

model = sm.OLS([1,2,3,1,2,3], design).fit()
model.summary()

model.t_test("C(BinGrp, contr.treatment)[T.1] - C(BinGrp, contr.treatment)[T.0]") # impressively works

But, a little cumbersome to do.

Similarly if you had multiple encodings e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment) + poly(BinGrp) + exp(BinGrp)"], all_phenotypes)
   C(BinGrp, contr.treatment)[T.0]  C(BinGrp, contr.treatment)[T.1]  poly(BinGrp)[1]  exp(BinGrp)
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282

Then I think you would specify a format for each one?

Using your suggested syntax:

C(BinGrp, contr.treatment) -> C(BinGrp, contr.treatment, format="{variable}:{value}")
poly(BinGrp) -> poly(BinGrp, format="poly_{variable}_{value}")
exp(BinGrp) -> exp(BinGrp, format="{variable}") # e.g. you just want the value transformed but keep the name (silly transform)

   BinGrp:0  BinGrp:1  poly_BinGrp_1    BinGrp
0         1         0      -0.408248  1.000000
1         1         0      -0.408248  1.000000
2         1         0      -0.408248  1.000000
3         0         1       0.408248  2.718282
4         0         1       0.408248  2.718282
5         0         1       0.408248  2.718282

If format is not provided it falls back to the default?
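
The fallback rule being asked about can be sketched in a few lines. This is only an illustration of the proposed behaviour, not formulaic code: `column_name` and `DEFAULT_FORMAT` are hypothetical names, and the default template is simply chosen to mirror the `[T.{value}]` names shown earlier.

```python
# Hypothetical default, chosen to mirror the names shown earlier in the thread.
DEFAULT_FORMAT = "{variable}[T.{value}]"


def column_name(variable, value, format=None):
    """Name a column, using the default template when no format is given."""
    template = DEFAULT_FORMAT if format is None else format
    # str.format ignores unused keyword arguments, so a template such as
    # "{variable}" (no "{value}" placeholder) is also accepted.
    return template.format(variable=variable, value=value)


default_name = column_name("BinGrp", 0)
custom_name = column_name("BinGrp", 0, format="{variable}:{value}")
name_only = column_name("BinGrp", 1, format="{variable}")
```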

matthewwardrop commented 1 year ago

Hmmmm... adding format arguments to every method is not really viable (we are just proxying numpy methods, and this wouldn't work for aliasing variables outside of a function call). We could obviously wrap these methods, but I'm not convinced this is a good idea.

After reflecting more on this, I think sensible (non-mutually exclusive) ways forward might include:

  1. Adding support for format strings to categorical features to allow overriding the naming of columns combined with their levels, e.g.: C(X, fmt='{variable}.{level}')
  2. Adding an aliasing operator along the lines of y ~ ("my_name" := C(X, fmt='...'))
  3. Better documenting the existing aliasing functionality:

    import pandas
    from formulaic import model_matrix
    from formulaic.transforms import C

    data = pandas.DataFrame({"X": ['a', 'b', 'c']})

    # my_var is resolved from the calling scope; the toy frame has no "y"
    # column, so only the right-hand side of the formula is given here.
    my_var = C(data.X)
    model_matrix("~ my_var", data)



I think I am leaning toward (1) and (3). I would consider implementing within formula aliasing if there were enough demand for it... but remain unconvinced at present.

hguturu commented 1 year ago

I wasn't aware of (3). I tried it and it almost works, but the value part of the name is still formatted the default way, e.g.

my_var = C(data.X)
model_matrix("~ my_var", data)

   Intercept  my_var[T.b]  my_var[T.c]
0        1.0            0            0
1        1.0            1            0
2        1.0            0            1

But, I was digging into the code a little bit and I realized there may be a simple enough way to get what is desired (although perhaps not stable across versions due to not being a "blessed" API).

import pandas
from formulaic import model_matrix
import formulaic

data = pandas.DataFrame({"X": ['a', 'b', 'c']})
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~C(X)", data)

   Intercept  C(X).b  C(X).c
0        1.0       0       0
1        1.0       1       0
2        1.0       0       1

This is almost the desired output. The C(X) wrapper is still being stored in the name. But perhaps there is a similar hack for this as well, if I can track down where the name is being set.

I could do

from formulaic.transforms import C
my_var = C(data.X)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~my_var", data)

   Intercept  my_var.b  my_var.c
0        1.0         0         0
1        1.0         1         0
2        1.0         0         1

and that gets me exactly what is needed, but it requires knowing which variables in the formula are contrast-coded, which means parsing the formula.

By chance, is there a similar format constant I can play with to get the formatting needed without an official format support?

It already works when I don't explicitly ask for contrast coding and instead convert the values to strings.

from formulaic.transforms import C
data.X = data.X.astype(str)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~ X", data)

   Intercept  X.b  X.c
0        1.0    0    0
1        1.0    1    0
2        1.0    0    1