Open hguturu opened 1 year ago
I think it would be possible to easily add a format argument to the C() function, and the resulting formulae would look something like:
C(A, format="{variable}:{value}")
But presently the "variable" argument would be the entire C(A, format="{variable}:{value}"), not A. You could potentially fix this, but in principle you could encode A differently multiple times in the same formula... so I'm not sure yet whether this approach is worth pursuing.
Can you suggest a syntax that would make sense for you so we can further evaluate this?
Good point. I was coming more from the perspective of having easier to handle variable names.
e.g.,
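import formulaic
import pandas
import statsmodels.api as sm
all_phenotypes = pandas.DataFrame({"BinGrp": [0, 0, 0, 1, 1, 1]})  # minimal stand-in, inferred from the output shown later in this comment
design = formulaic.model_matrix(["C(BinGrp, contr.treatment)"], all_phenotypes)
model = sm.OLS([1,2,3,1,2,3], design).fit()
model.summary()
model.t_test("C(BinGrp, contr.treatment)[T.1] - C(BinGrp, contr.treatment)[T.0]") # impressively works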
But, a little cumbersome to do.
Similarly if you had multiple encodings e.g.,
design = formulaic.model_matrix(["C(BinGrp, contr.treatment) + poly(BinGrp) + exp(BinGrp)"], all_phenotypes)
   C(BinGrp, contr.treatment)[T.0]  C(BinGrp, contr.treatment)[T.1]  poly(BinGrp)[1]  exp(BinGrp)
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282
Then I think you would specify a format for each one?
Using your suggested syntax:
C(BinGrp, contr.treatment) -> C(BinGrp, contr.treatment, format="{variable}:{value}")
poly(BinGrp) -> poly(BinGrp, format="poly_{variable}_{value}")
exp(BinGrp) -> exp(BinGrp, format="{variable}") # e.g. you just want the value transformed but keep the name (silly transform)
   BinGrp:0  BinGrp:1  poly_BinGrp_1    BinGrp
0         1         0      -0.408248  1.000000
1         1         0      -0.408248  1.000000
2         1         0      -0.408248  1.000000
3         0         1       0.408248  2.718282
4         0         1       0.408248  2.718282
5         0         1       0.408248  2.718282
If format is not provided it falls back to the default?
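To make the proposed semantics concrete, here is a rough sketch of the fallback behaviour in plain Python. column_name and DEFAULT_FORMAT are hypothetical names used only for illustration, and the default pattern is a guess modelled on names like poly(BinGrp)[1] above; nothing here is formulaic's actual implementation.

```python
DEFAULT_FORMAT = "{variable}[{value}]"  # assumed default, for illustration only


def column_name(variable, value=None, fmt=None):
    """Build a column name, falling back to a default when fmt is None."""
    if fmt is None:
        # No format given: single-column terms keep their bare name.
        if value is None:
            return variable
        return DEFAULT_FORMAT.format(variable=variable, value=value)
    # str.format ignores unused keyword arguments, so "{variable}" alone works.
    return fmt.format(variable=variable, value=value)


names = [
    column_name("BinGrp", 0, fmt="{variable}:{value}"),
    column_name("BinGrp", 1, fmt="{variable}:{value}"),
    column_name("BinGrp", 1, fmt="poly_{variable}_{value}"),
    column_name("BinGrp", fmt="{variable}"),
    column_name("exp(BinGrp)"),  # no fmt: falls back to the default naming
]
print(names)
```

Note that the sketch is handed the bare variable name explicitly. In a real formula the term only knows the full expression, such as C(BinGrp, contr.treatment), which is exactly the difficulty raised at the top of the thread.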
Hmmmm... adding format arguments to every method is not really viable (we are just proxying numpy methods, and this wouldn't work for aliasing variables outside of a function call). We could obviously wrap these methods, but I'm not convinced this is a good idea.
After reflecting more on this, I think sensible (non-mutually exclusive) ways forward might include:
(1) Adding format strings to categorical features to allow overriding the naming of columns combined with their levels, e.g.: C(X, fmt='{variable}.{level}')
(2) Aliasing within the formula, e.g.: y ~ ("my_name":=C(X, fmt='...'))
(3) Aliasing outside the formula, by creating the variable first, e.g.:
import pandas
from formulaic import model_matrix
from formulaic.transforms import C
data = pandas.DataFrame({"X": ['a', 'b', 'c']})
my_var = C(data.X)
model_matrix("y ~ my_var", data)
I think I am leaning toward (1) and (3). I would consider implementing within-formula aliasing if there were enough demand for it... but remain unconvinced at present.
I wasn't aware of 3. I tried it and it almost works, but the value var is still formatted differently. e.g.
my_var = C(data.X)
model_matrix("~ my_var", data)
   Intercept  my_var[T.b]  my_var[T.c]
0        1.0            0            0
1        1.0            1            0
2        1.0            0            1
But, I was digging into the code a little bit and I realized there may be a simple enough way to get what is desired (although perhaps not stable across versions due to not being a "blessed" API).
import pandas
from formulaic import model_matrix
import formulaic
data = pandas.DataFrame({"X": ['a', 'b', 'c']})
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~C(X)", data)
   Intercept  C(X).b  C(X).c
0        1.0       0       0
1        1.0       1       0
2        1.0       0       1
This is almost the desired output. The ~C(X) is still being stored in the name. But, perhaps there is a similar hack for this as well? If I can track down where the name is being set.
I could do
from formulaic.transforms import C
my_var = C(data.X)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~my_var", data)
   Intercept  my_var.b  my_var.c
0        1.0         0         0
1        1.0         1         0
2        1.0         0         1
and that gets me exactly what is needed, but it requires knowing which variables in the formula are contrast-coded, which means parsing the formula.
By chance, is there a similar format constant I can play with to get the formatting needed without an official format support?
It already works when I don't explicitly ask for a contrast coding, but instead convert the values to strings.
from formulaic.transforms import C
data.X = data.X.astype(str)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~ X", data)
   Intercept  X.b  X.c
0        1.0    0    0
1        1.0    1    0
2        1.0    0    1
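Since FACTOR_FORMAT is a class attribute, patching it changes column naming globally for the rest of the session. One way to limit that is to wrap the patch in a context manager that restores the original value afterwards. A minimal sketch: temporary_attr is a hypothetical helper, and FakeContrasts is a stand-in class whose default format is assumed from the my_var[T.b] output above. In practice the target would be the TreatmentContrasts class named earlier, which, as noted, is not a blessed API and may change between versions.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attr(obj, name, value):
    """Temporarily set obj.<name> to value, restoring the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs even if the body raises, so the patch never leaks.
        setattr(obj, name, original)


class FakeContrasts:
    # Stand-in for the real contrasts class; this default is an assumption.
    FACTOR_FORMAT = "{name}[T.{field}]"


with temporary_attr(FakeContrasts, "FACTOR_FORMAT", "{name}.{field}"):
    inside = FakeContrasts.FACTOR_FORMAT.format(name="X", field="b")

after = FakeContrasts.FACTOR_FORMAT.format(name="X", field="b")
print(inside, after)
```

The model_matrix call would go inside the with block, so that only that one matrix picks up the overridden naming.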
Currently they get formatted as C({parameter})[T.{value}] or {parameter}[T.{value}] if it's already a string. It would be nice if we could pass in a format string to get simpler names, e.g. BinGrp0, BinGrp1 if we pass in a format string like "{parameter}{value}".
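Until a format argument exists, the names can be simplified after the matrix is built. The following is only a sketch: simplify_column is a hypothetical helper (not part of formulaic), and its regular expression assumes the two naming patterns quoted above, optionally with extra arguments inside C(...). Anything that does not match is left untouched.

```python
import re

# Matches "C(name)[T.value]", "C(name, extra args)[T.value]" or "name[T.value]".
_TREATMENT_COLUMN = re.compile(
    r"^(?:C\((?P<cvar>\w+)(?:\s*,[^)]*)?\)|(?P<var>\w+))\[T\.(?P<value>[^\]]+)\]$"
)


def simplify_column(name, fmt="{parameter}{value}"):
    """Rewrite a treatment-coded column name using fmt; pass others through."""
    match = _TREATMENT_COLUMN.match(name)
    if match is None:
        return name  # e.g. "Intercept" or "poly(BinGrp)[1]"
    parameter = match.group("cvar") or match.group("var")
    return fmt.format(parameter=parameter, value=match.group("value"))


columns = [
    "Intercept",
    "C(BinGrp, contr.treatment)[T.0]",
    "C(BinGrp, contr.treatment)[T.1]",
    "X[T.b]",
    "poly(BinGrp)[1]",
]
renamed = [simplify_column(c) for c in columns]
print(renamed)
```

For a pandas design matrix, design.rename(columns=simplify_column) would apply this to every column. Whether formulaic's model-matrix metadata survives such a rename is not something this thread establishes.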
Moved from https://github.com/matthewwardrop/formulaic/issues/46#issuecomment-1741422950