Is there a way to get the baseline value for categorical variables?

matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.

MIT License

345 stars 25 forks source link

Is there a way to get the baseline value for categorical variables? #169

Closed grst closed 11 months ago

grst commented 11 months ago

Hi,

I'm looking for a way to retreive the baseline value of categorical variables used in a formula, e.g.

import formulaic
import pandas as pd

data = pd.DataFrame({"celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6]})

# (1) Here, I want to retrieve 'a' for condition
design = formulaic.model_matrix('~ condition', data)

# (2) Here, I want to retreive 'b' for condition
design = formulaic.model_matrix('~ C(condition, contr.treatment(base="b"))', data)

# (3) Here, I want to retreive 'b' for condition as well
data["condition"] = pd.Categorical(data["condition"], categories = ["b", "a"], dtype = "category")
design = formulaic.model_matrix('~ condition', data)

I found design.model_spec.encoder_state["condition"][1]['categories'], but it doesn't work in case (2).

Is there a way to achieve this?

Additional context

We are trying to build a "mini language" to specify contrasts for linear models, inspired by glmGamPoi, e.g.

contrast = model.cond(condition='a') - model.cond(condition='b')

In case of multiple variables or interaction terms, we would like to fill the contrast vector for variables that are not specified in cond with the value that refers to the baseline level. See also https://github.com/scverse/multi-condition-comparisions/issues/15

CC @Zethson @const-ae

matthewwardrop commented 11 months ago

Hi @grst ,

Thanks for reaching out. It's an interesting question.

The encoder state has always been intended primarily as a way to ensure reproducibility, and not (as here) a way to introspect the state to then build out automation/tooling. The layout of this encoding state is also not guaranteed to remain fixed between major versions of Formulaic (it being considered an implementation detail); and so while changes at this point are relatively unlikely, you could find your code not working in the future. AND not all contrast encodings "leave one out" in a nice parsimonious way.

This is definitely an interesting problem, though. If you do not need support for arbitrary contrasts, you could override the C stateful transform with your own version that only supports dummy encoding, and that adds the base level to the encoder state. Would that work for you? That would side-step a lot of these issues. Otherwise, I can think a bit more about this.

Also, in case you were unaware, formulaic also offers tooling that could be used to generate contrast matrices (though it was not originally intended for that purpose). Likely it is not helpful for you here, since it uses a very different syntax, but... here you go anyway (following the conventions of https://matthewwardrop.github.io/formulaic/guides/contrasts/#how-does-contrast-coding-work):

>>> import numpy
>>> from formulaic.utils.constraints import LinearConstraints
>>> Z = LinearConstraints.from_spec("a, b-a, c-a", variable_names=["a", "b", "c"]).constraint_matrix
>>> C = numpy.linalg.inv(Z)
>>> Z, C
(array([[ 1.,  0.,  0.],
        [-1.,  1.,  0.],
        [-1.,  0.,  1.]]),
 array([[1., 0., 0.],
        [1., 1., 0.],
        [1., 0., 1.]]))

matthewwardrop commented 11 months ago

Given your emoticon response, I'll assume that the custom transform route best suits you... but feel free to reopen if you need more guidance or want me to revisit it!

grst commented 11 months ago

Thanks for the detailed response and sorry for not responding earlier! I'll check out your suggestion with the custom transform class and let you know in case I have further questions :)

grst commented 10 months ago

Hi @matthewwardrop,

I now have a follow-up question:

I got a PoC to work by passing the overridden C via context. This works well when the C is explicitly specified in the formula. However, this doesn't capture the variables that are implicitly treated as categorical (in this case condition). Is there a way to override the default behavior as well?

from formulaic.transforms import C
import formulaic
import pandas as pd

data = pd.DataFrame({"celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6]})

def custom_C(data, contrasts = None, *, levels = None, spans_intercept = True):
    print("variable", data.name)
    print("contrasts", contrasts)
    print("levels", levels)

    return C(data, contrasts, levels=levels, spans_intercept=spans_intercept)

formulaic.model_matrix("~condition + C(celltype, contr.treatment)", data=data, context = {"C": custom_C})

variable celltype
contrasts <class 'formulaic.transforms.contrasts.TreatmentContrasts'>
levels None
   Intercept  condition[T.b]  C(celltype, contr.treatment)[T.B]  C(celltype, contr.treatment)[T.C]
0        1.0               0                                  0                                  0
1        1.0               1                                  0                                  0
2        1.0               0                                  1                                  0
3        1.0               1                                  0                                  1
4        1.0               0                                  0                                  1
5        1.0               1                                  0                                  1

grst commented 10 months ago

Hi @matthewwardrop, sorry for the renewed ping - I was just afraid my message got lost in the Christmas break. Would be great to get your input on my last comment!

matthewwardrop commented 5 months ago

Hi @grst,

Apologies for the delay (I feel I'm always so apologising)... life is and remains pretty busy!

Yeah, catching the default categorisation is a little trickier. You can see the default implementation for pandas materializers here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/pandas.py#L95C9-L95C28 . You would have to subclass this.

If this is going to be very useful to you, then it might be to others also... so I'd be willing to extend the transform state with information about the baseline too. The amount of redundant information is relatively small, so I'm not worried from that perspective. If you still want this, let me know and I can add it in.

grst commented 5 months ago

Hi @matthewwardrop,

Apologies for the delay (I feel I'm always so apologising)... life is and remains pretty busy!

I know what you are talking about, so no worries ;)

In the meanwhile, I found a solution that indeed involves subclassing the Python materializer. By hooking into _encode_evaled_factor and _flatten_encoded_evaled_factor, I was able to exfiltrate, amongst others, the drop_field, categories and base category: https://github.com/scverse/pertpy/blob/74914a92b5ab49de76f47cde3f262ed8d87e2829/pertpy/tools/_differential_gene_expression/_formulaic.py#L94-L167

Maybe you could take a quick look and let me know if you could envisage any problems that might arise due to future updates of formulaic?

Of course it would be cool if there was a more straightforward solution, but I can also live with what I have now. At least it passes all my test cases.