Interaction between two categorical covariates sometimes switches order, causing error

martiningram commented 1 year ago

Hi formulaic developers,

Thank you very much for your work putting this great library together! I'm enjoying it quite a lot, but I recently ran into a bit of a thorny issue. I have a term in a formula that is C(frog_type):C(is_cargo). Both of these are categorical covariates, with two levels (stiff and movable for frog_type, and yes and no for is_cargo). Generally this works fine but occasionally creating the design matrix when predicting on new data throws an error:

formulaic.errors.FactorEncodingError: Term `C(frog_type):C(is_cargo)` has generated columns that are inconsistent with specification: generated ['C(is_cargo)[T.yes]', 'C(frog_type)[T.stiff]:C(is_cargo)[T.no]', 'C(frog_type)[T.stiff]:C(is_cargo)[T.yes]'], expecting ['C(is_cargo)[T.yes]', 'C(is_cargo)[T.no]:C(frog_type)[T.stiff]', 'C(is_cargo)[T.yes]:C(frog_type)[T.stiff]'].

What I think is going on here is that the two specifications are equivalent, and the factor simplification occasionally chooses one rather than the other. Unfortunately I am not sure how to fix it, but any workarounds / fixes would be much appreciated.
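One way to check that the two column lists in the error differ only in the order of factors within each interaction is the sketch below. It uses plain string handling rather than any formulaic API. The helper name `normalize_column` is hypothetical, and the naive split on `:` assumes no colon appears inside an individual factor name.

```python
def normalize_column(name: str) -> str:
    """Return the column name with its interaction factors in sorted order."""
    # Naive split: assumes ':' only separates factors and never appears
    # inside a factor name (true for the columns in the error above).
    return ":".join(sorted(name.split(":")))


# Column names copied from the FactorEncodingError message.
generated = [
    "C(is_cargo)[T.yes]",
    "C(frog_type)[T.stiff]:C(is_cargo)[T.no]",
    "C(frog_type)[T.stiff]:C(is_cargo)[T.yes]",
]
expected = [
    "C(is_cargo)[T.yes]",
    "C(is_cargo)[T.no]:C(frog_type)[T.stiff]",
    "C(is_cargo)[T.yes]:C(frog_type)[T.stiff]",
]

# The raw lists differ, but they match once factor order is ignored.
same_up_to_order = (
    [normalize_column(c) for c in generated]
    == [normalize_column(c) for c in expected]
)
print(generated == expected, same_up_to_order)
```

This only confirms the diagnosis (same columns, different factor order within each interaction); it does not work around the error itself.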

Thanks for your help! All the best, Martin

matthewwardrop commented 1 year ago

Hi Martin! Thanks for taking the time to report this bug.

This ought not to be possible because terms are ordered based on the formula, and the algorithm should in principle be deterministic after this. And I don't seem to be able to reproduce it. The following works fine for me:

import formulaic
import numpy
import pandas

for n in (10, 100, 1000):
    df = pandas.DataFrame({
        "frog_type": numpy.random.choice(["stiff", "movable"], n),
        "is_cargo": numpy.random.choice([True, False], n),
    })
    for i in range(1000):
        formulaic.model_matrix("C(frog_type):C(is_cargo)", df).model_spec.get_model_matrix(df)

Do you have an example you can share of when this breaks?

martiningram commented 1 year ago

Thanks Matthew for replying so quickly, I really appreciate it!

I'll have to work on an example; it's slightly tricky because I won't be able to share the actual data I'm working with.

One question: the error originally appeared using formulaic 0.5.2. I save a pickle file, restore it, and then do these computations. The error also persisted with the latest formulaic version and this pickle file, but I am now wondering whether that is because the pickle file was saved with the old version. Basically, my question is: do you think it could be that this was possible in 0.5.2 but has been fixed since?

matthewwardrop commented 1 year ago

Ah... yes. Version 0.6.0 changed the ordering of terms, as described in the release notes, so if you are using older model specs that will be problematic.

Also note that (as described in the docs here), reusing model specs from older versions of formulaic is not supported. In most cases, it will work fine, but there will be occasional issues which we do not plan to support at this time.
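Since model specs saved under one version may not load cleanly under another, one possible guard is to store the library version next to the pickled object and refuse to reuse it when the major.minor version differs. This is a minimal sketch using only the standard library, not something formulaic provides. The helper names `dump_with_version` and `load_with_version` are hypothetical, and the choice to compare only the first two version components is an assumption.

```python
import pickle


def dump_with_version(obj, library_version: str) -> bytes:
    """Pickle obj together with the library version that produced it."""
    return pickle.dumps({"library_version": library_version, "payload": obj})


def load_with_version(blob: bytes, current_version: str):
    """Unpickle and return the payload, failing fast on a version mismatch."""
    record = pickle.loads(blob)
    saved_version = record["library_version"]
    # Assumption: a change in major.minor (e.g. 0.5 -> 0.6) may change behaviour,
    # so only patch-level differences are tolerated.
    if saved_version.split(".")[:2] != current_version.split(".")[:2]:
        raise RuntimeError(
            f"Object was saved with version {saved_version} but is being "
            f"loaded with {current_version}; rebuild it instead of reusing it."
        )
    return record["payload"]


# Stand-in payload; in practice this would be the saved model spec.
blob = dump_with_version({"formula": "C(frog_type):C(is_cargo)"}, "0.5.2")
restored = load_with_version(blob, "0.5.3")  # patch-level difference is accepted
```

In practice the version strings would come from the installed package (for example `formulaic.__version__`, assuming that attribute is available), and on a mismatch the model spec would be rebuilt from the original formula and training data rather than loaded.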

If things consistently fail for the same formula/model spec, that is expected here. If it stochastically fails, then this is more concerning, since both 0.5.x and 0.6.x have stable (but different) term ordering strategies.

I'll close this one out for now, but feel free to reopen if you think there are any remaining issues not dealt with above!

And thanks again!

matthewwardrop / formulaic

Interaction between two categorical covariates sometimes switches order, causing error #146