matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
347 stars 25 forks source link

Explicitly passed terms should be sorted like parsed terms. #114

Closed rchui closed 2 years ago

rchui commented 2 years ago

When Formula receives a string as input, it sorts the terms when preparing the items. This is inconsistent because none of Structured, List, or Set sort their terms. String formulas should likewise not be sorting their terms to maintain consistency.

matthewwardrop commented 2 years ago

Hi @rchui . Thanks for sharing your thoughts here.

Actually, the sorting of terms is a feature. Per the docs (such as they are, improvements are incoming):

Formula terms in Formulaic are always sorted first by the order of the interaction, and then alphabetically. In R and patsy, this second ordering is done in the order that columns were introduced to the formula (patsy additionally sorts by which fields are involved in the interactions). As a result formulas generated by formulaic with the same set of fields will always generate the same model matrix.

Technically, this is a feature of the FormulaParser. You could implement your own parser that didn't generate ordered terms if you were so inclined, but I think that would be a downgrade. If users provide explicit lists of terms (structured or not), we don't currently reorder the terms under the assumption that they have intentionally ordered the Term instances in a specific order. If anything, I would lean into reordering those terms as well rather than avoiding ordering at all.

As for Structured, List and Set, these are generic containers, and should not be making decisions around ordering. (And indeed Set does not have an order at all). I don't think there's a consistency argument here that extends to Formula instances.

Happy to discuss further if you think I'm missing something, but I'm going to close this one out for now. And thanks again!

rchui commented 2 years ago

If the intention is to keep re-ordering then I would also advocate for reordering in the Structured, List, and Set cases so that they remain consistent. It ensure that the model matrix is identical regardless of the type of input that was given.

matthewwardrop commented 2 years ago

@rchui Can you give me an example of where this is biting you? The thing that currently determines whether terms are ordered is nothing to do with the container (Structured, List, set); but whether the formula is being parsed or posited.

For example, the following are all equivalent and ordered:

Formula("d + b + a + 1")
Formula(Structured("d + b + a + 1"))
Formula(["d+b+a+1"])
Formula({"d+b+a+1"})

The resulting terms are ordered by the FormulaParser instance, and not by Formula.

If you manually specify the terms, for example:

Formula(["a", "b", "d", "1"])

Then the order will be what you input. I think I agree that these should/could be ordered also, it's just not a use-case I've currently thought too much about. I see no downside to sorting these (and marginal upside), so I can go ahead and get this implemented.

rchui commented 2 years ago

I agree the upside is marginal. It's not biting me so much as it feels like a poor user experience. When writing tests to account for the behavior, I would expect all input types to have similar behavior and the difference between a string vs container input seems arbitrary for sorting vs non-sorting behavior. It was more surprising, than anything else, that the behavior varied based on input type.