Closed rchui closed 2 years ago
Hi @rchui . Thanks for sharing your thoughts here.
Actually, the sorting of terms is a feature. Per the docs (such as they are, improvements are incoming):
Formula terms in Formulaic are always sorted first by the order of the interaction, and then alphabetically. In R and patsy, this second ordering is done in the order that columns were introduced to the formula (patsy additionally sorts by which fields are involved in the interactions). As a result formulas generated by formulaic with the same set of fields will always generate the same model matrix.
Technically, this is a feature of the FormulaParser
. You could implement your own parser that didn't generate ordered terms if you were so inclined, but I think that would be a downgrade. If users provide explicit lists of terms (structured or not), we don't currently reorder the terms under the assumption that they have intentionally ordered the Term
instances in a specific order. If anything, I would lean into reordering those terms as well rather than avoiding ordering at all.
As for Structured
, List
and Set
, these are generic containers, and should not be making decisions around ordering. (And indeed Set
does not have an order at all). I don't think there's a consistency argument here that extends to Formula
instances.
Happy to discuss further if you think I'm missing something, but I'm going to close this one out for now. And thanks again!
If the intention is to keep re-ordering then I would also advocate for reordering in the Structured
, List
, and Set
cases so that they remain consistent. It ensure that the model matrix is identical regardless of the type of input that was given.
@rchui Can you give me an example of where this is biting you? The thing that currently determines whether terms are ordered is nothing to do with the container (Structured
, List
, set
); but whether the formula is being parsed or posited.
For example, the following are all equivalent and ordered:
Formula("d + b + a + 1")
Formula(Structured("d + b + a + 1"))
Formula(["d+b+a+1"])
Formula({"d+b+a+1"})
The resulting terms are ordered by the FormulaParser
instance, and not by Formula
.
If you manually specify the terms, for example:
Formula(["a", "b", "d", "1"])
Then the order will be what you input. I think I agree that these should/could be ordered also, it's just not a use-case I've currently thought too much about. I see no downside to sorting these (and marginal upside), so I can go ahead and get this implemented.
I agree the upside is marginal. It's not biting me so much as it feels like a poor user experience. When writing tests to account for the behavior, I would expect all input types to have similar behavior and the difference between a string vs container input seems arbitrary for sorting vs non-sorting behavior. It was more surprising, than anything else, that the behavior varied based on input type.
When Formula receives a string as input, it sorts the terms when preparing the items. This is inconsistent because none of
Structured
,List
, orSet
sort their terms. String formulas should likewise not be sorting their terms to maintain consistency.