Open bnjhng opened 1 year ago
Hi @bnjhng. Thanks for reaching out.
In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?
In terms of the right solution to this, your approach isn't crazy! It likely misses some edge-cases, but it probably works in the majority of cases. I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
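For illustration, the cast mentioned above can be sketched as follows. This is a minimal sketch rather than formulaic's implementation; `as_frame` is a hypothetical helper name, and the sample `data` is made up for the example.

```python
import numpy as np
import pandas as pd


def as_frame(data):
    """Return `data` unchanged if it is already a DataFrame; otherwise cast it."""
    if isinstance(data, pd.DataFrame):
        return data
    # pandas builds one column per dict key; all arrays must share the same length.
    return pd.DataFrame(data)


# Made-up sample input: a dictionary of numpy arrays.
data = {"a": np.array([1, 2, 3]), "b": np.array(["x", "y", "z"])}
frame = as_frame(data)
```

The resulting `frame` can then be handled like any other pandas DataFrame. Building the DataFrame has some overhead, which matters if the data is deliberately kept as plain arrays.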
> In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?
Thanks for linking it! Somehow I missed that section as I was going through the docs.
> I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
We have found that, for operations that do not involve row indexing (i.e., the vast majority of data transformations), working with a dictionary of numpy arrays is very often faster than working with a pandas DataFrame.
> It likely misses some edge-cases, but it probably works in the majority of cases.
This is indeed our conclusion! The main limitation we have encountered so far is with categorical encoding. And we have isolated the main issue to the following:
>>> from formulaic.materializers.types import FactorValues
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # while this works as expected:
>>> print(pd.Series(FactorValues(pd.Series(["a", "b", "c"]))))
0 a
1 b
2 c
dtype: object
>>> # this doesn't give the expected results:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"]))))
0 abc
1 bc
2 c
dtype: object
>>> # and this straight up errors out:
>>> print(pd.Series(FactorValues(np.array([1, 2, 3]))))
TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got FactorValues)
The docstring of FactorValues says that it is:
> A convenience wrapper that surfaces a FactorValuesMetadata instance at <object>.__formulaic_metadata__. This wrapper can otherwise wrap any object and behaves just like that object.
But clearly, in the case of numpy.ndarray, the wrapper doesn't behave just like numpy.ndarray. To get the code above to work properly, one option is to explicitly access the __wrapped__ attribute:
>>> # both of the following work as expected:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"])).__wrapped__))
0 a
1 b
2 c
dtype: object
>>> print(pd.Series(FactorValues(np.array([1, 2, 3])).__wrapped__))
0 1
1 2
2 3
dtype: int64
However, there are places in the formulaic code that simply do pandas.Series(data) instead of pandas.Series(data.__wrapped__), for example here and here.
Would it be possible to fix this limitation with formulaic?
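As a possible caller-side workaround, values can be unwrapped before they are handed to pandas. The sketch below assumes, as the examples above suggest, that FactorValues exposes the underlying object as `__wrapped__`. Both `unwrap` and `StandInProxy` are hypothetical names; the stand-in class exists only so that the sketch runs without formulaic installed.

```python
from typing import Any


def unwrap(values: Any) -> Any:
    """Return the wrapped object if `values` is a proxy, else `values` itself."""
    # Proxy objects expose the underlying object as `__wrapped__`;
    # plain lists and arrays have no such attribute and pass through unchanged.
    return getattr(values, "__wrapped__", values)


class StandInProxy:
    """Hypothetical stand-in for a FactorValues-like wrapper (not formulaic code)."""

    def __init__(self, wrapped: Any) -> None:
        self.__wrapped__ = wrapped


raw = [1, 2, 3]
plain = unwrap(StandInProxy(raw))   # proxy: returns the wrapped list
untouched = unwrap(raw)             # non-proxy: returned as-is
```

With formulaic installed, `pd.Series(unwrap(FactorValues(np.array([1, 2, 3]))))` should then be equivalent to the explicit `__wrapped__` calls shown above.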
Context: My team uses patsy heavily. One aspect of patsy that makes it great for our use cases is that it isn't strict about its input types. For example, patsy works equally well when given pd.DataFrame({"a": np.array([1,2,3])}) as when given {"a": np.array([1,2,3])}, and we have use cases for storing data as a dictionary of numpy arrays.

Problem: Currently, we are in the process of replacing patsy with formulaic in our workflow, and we encountered a problem because, unlike patsy, formulaic throws an error if we pass in a dictionary of numpy arrays the way we did when using patsy. The error we get is FormulaMaterializerNotFoundError: No materializer has been registered for input type 'builtins.dict'. However, we noticed that formulaic's docs mention:

So we suspect that the lack of support for builtins.dict isn't a fundamental limitation of formulaic but rather something we need to provide if we want to use formulaic the way we have been using patsy.

Attempted solution: We did some digging into materializers and were able to hack together something that works (i.e., passes all our existing unit tests) without really understanding what materializers are for and how they work. In essence, we created a custom DictMaterializer and passed that into model_matrix like this (where data in this example is a dictionary of numpy arrays):

And we defined DictMaterializer thusly:

We recognize that our solution is probably not ideal for 3 reasons:
1. We subclassed PandasMaterializer instead of FormulaMaterializer. We did this to get something working quickly.
2. We only overloaded the _combine_columns method while totally ignoring all the other ones implemented by PandasMaterializer. Again, _combine_columns was the only method we had to overload to get our unit tests to pass.
3. We always use na_action="ignore", and so in our implementation of _combine_columns, the drop_rows input is completely ignored.

Help needed: We are reaching out in this forum to see if we are on the right track, and if so, what we can do to get our solution to a more ideal state. It would also be very helpful if we could get a summary/explanation of what materializers are for as well as how they are meant to work.

@matthewwardrop