Open MischaPanch opened 8 months ago
You are assuming that a vectorised computation mechanism for the value generation exists. In many applications, this won't be the case, and the reason for "ColumnGeneratorCachedByIndex is encouraged" is that it provides a more friendly interface for the case where the computation mechanism operates on a per item/row basis, i.e. creating one value given one row of input.
I think what we need to change is primarily the recommendation, i.e. it should be qualified.
Ok, I guess that's reasonable, since using the other way works well with vectorized columngens. Do you want to adjust the docstring? I feel like you usually don't like my style of writing too much ^^
The
ColumnGeneratorCachedByIndex
is recommended for new cached column generators, but it can be significantly slower than the not-recommended way of first creating a ColumnGenerator and then adding cache by wrapping withIndexCachedColumnGenerator
.The reason is that
IndexCachedColumnGenerator
will find all non-cached values and then process them at once (i.e., batch-wise), whereas theColumnGeneratorCachedByIndex
will always loop through all values. Thus, for an initial filling of the cache this can be much slower.Not sure what to do here - one would need to redesign the
ColumnGeneratorCachedByIndex
to not use_generate_value
, but that's a breaking change. Another way would be to write a new class a laVectorizedColumnGeneratorCachedByIndex
, but I honestly feel like batch-wise processing of missing values should be the default behavior