"Vectorize" ColumnGeneratorCachedByIndex

MischaPanch commented 8 months ago

The ColumnGeneratorCachedByIndex is recommended for new cached column generators, but it can be significantly slower than the not-recommended way of first creating a ColumnGenerator and then adding cache by wrapping with IndexCachedColumnGenerator.

The reason is that IndexCachedColumnGenerator will find all non-cached values and then process them at once (i.e., batch-wise), whereas the ColumnGeneratorCachedByIndex will always loop through all values. Thus, for an initial filling of the cache this can be much slower.

Not sure what to do here - one would need to redesign the ColumnGeneratorCachedByIndex to not use _generate_value, but that's a breaking change. Another way would be to write a new class a la VectorizedColumnGeneratorCachedByIndex, but I honestly feel like batch-wise processing of missing values should be the default behavior

opcode81 commented 8 months ago

You are assuming that a vectorised computation mechanism for the value generation exists. In many applications, this won't be the case, and the reason for "ColumnGeneratorCachedByIndex is encouraged" is that it provides a more friendly interface for the case where the computation mechanism operates on a per item/row basis, i.e. creating one value given one row of input.

I think what we need to change is primarily the recommendation, i.e. it should be qualified.

MischaPanch commented 8 months ago

Ok, I guess that's reasonable, since using the other way works well with vectorized columngens. Do you want to adjust the docstring? I feel like you usually don't like my style of writing too much ^^

opcode81 / sensAI

"Vectorize" ColumnGeneratorCachedByIndex #80