Open Vincent-Maladiere opened 3 months ago
It can't hurt to add. When I started embetter this set_output stuff wasn't really out yet. I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.
The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.
I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.
Agreed, but sometimes having tensors in a dataframe can be useful for some operations (I'm saying that with skrub in mind in particular). Even if you can't interpret embeddings, when you use them within a heterogeneous dataframe, it's nice to keep the context of each column name and specific dtypes.
The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.
Right, so caching during transform would make sense here.
Hi, I'm a (relatively...) long-time user, first time commenter.
it is also pretty hard to interpret a single dimension of an embedding.
I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame. If there are multiple matryoshka embeddings per row, you could reduce all of their dimensions with one regex-defined column selection. (I know this is a stretch, but I have Jupyter notebooks where I've done worse.)
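A rough sketch of the regex-based selection described above, using plain pandas. The column naming scheme (title_emb_0, body_emb_0, ...) and the price column are assumptions made up for the example; nothing in this thread fixes what the output columns would actually be called.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two hypothetical 8-dimensional embeddings per row, plus one ordinary column.
emb_cols = [f"{field}_emb_{i}" for field in ("title", "body") for i in range(8)]
df = pd.DataFrame(rng.normal(size=(3, 16)), columns=emb_cols)
df["price"] = [1.0, 2.0, 3.0]

# Keep every non-embedding column, plus dimensions 0-3 of each embedding.
# The first alternative matches names that do not contain "_emb_" at all.
truncated = df.filter(regex=r"^(?!.*_emb_)|_emb_[0-3]$")
```

This only works as written for single-digit dimension indices; wider embeddings would need a different pattern or a numeric comparison on the parsed index.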
Is anyone working on this? If not, I am happy to work on the issue.
I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame.
Would this use-case maybe be better served by having an estimator that you can add to the pipeline as a follow up? Something that can limit the number of columns going out?
Also, could you elaborate on the use-case here where you might have multiple features that you want to embed?
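To make the follow-up estimator idea concrete, here is a minimal sketch of a transformer that limits the number of columns going out. KeepFirstColumns and n_keep are hypothetical names, not something that exists in embetter or scikit-learn, and the input array is a stand-in for real embeddings.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class KeepFirstColumns(TransformerMixin, BaseEstimator):
    """Hypothetical pipeline step that keeps only the first n_keep columns."""

    def __init__(self, n_keep=2):
        self.n_keep = n_keep

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        # Slice off the leading dimensions of each embedding row.
        return np.asarray(X)[:, : self.n_keep]


embeddings = np.arange(12.0).reshape(3, 4)  # stand-in for real embeddings
shortened = KeepFirstColumns(n_keep=2).fit_transform(embeddings)
```

A step like this could sit right after the embedding step in a pipeline, so the truncation happens on the array itself rather than through dataframe column names.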
Is anyone working on this? If not, I am happy to work on the issue.
I am open to it, sure, go for it :)
This will allow the parent _SetOutputMixin of TransformerMixin to enable set_output(transform={"pandas", "polars"}).