Add a `get_feature_names_out` method in BaseEmbetter

koaning / embetter

just a bunch of useful embeddings

https://koaning.github.io/embetter/

MIT License

Add a `get_feature_names_out` method in BaseEmbetter #105

Open Vincent-Maladiere opened 3 months ago

Vincent-Maladiere commented 3 months ago

This would allow the parent `_SetOutputMixin` of `TransformerMixin` to enable `set_output(transform=...)` with `"pandas"` or `"polars"` as the output container.

koaning commented 3 months ago

It can't hurt to add. When I started embetter this set_output stuff wasn't really out yet. I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.

koaning commented 3 months ago

The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.

Vincent-Maladiere commented 3 months ago

> I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.

Agreed, but sometimes having tensors in a dataframe can be useful for some operations (I'm saying that with skrub in mind in particular). Even if you can't interpret embeddings, when you use them within a heterogeneous dataframe, it's nice to keep the context of each column name and its specific dtype.

> The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.

Right, so caching during transform would make sense here.
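A rough sketch of that caching idea, under stated assumptions: `CachingEmbedder`, `_embed`, and the `n_features_out_` attribute are hypothetical names for illustration, and `_embed` is a placeholder for a real model call. Storing state inside `transform` is unusual for scikit-learn estimators, so treat this as one possible design rather than a settled one.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError


class CachingEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical embedder that only learns its output width at inference."""

    def fit(self, X, y=None):
        return self

    def _embed(self, X):
        # Placeholder for a model call; the width (3) is not known upfront.
        return np.array([[float(len(t)), float(t.count("a")), 1.0] for t in X])

    def transform(self, X):
        out = self._embed(X)
        # Cache the embedding width the first time we actually infer.
        self.n_features_out_ = out.shape[1]
        return out

    def get_feature_names_out(self, input_features=None):
        if not hasattr(self, "n_features_out_"):
            raise NotFittedError("Call transform once before asking for feature names.")
        return np.asarray([f"dim_{i}" for i in range(self.n_features_out_)], dtype=object)


emb = CachingEmbedder()
emb.transform(["cat", "banana"])
names = emb.get_feature_names_out()
print(names.tolist())
```

Because the names are requested only after `transform` has produced its output, the cached width should already be available by the time the output container is built.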

glebzhelezov commented 2 months ago

Hi, I'm a (relatively...) long-time user, first-time commenter.

> it is also pretty hard to interpret a single dimension of an embedding.

I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame. If there are multiple matryoshka embeddings per row, you could reduce all of their dimensions with one regex-defined column selection. (I know this is a stretch, but I have Jupyter notebooks where I've done worse.)

Is anyone working on this? If not, I am happy to work on the issue.
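To make the regex-based column selection concrete, here is a small pandas-only sketch. The `<prefix>_emb_<i>` naming scheme, the `title`/`body` prefixes, and the random data are assumptions for illustration; the actual column names would depend on how `get_feature_names_out` ends up being implemented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, full_dim = 3, 8

# Two embedded text columns, each expanded into `full_dim` numeric columns.
frames = []
for prefix in ("title", "body"):
    cols = [f"{prefix}_emb_{i}" for i in range(full_dim)]
    frames.append(pd.DataFrame(rng.normal(size=(n_rows, full_dim)), columns=cols))
df = pd.concat(frames, axis=1)

# Keep only the first 4 dimensions of every embedding with one regex.
truncated = df.filter(regex=r"_emb_[0-3]$")
print(truncated.columns.tolist())
```

Note that `DataFrame.filter` drops every column that does not match, so non-embedding columns would need to be included in the pattern or joined back afterwards.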

koaning commented 2 months ago

> I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame.

Would this use-case maybe be better served by having an estimator that you can add to the pipeline as a follow up? Something that can limit the number of columns going out?

Also, could you elaborate on the use-case here where you might have multiple features that you want to embed?
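As a sketch of what such a follow-up estimator could look like: `KeepFirstDims` and `FakeEmbedder` below are hypothetical classes written for this example, not part of embetter or scikit-learn, and the fake embedding is just character counts.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class FakeEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an embedding step with 4 output dimensions."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array(
            [[float(len(t)), float(t.count("e")), float(t.count("o")), 1.0] for t in X]
        )


class KeepFirstDims(TransformerMixin, BaseEstimator):
    """Hypothetical pipeline step that keeps only the first `n_dims` columns."""

    def __init__(self, n_dims=2):
        self.n_dims = n_dims

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)[:, : self.n_dims]


pipe = make_pipeline(FakeEmbedder(), KeepFirstDims(n_dims=2))
reduced = pipe.fit_transform(["hello", "embeddings"])
print(reduced.shape)
```

This keeps the truncation explicit in the pipeline and works on plain arrays, so it does not depend on dataframe output or column names.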

koaning commented 2 months ago

> Is anyone working on this? If not, I am happy to work on the issue.

I am open to it, sure, go for it :)