skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

Add a FastText encoder #1047

Open · Vincent-Maladiere opened this issue 2 months ago

Vincent-Maladiere commented 2 months ago

Problem Description

When encoding long text entries on small datasets, https://arxiv.org/abs/2312.09634 showed that pretrained embeddings improve prediction performance over string-based encoders like MinHashEncoder. More recently, CARTE performed well using FastText to initialize the embeddings of column names and categories.

Feature Description

Create an encoder that downloads FastText weights, loads them during fit, and applies them during transform. Note that FastText dependencies are only ["pybind11>=2.2", "numpy"].
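A minimal sketch of what such an encoder could look like, assuming the `fasttext` package and its pretrained common-crawl vectors; the class name `FastTextEncoder` and its parameter are hypothetical, not an existing skrub API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

import fasttext
import fasttext.util


class FastTextEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: download FastText weights once, load them in
    ``fit``, and apply them in ``transform``."""

    def __init__(self, lang="en"):
        self.lang = lang

    def fit(self, X, y=None):
        # Download (if needed) and load the pretrained vectors,
        # e.g. cc.en.300.bin for lang="en".
        model_path = fasttext.util.download_model(self.lang, if_exists="ignore")
        self.model_ = fasttext.load_model(model_path)
        self.dim_ = self.model_.get_dimension()
        return self

    def transform(self, X):
        # X is a 1d iterable of strings (e.g. a pandas Series); newlines are
        # stripped because get_sentence_vector processes one line at a time.
        return np.vstack(
            [
                self.model_.get_sentence_vector(str(text).replace("\n", " "))
                for text in X
            ]
        )
```

With that interface, it could slot into pipelines the same way the existing string encoders do.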

Alternative Solutions

Instead, we could create a transformer using SentenceTransformer, which would download weights from the Hugging Face Hub. The issue is that although these models provide more powerful embeddings than FastText, this solution would require installing torch, transformers, and finally sentence-transformers. Also, running these models is markedly slower than FastText.
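For comparison, a short sketch of the SentenceTransformer route; the model name below is only an illustration, and installing the package pulls in torch and transformers:

```python
from sentence_transformers import SentenceTransformer

# Downloads the model weights from the Hugging Face Hub on first use.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["county of residence", "municipal fire department"])
print(embeddings.shape)  # (2, 384) for this particular model
```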

Additional Context

No response

GaelVaroquaux commented 2 months ago

The problem with fasttext is that you basically need to depend on fasttext, AFAIK, and it only provides this model.

I was more considering the SentenceTransformer way, which would provide many more options.

I'm open to discussion, of course :)

koaning commented 2 months ago

Isn't FastText archived at this point? This is why I dropped it in embetter.

https://github.com/facebookresearch/fastText

Vincent-Maladiere commented 2 months ago

@koaning as long as there is no numpy 3, we should be fine 😉

More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.

GaelVaroquaux commented 2 months ago

> More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.

Long term, I think it is important to implement the patterns in https://arxiv.org/abs/2312.09634, where "diverse entries" get encoded differently from "dirty categories".
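To make that concrete, one possible reading of that recipe (my interpretation, not a settled design): route long free-text "diverse entries" to an embedding-based encoder while keeping MinHashEncoder for short "dirty categories". The column names below and the `FastTextEncoder` from the sketch above are hypothetical:

```python
from sklearn.compose import ColumnTransformer
from skrub import MinHashEncoder

# Scalar (not list) column specs so each encoder receives a single 1d column.
encoder = ColumnTransformer(
    [
        # Long free-form text ("diverse entries"): dense embeddings.
        ("diverse_entries", FastTextEncoder(), "job_description"),
        # Short, high-cardinality "dirty categories": string hashing.
        ("dirty_categories", MinHashEncoder(), "employer_name"),
    ]
)
```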