Vincent-Maladiere opened this issue 2 months ago
The problem with fasttext is that you basically need to depend on fasttext, AFAIK, and it only provides this model.
I was more considering the SentenceTransformer way, which would provide much more options.
I'm open to discussion, of course :)
Isn't FastText archived at this point? This is why I dropped it in embetter.
@koaning as long as there is no numpy 3, we should be fine 😉
More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.
Long term, I think it is important to implement the patterns in https://arxiv.org/abs/2312.09634, where "diverse entries" get encoded differently than "dirty categories".
### Problem Description

When encoding long text on small datasets, https://arxiv.org/abs/2312.09634 has shown that embeddings improve prediction performance over string models like `MinHashEncoder`. More recently, CARTE performed well using FastText to initialize column-name and category embeddings.

### Feature Description

Create an encoder that downloads FastText weights, loads them during `fit`, and applies them during `transform`. Note that FastText's only dependencies are `["pybind11>=2.2", "numpy"]`.

### Alternative Solutions

Instead, we could create a transformer using `SentenceTransformer`, which would download weights from HuggingFace. The issue is that although these models provide more powerful embeddings than FastText, this solution would require installing `torch`, `transformers`, and finally `sentence-transformers`. Also, running these models is markedly slower than using FastText.

### Additional Context

No response
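To make the feature request concrete, here is a minimal sketch of what such an encoder could look like. The class name `FastTextEncoder` and the `model_loader` parameter are hypothetical, not an existing API; the real `fasttext` package exposes `fasttext.load_model(path)` and `model.get_sentence_vector(text)`, so a loader like `lambda: fasttext.load_model("cc.en.300.bin")` could be plugged in. The weight loading is injected so that the large download stays outside the class and the sketch runs stand-alone:

```python
# Hypothetical sketch of a FastText-backed encoder (names are
# illustrative, not an existing skrub API). Weights are loaded in
# fit() and each string is mapped to its sentence vector in
# transform(), scikit-learn style.
import numpy as np


class FastTextEncoder:
    """Encode string columns with pretrained FastText vectors.

    Parameters
    ----------
    model_loader : callable
        Returns an object with a ``get_sentence_vector(str) -> np.ndarray``
        method, e.g. ``lambda: fasttext.load_model("cc.en.300.bin")``.
    """

    def __init__(self, model_loader):
        self.model_loader = model_loader

    def fit(self, X, y=None):
        # Download/load the pretrained weights once, at fit time.
        self.model_ = self.model_loader()
        return self

    def transform(self, X):
        # One embedding row per input string.
        return np.vstack([self.model_.get_sentence_vector(str(x)) for x in X])


# Tiny stand-in model so the sketch runs without the multi-GB weights:
class _ToyModel:
    def get_sentence_vector(self, text):
        # Deterministic 4-d vector from character codes (illustration only).
        v = np.zeros(4)
        for i, ch in enumerate(text):
            v[i % 4] += ord(ch)
        return v / max(len(text), 1)


emb = FastTextEncoder(model_loader=_ToyModel).fit(["cat", "dog"]).transform(["cat", "dog"])
print(emb.shape)  # (2, 4)
```

A `SentenceTransformer`-based variant would look the same from the outside; only the loader and the per-string call (`model.encode`) would change, which is why keeping the heavy dependency behind `fit` makes the optional-dependency story manageable.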