skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

[FEAT] Add LSA encoder #1121

Open Vincent-Maladiere opened 1 month ago

Vincent-Maladiere commented 1 month ago

Problem Description

Latent Semantic Analysis (LSA) consists of a TfidfVectorizer followed by Singular Value Decomposition (SVD). Scikit-learn mentions it in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place, @GaelVaroquaux?

Feature Description

Create the LSAEncoder, a simple pipeline chaining TfidfVectorizer and TruncatedSVD (or a PCA, both support sparse matrices).
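
As a rough illustration of the proposal above, here is a minimal sketch of such a pipeline built only from scikit-learn parts. The helper name `make_lsa_encoder`, the character n-gram settings, the number of components, and the toy strings are all assumptions made for the example. They are not the API or defaults that skrub ended up with.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def make_lsa_encoder(n_components=2, random_state=0):
    """Hypothetical helper: tf-idf vectorization followed by truncated SVD."""
    return make_pipeline(
        # Assumption: character n-grams, which tend to suit short strings;
        # the issue itself only says "TfidfVectorizer".
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
        # TruncatedSVD accepts the sparse tf-idf matrix directly.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )


# Toy data, invented for the example.
strings = [
    "senior data scientist",
    "data scientist",
    "junior data analyst",
    "police officer",
    "police sergeant",
    "fire fighter",
]

encoder = make_lsa_encoder(n_components=2)
embeddings = encoder.fit_transform(strings)
print(embeddings.shape)  # one dense row of n_components values per string
```

The result is a dense array with one low-dimensional row per input string, which can be fed to any downstream estimator. Swapping `TruncatedSVD` for `PCA` would also center the tf-idf features, so it is not a strictly equivalent choice.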

Alternative Solutions

No response

Additional Context

No response

GaelVaroquaux commented 1 month ago

Great!!

We need to think about a name. I think LSA is a rather technical name that might not ring a bell for non-technical users.

We brainstormed a bit about names with @jeromedockes and @rcap107. The name StringEncoder came to mind. It would be close to TextEncoder (https://github.com/skrub-data/skrub/pull/1077), but we feel that the difference is somewhat understandable.

That said, maybe this is an argument for renaming TextEncoder to SentenceEncoder, which might also be a good name because it is more explicit (it links to "SentenceTransformer").

Vincent-Maladiere commented 1 month ago

Very interesting! One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?

GaelVaroquaux commented 1 month ago

> One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?

Yes, this was raised, and it is true. I guess one difference I see is that the GapEncoder assumes more latent structure (aka dirty-category structure) than open-ended strings have.

One argument for naming it the "StringEncoder" is that if you really have no prior information on the data or on how the encoding will be used, it's probably a good default way to encode a string. Of course, we'll need a good "see also" section and a good discussion in the docs.

Vincent-Maladiere commented 1 month ago

Okay, this sounds easy to explain in the doc!

Vincent-Maladiere commented 1 month ago

> Scikit-learn mentions LSA in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place

Any thoughts @GaelVaroquaux? I'm curious

GaelVaroquaux commented 1 month ago

Probably because it's easy to implement with the tools already in scikit-learn, and because scikit-learn is general-purpose (not focused on text or the like), it didn't feel like it belonged there.