Open Vincent-Maladiere opened 1 month ago
Great!!
We need to think about a name. I think LSA is a bit of a technical name that might not ring a bell with non-technical users.
We brainstormed names a bit with @jeromedockes and @rcap107. The name StringEncoder came to mind. It would be close to TextEncoder (https://github.com/skrub-data/skrub/pull/1077), but we feel that the difference is somewhat understandable.
That said, it might be an argument for renaming TextEncoder to SentenceEncoder, which would (maybe) also be a good name because it is more explicit (it links to "SentenceTransformer").
Very interesting! One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?
Yes, this was raised, and it is true. I guess the one difference I see is that the GapEncoder assumes more latent structure (aka dirty-category structure) than open-ended strings have.
One argument for naming it the "StringEncoder" is that if you really have no prior information on the data or the use of the encoding, it's probably a good default for encoding a string. Of course, we'll need a good "see also" section and a good discussion in the docs.
Okay, this sounds easy to explain in the doc!
Scikit-learn mentions LSA in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place
Any thoughts @GaelVaroquaux? I'm curious
Probably because it's easy to implement with the tools in scikit-learn, and since scikit-learn is general (not focused on text or the like), it didn't feel like it belonged there.
Problem Description
Latent Semantic Analysis (LSA) consists of a TfidfVectorizer followed by Singular Value Decomposition (SVD). Scikit-learn mentions it in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place, @GaelVaroquaux?
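To make the chain described above concrete, here is a minimal sketch using only existing scikit-learn tools. The helper name `make_lsa_encoder`, the default vectorizer settings, the toy strings, and `n_components=2` are assumptions for illustration; none of this is an existing skrub or scikit-learn API, and the eventual encoder may well choose different defaults.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def make_lsa_encoder(n_components=2, random_state=0):
    """Hypothetical helper: TF-IDF followed by truncated SVD (i.e. LSA)."""
    return make_pipeline(
        # Default word-level TF-IDF; produces a sparse document-term matrix.
        TfidfVectorizer(),
        # TruncatedSVD accepts sparse input directly, without densifying it.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )


# Toy data, made up for this example.
strings = [
    "london city centre",
    "paris city centre",
    "london bridge",
    "paris metro station",
    "new york city",
]

encoder = make_lsa_encoder(n_components=2)
embeddings = encoder.fit_transform(strings)

# One dense row per input string, one column per SVD component.
print(embeddings.shape)
```

Note that `n_components` has to stay below the vocabulary size produced by the vectorizer, so very small or very repetitive inputs may need a smaller value.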
Feature Description
Create the LSAEncoder, a simple pipeline chaining TfidfVectorizer and TruncatedSVD (or a PCA, both support sparse matrices).
Alternative Solutions
No response
Additional Context
No response