Closed DomHudson closed 6 years ago
I'm happy to make a PR with a classifier that takes input in the above form, if that is helpful?
Hey @DomHudson ,
That's a super legit question, not a basic one!
Well, the four different classifiers are divided into two groups:

1. Those expecting a `numpy.ndarray` object (at least because a `pandas.DataFrame` can be naturally cast to an `ndarray` object; it happens all the time if you give an `sklearn` classifier a dataframe to fit on). At the moment, this group includes `FirstColFtClassifier` and `IdxBasedFtClassifier`.
2. Those expecting a `pandas.DataFrame` object. At the moment, this group includes `FirstObjFtClassifier` and `ColLblBasedFtClassifier`.

As you have correctly identified, in all cases this is a 2D input. And you are correct, again, in stating that we don't really need the two dimensions. However, `scikit-learn`'s default classifier API assumes input data is a 2-dimensional `numpy.ndarray` object. Now, since the whole purpose of this package is to bridge this difference and adapt the `fasttext` code to an `sklearn`-y format, the wrappers I wrote expect, in turn, to get 2D input, and their main job is to extract the correct text column from that input and forward it in a concise and extensible way to the actual `fasttext` code.
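Concretely, adapting a 1D list of strings to the 2D shapes these two groups of wrappers expect is a one-line reshape. A minimal sketch (the column name `'text'` is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

texts = ["this is a sentence", "another sentence"]

# 2D ndarray with the text in a single column, for the
# ndarray-based wrappers (e.g. FirstColFtClassifier)
X_arr = np.array(texts).reshape(-1, 1)  # shape (2, 1)

# Equivalent single-column DataFrame, for the
# DataFrame-based wrappers (e.g. FirstObjFtClassifier)
X_df = pd.DataFrame({"text": texts})
```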
I hope this clears things up, but if not, feel free to keep asking questions in this same issue. I am closing it, however, as this is a design choice (and again, the purpose of the package), and not an issue. 😄
Cheers, Shay
Hi,
Thanks for the detailed response; I've just got a quick question, if you don't mind.
As the FastText classifier essentially performs text vectorization as well as classification, would there be a benefit in adding a classifier (one that conforms to the sklearn `fit`, `predict_proba`, etc. methods) that takes a 1D text array?
Essentially where I've got to with my thinking is that the FastText classifier is closer in function to an entire pipeline than just the classifier portion of it.
For example:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Classifier [expects 2D, pre-vectorized input]
# Takes vectors as rows
clf = LogisticRegression()

# Pipeline [expects 1D, non-vectorized input]
# Takes strings as rows
clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])
```
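To make the contrast concrete, the `Pipeline` variant really can be fit directly on a 1D list of raw strings, since the vectorizer handles the text-to-matrix conversion internally. A minimal runnable sketch with a toy corpus (the data here is invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus: plain strings, no manual vectorization step
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])

# The pipeline accepts the 1D list of strings directly
pipe.fit(texts, labels)
preds = pipe.predict(["good film", "awful movie"])
```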
Or is the reason for not adding this that you would recommend just using the official Python API for this use case?
Many thanks, Dom
Hi,
I'm very sorry for asking such a basic question, but I can't work this one out! Usually, I see other text classifiers taking one of three forms;
I'm a little confused, as the readme does not have a case where multiple tokens are inputted into the model. However, in the tests it appears that it is trained on a `pd.DataFrame` for X and a `pd.Series` for y. I believe fasttext does the tokenisation and vectorisation itself, so why do we need a two-dimensional input instead of a 1D list of strings? Is there a benefit to doing it that way over something like this, or the equivalent but with 1D numpy arrays?
Many thanks! Dom