shaypal5 / skift

scikit-learn wrappers for Python fastText.
MIT License

1D array input for training #4

Closed by DomHudson 6 years ago

DomHudson commented 6 years ago

Hi,

I'm very sorry for asking such a basic question, but I can't work this one out! Usually, I see other text classifiers taking one of three forms (sketched below):

  1. (1D) List of strings, if it performs tokenisation and vectorisation itself
  2. (2D) List of tokens if it performs vectorisation itself
  3. (2D) List of vectors if it is just a classifier
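For concreteness, here is a tiny sketch of those three shapes (the data values are invented purely for illustration):

# Illustrative only: the three common input shapes listed above.
X_strings = ['the cat sat on the mat', 'dogs bark loudly']      # 1. 1D list of raw strings
X_tokens = [['the', 'cat', 'sat'], ['dogs', 'bark', 'loudly']]  # 2. 2D list of token lists
X_vectors = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.4]]                  # 3. 2D list of pre-computed vectors
y = [0, 1]                                                      # one label per row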

I'm a little confused, as the readme does not have a case where multiple tokens are inputted into the model. However, in the tests it appears that the model is trained on a pd.DataFrame for X and a pd.Series for y. I believe fasttext does the tokenisation and vectorisation itself, so why do we need a two-dimensional input instead of a 1D list of strings? Is there a benefit to doing it that way over something like this:

FtClassifier().fit(
    ['Input 1', 'Input 2'],
    [1, 0]
)

or the equivalent but with 1D numpy arrays?

Many thanks! Dom

DomHudson commented 6 years ago

I'm happy to make a PR with a classifier that takes input in the above form, if that is helpful?

shaypal5 commented 6 years ago

Hey @DomHudson ,

That's a super legit question, not a basic one!

Well, the four different classifiers are divided into two groups:

  1. Those that assume that the input data X is (at least) a numpy.ndarray object ("at least" because a pandas.DataFrame can be naturally cast to an ndarray object; this happens all the time when you give an sklearn classifier a dataframe to fit on). At the moment, this group includes FirstColFtClassifier and IdxBasedFtClassifier.
  2. Those that assume that the input data X is a pandas.DataFrame object. At the moment, this group includes FirstObjFtClassifier and ColLblBasedFtClassifier.

As you have correctly identified, in all cases this is 2D input. And you are correct, again, in stating that we don't really need the two dimensions. However, scikit-learn's default classifier API assumes the input data is a 2-dimensional numpy.ndarray object. Since the whole purpose of this package is to bridge that difference and adapt the fasttext code to an sklearn-y format, the wrappers I wrote expect, in turn, to get 2D input, and their main job is to extract the correct text column from that input and forward it in a concise and extensible way to the actual fasttext code.
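To make that concrete, here is a rough sketch of the 2D-input pattern described above (hyperparameters are omitted, the column names are arbitrary, and the predict call is just the standard sklearn-style one, not something specific to this example):

import pandas as pd
from skift import FirstColFtClassifier

# The text sits in a single column of a 2D input (DataFrame or ndarray),
# and the wrapper extracts that column before handing it to fasttext.
df = pd.DataFrame(
    data=[['woof woof woof', 0], ['meow meow', 1]],
    columns=['txt', 'lbl'],
)

clf = FirstColFtClassifier()      # hyperparameters omitted for brevity
clf.fit(df[['txt']], df['lbl'])   # note the 2D X: a one-column DataFrame
preds = clf.predict(df[['txt']])  # standard sklearn-style predict call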

I hope this clears things up, but if not, feel free to keep asking questions in this same issue. I am closing it, however, as this is a design choice (and again, the purpose of the package), and not an issue. 😄

Cheers, Shay

DomHudson commented 6 years ago

Hi;

Thanks for the detailed response! I've just got a quick follow-up question, if you don't mind.

As the FastText classifier essentially performs text vectorisation as well as classification, would there be a benefit in adding a classifier that conforms to the sklearn fit, predict_proba, etc. methods but takes a 1D text array?

Essentially, where I've got to with my thinking is that the FastText classifier is closer in function to an entire pipeline than to just the classifier portion of it.

For example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Classifier [expects 2D, pre-vectorized input]
# Takes vectors as rows
clf = LogisticRegression()

# Pipeline [expects 1D, non-vectorized input]
# Takes strings as rows
clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])
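As a rough illustration of what I mean (purely hypothetical: the SeriesFtClassifier name and the idea of re-wrapping the 1D input into a one-column frame are just a sketch of mine, not anything skift currently provides):

import pandas as pd
from skift import FirstColFtClassifier

class SeriesFtClassifier(FirstColFtClassifier):
    """Hypothetical wrapper: accepts a 1D sequence of strings and re-wraps
    it as the one-column, 2D input the existing classifiers expect."""

    @staticmethod
    def _to_frame(X):
        # Wrap a 1D list/array/Series of raw strings into a one-column DataFrame.
        return pd.DataFrame({'text': list(X)})

    def fit(self, X, y):
        return super().fit(self._to_frame(X), y)

    def predict(self, X):
        return super().predict(self._to_frame(X))

# Usage would then mirror the 1D form from the top of this thread:
# SeriesFtClassifier().fit(['Input 1', 'Input 2'], [1, 0])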

Or is the reason for not adding this that you would recommend just using the official Python API for this use case?

Many thanks, Dom