ntucllab / libact

Pool-based active learning in Python
http://libact.readthedocs.org/
BSD 2-Clause "Simplified" License

libact can't adequately handle sparse matrices (csr_matrix) #155

Closed kYwzor closed 5 years ago

kYwzor commented 5 years ago

I'm trying to implement Active Learning on sentiment analysis of sentences using a bag-of-words model. To get the model I'm using scikit-learn's CountVectorizer, which outputs a SciPy csr_matrix (a sparse matrix with token counts). Below is an excerpt of my code, which has df as a Pandas dataframe, with df.text being the sentences and df.sentiment their corresponding labels (3 classes).

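import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from libact.base.dataset import Dataset
from libact.labelers import IdealLabeler
from libact.models import LogisticRegression
from libact.query_strategies import UncertaintySampling

vectorizer = CountVectorizer(analyzer="word", stop_words=stop_words, min_df=2)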
bags = vectorizer.fit_transform(df.text)

X_train, X_test, y_train, y_test = train_test_split(bags, df.sentiment, test_size=0.1)
while len(y_train.unique()) != 3: 
    X_train, X_test, y_train, y_test = train_test_split(bags, df.sentiment, test_size=0.1)

fully_labeled_train_ds = Dataset(X_train, y_train)
test_ds = Dataset(X_test, y_test)
unlabeled_amount = len(y_train) - STARTING_LABELS
train_ds = Dataset(X_train,
                   np.concatenate([y_train[:STARTING_LABELS], [None] * unlabeled_amount]))

lbr = IdealLabeler(fully_labeled_train_ds)
model = LogisticRegression()

qs = UncertaintySampling(train_ds, method='lc', model=model)  # Crashes here

Traceback (most recent call last):
  File "activeLearning.py", line 41, in <module>
    qs = UncertaintySampling(train_ds, method='lc', model=model)  # Crashes here
  File "/home/kyw/.local/lib/python3.6/site-packages/libact/query_strategies/uncertainty_sampling.py", line 83, in __init__
    self.model.train(self.dataset)
  File "/home/kyw/.local/lib/python3.6/site-packages/libact/models/logistic_regression.py", line 24, in train
    return self.model.fit(*(dataset.format_sklearn() + args), **kwargs)
  File "/home/kyw/.local/lib/python3.6/site-packages/sklearn/linear_model/logistic.py", line 1285, in fit
    accept_large_sparse=solver != 'liblinear')
  File "/home/kyw/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
    estimator=estimator)
  File "/home/kyw/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/home/kyw/.local/lib/python3.6/site-packages/numpy/core/numeric.py", line 501, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence

According to libact's docs, the LogisticRegression model is an interface to scikit-learn's logistic regression model, which claims to be able to "handle both dense and sparse input". As such, I believe this is unexpected behaviour. If the code is changed as follows...

bags = vectorizer.fit_transform(df.text)
bags = bags.toarray()

... the rest of the program runs perfectly. However, since my dataset and vocabulary are fairly large (over 65k sentences), I'm running into a lot of out-of-memory problems, which I believe would be fixed if I could use sparse matrices. I'm not sure exactly what the root of the issue is, but I'm suspicious of the format_sklearn() method, since there are quite a lot of conversions going on there. However, I wasn't able to investigate it thoroughly, so I can't pinpoint the problem yet.
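To give a rough sense of why toarray() hurts memory, here is a small sketch comparing the storage of a CSR matrix with its dense equivalent. The shape and density are arbitrary illustrative values, not measurements from the dataset described above, and the dense size is computed arithmetically rather than by actually allocating the array.

```python
import numpy as np
import scipy.sparse as sp

# Arbitrary toy dimensions (assumption): 1,000 documents, 5,000 vocabulary terms,
# with 0.1% of the entries non-zero.
X = sp.random(1000, 5000, density=0.001, format="csr",
              dtype=np.float64, random_state=0)

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# A dense array stores every cell, zero or not.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print("sparse bytes:", sparse_bytes)
print("dense bytes: ", dense_bytes)
```

The gap grows with the vocabulary size, since bag-of-words rows typically contain only a handful of non-zero counts.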

eugene-yang commented 5 years ago

At the very least, the way libact.base.dataset currently handles sparse matrices is not ideal.

    def __init__(self, X=None, y=None):
        if X is None: X = []
        if y is None: y = []
        self.data = list(zip(X, y))
        self.modified = True
        self._update_callback = set()

list(zip(X, y)) creates a list of (sparse feature vector, label) pairs. Calling get_labeled_entries() just applies a filter to self.data, and the output format is not accepted by scikit-learn's fit interface. format_sklearn instead converts the data into a NumPy array whose rows are SciPy sparse matrices, which scikit-learn does not accept either. The correct approach is to stack them.
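The stacking idea can be sketched as follows. This is a minimal illustration on a toy matrix, not libact's actual code: it mimics what list(zip(X, y)) stores and then uses scipy.sparse.vstack to rebuild a single sparse matrix from the per-sample rows.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in for the CountVectorizer output: 4 samples, 5 features.
X = sp.csr_matrix(np.array([
    [1, 0, 0, 2, 0],
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [4, 0, 0, 0, 1],
]))
y = [0, 1, None, 2]  # None marks an unlabeled sample

# Same storage as Dataset.__init__: one (1 x n_features sparse row, label) pair per sample.
data = list(zip(X, y))

# Filter the labeled pairs, then stack the sparse rows back into one CSR matrix.
labeled = [(x, label) for x, label in data if label is not None]
X_labeled = sp.vstack([x for x, _ in labeled], format="csr")
y_labeled = np.array([label for _, label in labeled])

print(X_labeled.shape, y_labeled)
```

The stacked result is a regular two-dimensional sparse matrix, which is the kind of input scikit-learn estimators that support sparse data expect.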

yangarbiter commented 5 years ago

Yes, we don't currently support sparse feature vectors, so this is probably a documentation error.

eugene-yang commented 5 years ago

It looks like the main cause is the get_labeled_entries and get_unlabeled_entries methods in libact.base.dataset.Dataset. Is there any specific reason for returning a single list instead of a tuple of a feature matrix and a list of labels? I searched the package and saw that most uses of these two methods end up calling zip on the output to turn it back into a matrix and a list. I don't see why we should zip them after getting them instead of changing the interface.
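One way the suggested interface could look is sketched below. TupleDataset is a hypothetical class written for illustration only; it is not libact's Dataset and does not claim to match whatever change was eventually made. It keeps the feature matrix and labels separate so that sparse input is never split into per-row objects.

```python
import numpy as np
import scipy.sparse as sp


class TupleDataset:
    """Hypothetical dataset that stores features and labels separately."""

    def __init__(self, X, y):
        self._X = sp.csr_matrix(X)
        # Object dtype so that None can mark an unlabeled sample.
        self._y = np.array(y, dtype=object)

    def _labeled_mask(self):
        return np.array([label is not None for label in self._y], dtype=bool)

    def get_labeled_entries(self):
        """Return (feature matrix, list of labels) for labeled samples."""
        idx = np.flatnonzero(self._labeled_mask())
        return self._X[idx], self._y[idx].tolist()

    def get_unlabeled_entries(self):
        """Return (sample indices, feature matrix) for unlabeled samples."""
        idx = np.flatnonzero(~self._labeled_mask())
        return idx, self._X[idx]


ds = TupleDataset(np.eye(4), [0, None, 1, None])
X_l, y_l = ds.get_labeled_entries()
idx_u, X_u = ds.get_unlabeled_entries()
print(X_l.shape, y_l, idx_u.tolist(), X_u.shape)
```

Because row selection is done by indexing the CSR matrix directly, both return values stay sparse and could be passed to an estimator without any zipping or restacking.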

eugene-yang commented 5 years ago

I think it is handled by #165.