adithram commented 7 years ago

Actually raising this issue again. Posted as a comment on a closed issue - wasn't sure if the notification system worked the same way with comments on closed issues.

I am currently having the following issue in the context of a binary classification problem. I have a set of data that I would like to use active learning to label as either anomalous or non-anomalous based on a small set of labelled data.

[Screenshot taken 2017-07-25 at 3:33:16 PM]

Is there a specific format that we have to follow for the features that we feed into the Dataset() function? Or is my understanding of a binary active learning problem incorrect, or does my implementation have a significant programming flaw? Any help is appreciated.

Code:

yangarbiter commented 7 years ago

Could you post the contents of unknown_labels and known_labels?

Also, what happens if you switch to a different query strategy? Is the same error raised again?

Thanks.

adithram commented 7 years ago

known_labels: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

unknown_labels is a massive list of nonetype objects

With RandomSampling, I have no issues with the make_query() function, although I still have to debug the following error with the ideal_labeler:

Traceback (most recent call last):
  File "active-learning.py", line 145, in <module>
    lbl = ideal_labeler.label(combined_dataset.data[query_id][0])
  File "/Users/ytk326/anaconda2/lib/python2.7/site-packages/libact/labelers/ideal_labeler.py", line 33, in label
    for x in self.X])[0][0]]
IndexError: index 0 is out of bounds for axis 0 with size 0

yangarbiter commented 7 years ago

Which numpy version are you using?

In issue #37, it seems that np.where behaved differently before version 1.11.0. Maybe try updating numpy first?
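
For context, the IndexError in the traceback is what NumPy raises when the result of np.where is empty and is then indexed with [0][0]. Regardless of the NumPy version, that happens whenever the lookup finds no matching row. The sketch below reproduces the pattern with made-up data; lookup_label is a hypothetical stand-in modeled on the fragment visible in the traceback, not libact's actual code.

```python
import numpy as np

# Hypothetical labeled pool: two feature vectors and their labels.
X = [np.array([0.0, 1.0]), np.array([2.0, 3.0])]
y = [1, 0]

def lookup_label(feature):
    """Return the label of the stored vector equal to `feature`."""
    matches = np.where([np.array_equal(x, feature) for x in X])[0]
    if matches.size == 0:
        # Without this check, matches[0] raises
        # "IndexError: index 0 is out of bounds for axis 0 with size 0".
        raise KeyError("feature vector not found in the labeled pool")
    return y[matches[0]]

print(lookup_label(np.array([2.0, 3.0])))
```

If the labeler was built from a different dataset than the one being queried, or the features differ in dtype or rounding, no row matches and the unguarded version fails exactly as shown.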

yangarbiter commented 7 years ago

If you are able to modify the source code, it would be helpful to know whether there is any str in the variables X, y, and weight at https://github.com/ntucllab/libact/blob/master/libact/query_strategies/hintsvm.py#L151

adithram commented 7 years ago

I just caught my bug regarding the buffer dtype mismatch. There was a small corner case in my feature connection that was leaving a string in place.
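
A stray string like this can be located before the data ever reaches the query strategy. The following is a minimal sketch, assuming the features are still plain Python lists of rows; find_non_numeric and the sample data are hypothetical and not part of libact.

```python
import numbers

def find_non_numeric(rows):
    """Return (row, column, value) for every entry that is not a real number."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not isinstance(value, numbers.Real):
                bad.append((i, j, value))
    return bad

features = [[0.5, 1.2], [3.0, "n/a"], [4.1, 2.2]]  # made-up data
print(find_non_numeric(features))  # prints [(1, 1, 'n/a')]
```

Run the check on the raw lists rather than on an array built from them: once NumPy converts a row containing a string, every element of that row becomes a string.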

Additionally, while writing this, I've been experimenting with hintsvm.py and made modifications in two places: I removed the tolist() calls from the lines that loop over labeled_pool and unlabeled_pool:

Original:

x.tolist() for x in labeled_pool
x.tolist() for x in unlabeled_pool

Error:

Traceback (most recent call last):
  File "active-learning.py", line 148, in <module>
    query_id = hinted_svm_qs.make_query()
  File "/Users/ytk326/anaconda2/lib/python2.7/site-packages/libact/query_strategies/hintsvm.py", line 149, in make_query
    X = [x.tolist() for x in labeled_pool] +\
AttributeError: 'list' object has no attribute 'tolist'

Changed to:

x for x in labeled_pool
x for x in unlabeled_pool

Any thoughts regarding this?

yangarbiter commented 7 years ago

I think I previously assumed that labeled_pool and unlabeled_pool were numpy arrays. Maybe I should add np.asarray to make sure these two variables are indeed numpy arrays.
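
As a rough illustration of the np.asarray idea, the sketch below normalizes each row so that tolist() works whether the row arrives as a list or as an ndarray. The helper rows_as_lists and the sample pool are assumptions made for the example; this is not the change that was made in libact.

```python
import numpy as np

def rows_as_lists(pool):
    """Convert each row to a plain list of floats, whatever its input type."""
    return [np.asarray(x, dtype=np.float64).tolist() for x in pool]

# Made-up pool that mixes a plain list and an ndarray on purpose.
labeled_pool = [[0.0, 1.0], np.array([2.0, 3.0])]
X = np.array(rows_as_lists(labeled_pool), dtype=np.float64)
print(X.shape, X.dtype)
```

np.asarray does not copy when the input is already a float64 array, so the normalization costs little for well-formed input.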

adithram commented 7 years ago

So you suggest storing the dataset as a numpy array prior to passing it to the query strategy function (rather than removing the tolist() function calls)?

yangarbiter commented 7 years ago

@adithram I think you can work around it like this for now, and I'll discuss with @skgg where to make the change to the Dataset object.

sian-chen commented 7 years ago

Hi @adithram,

Currently, HintSVM works only when the inputs are float64 numpy arrays, due to the Cython implementation. For now, you can convert the lists to numpy arrays to make it work. We will solve this problem soon. Thanks for reporting.

adithram commented 7 years ago

I can't find the implementation of get_unlabeled_entries() or get_labeled_entries() so I am not positive about this, but isn't the requirement dependent on the output of those functions?

In other words, are you suggesting that I modify the output of those functions to return float64 numpy arrays? Or is simply creating a dataset using two float64 numpy arrays enough to force the desired behavior to occur?

yangarbiter commented 7 years ago

The implementation is here: https://github.com/ntucllab/libact/blob/master/libact/base/dataset.py#L159.

The current problem seems to be that the Dataset object does not guarantee that the data is a numpy array with dtype=float64.

I've started a PR to fix this: #122. @skgg, please check it after it passes CI. @adithram, please let us know if this patch solves your problem.

adithram commented 7 years ago

UPDATE: I created the datasets using two float64 numpy arrays, but still received the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-9b39a471ec5b> in <module>()
     21 for i in range(num_queries):
     22     print i
---> 23     query_id = hinted_svm_qs.make_query()
     24     lbl = ideal_labeler.label(combined_dataset.data[query_id][0])
     25     combined_dataset.update(query_id, lbl)

/Users/ytk326/anaconda2/lib/python2.7/site-packages/libact/query_strategies/hintsvm.py in make_query(self)
    154             np.array(y, dtype=np.float64),
    155             np.array(weight, dtype=np.float64),
--> 156             np.array([x.tolist() for x in unlabeled_pool], dtype=np.float64),
    157             self.svm_params)
    158 

AttributeError: 'list' object has no attribute 'tolist'

adithram commented 7 years ago

Additionally, some unusual behavior: after removing the tolist() call, the AttributeError is still raised:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-44-9b39a471ec5b> in <module>()
     21 for i in range(num_queries):
     22     print i
---> 23     query_id = hinted_svm_qs.make_query()
     24     lbl = ideal_labeler.label(combined_dataset.data[query_id][0])
     25     combined_dataset.update(query_id, lbl)

/Users/ytk326/anaconda2/lib/python2.7/site-packages/libact/query_strategies/hintsvm.py in make_query(self)
    154             np.array(y, dtype=np.float64),
    155             np.array(weight, dtype=np.float64),
--> 156             np.array([x for x in unlabeled_pool], dtype=np.float64),
    157             self.svm_params)
    158 

AttributeError: 'list' object has no attribute 'tolist'

adithram commented 7 years ago

Is it okay that my features are constructed as:

[numpy ndarray] of [numpy ndarray] of [float64]

In other words, I have a list of feature vectors, where each feature vector is constructed of various float64 values.
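
One thing worth checking with this layout is whether the outer array is a true 2-D float64 matrix or a 1-D array with dtype=object that merely holds row arrays; the two print similarly but behave differently. The sketch below, with a hypothetical helper as_float64_matrix and made-up rows, coerces either form into a single (n_samples, n_features) float64 array. Whether this alone satisfies HintSVM is not confirmed here.

```python
import numpy as np

def as_float64_matrix(features):
    """Return a 2-D float64 array, stacking per-row arrays if needed."""
    arr = np.asarray(features)
    if arr.dtype == object:
        # An object array of row arrays: stack the rows into one matrix.
        arr = np.vstack([np.asarray(row, dtype=np.float64) for row in arr])
    arr = arr.astype(np.float64)
    if arr.ndim != 2:
        raise ValueError("expected (n_samples, n_features), got %r" % (arr.shape,))
    return arr

# Made-up example of the "ndarray of ndarray" layout with dtype=object.
rows = np.empty(2, dtype=object)
rows[0] = np.array([0.0, 1.0])
rows[1] = np.array([2.0, 3.0])

M = as_float64_matrix(rows)
print(rows.dtype, rows.shape, "->", M.dtype, M.shape)
```

Rows of unequal length make np.vstack raise a ValueError, which is a useful early signal that the feature vectors are not uniform.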

sian-chen commented 7 years ago

Hi @adithram,

It seems the modified code is not what is actually running. The AttributeError should not occur once tolist() has been removed successfully. You need to edit the source code and then reinstall the package for the changes to take effect. We will fix it as soon as possible. Thanks for reporting.
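
To confirm which file Python actually loaded and whether the edit is present in it, a small check like the one below can help. It is a sketch using only the standard library; source_mentions is a hypothetical helper, and json stands in for the real module so the example runs without libact installed.

```python
import inspect
import json  # stand-in; with libact installed, pass libact.query_strategies.hintsvm

def source_mentions(module, needle):
    """Return the module's source path and whether its source contains `needle`."""
    path = inspect.getsourcefile(module)
    return path, needle in inspect.getsource(module)

path, found = source_mentions(json, "def dumps")
print(path, found)
```

In a notebook, a module stays as it was when first imported, while tracebacks show the current text of the file on disk. An edited line displayed next to an error from the old code is therefore consistent with a kernel that has not been restarted since the edit.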

yangarbiter commented 7 years ago

#123: I've removed the tolist() call.

yangarbiter commented 6 years ago

This problem seems to be fixed.

ntucllab / libact

HintSVM - Buffer dtype mismatch error #120
