ntucllab / libact

Pool-based active learning in Python
http://libact.readthedocs.org/
BSD 2-Clause "Simplified" License

HintSVM mldataset - Buffer dtype mismatch error #95

Closed: lironesamoun closed this issue 7 years ago

lironesamoun commented 7 years ago

Hi,

I am trying to use the HintSVM query strategy with the vehicle dataset from mldata, but I get the following error and don't understand why:

File "testing.py", line 60, in run
    ask_id = qs.make_query()
  File "/usr/local/lib/python3.5/site-packages/libact-0.1.2-py3.5-macosx-10.12-x86_64.egg/libact/query_strategies/hintsvm.py", line 151, in make_query
    np.array([x.tolist() for x in unlabeled_pool]), self.svm_params)
  File "libact/query_strategies/_hintsvm.pyx", line 16, in libact.query_strategies._hintsvm.hintsvm_query (libact/query_strategies/_hintsvm.c:1836)
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'long'

I don't get this error when I use other strategies (UncertaintySampling, QUIRE).

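import numpy as np
import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from libact.base.dataset import Dataset
from libact.models import SVM
from libact.query_strategies import HintSVM


def split_scale_train_test(name_dataset, test_size):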
    # choose a dataset with unbalanced class instances
    #data = sklearn.datasets.fetch_mldata('segment')
    data = sklearn.datasets.fetch_mldata(name_dataset)

    X = StandardScaler().fit_transform(data['data'])
    target = np.unique(data['target'])
    # mapping the targets to 0 to n_classes-1
    y = np.array([np.where(target == i)[0][0] for i in data['target']])

    X_trn, X_tst, y_trn, y_tst = \
        train_test_split(X, y, test_size=test_size, stratify=y)

    # making sure each class appears once initially
    init_y_ind = np.array(
        [np.where(y_trn == i)[0][0] for i in range(len(target))])
    y_ind = np.array([i for i in range(len(X_trn)) if i not in init_y_ind])
    trn_ds = Dataset(
        np.vstack((X_trn[init_y_ind], X_trn[y_ind])),
        np.concatenate((y_trn[init_y_ind], [None] * (len(y_ind)))))

    tst_ds = Dataset(X_tst, y_tst)

    fully_labeled_trn_ds = Dataset(
        np.vstack((X_trn[init_y_ind], X_trn[y_ind])),
        np.concatenate((y_trn[init_y_ind], y_trn[y_ind])))

    cost_matrix = 2000. * np.random.rand(len(target), len(target))
    np.fill_diagonal(cost_matrix, 0)

    return trn_ds, tst_ds, y_trn, y_tst, fully_labeled_trn_ds, cost_matrix


def run(trn_ds, tst_ds, lbr, model, qs, quota):
    E_in, E_out = [], []
    score_train = []
    score_test = []

    for _ in range(quota):
        ask_id = qs.make_query()
        X, _ = zip(*trn_ds.data)
        lb = lbr.label(X[ask_id])
        trn_ds.update(ask_id, lb)

        model.train(trn_ds)
        E_in = np.append(E_in, 1 - model.score(trn_ds))
        E_out = np.append(E_out, 1 - model.score(tst_ds))
        score_train = np.append(score_train, model.score(trn_ds) * 100)
        score_test = np.append(score_test, model.score(tst_ds) * 100)

    return E_in, E_out, score_train, score_test


# Excerpt from the calling script; trn_ds5, tst_ds, idealLabels, n_C, n_gamma,
# quota_to_query, results_out and results_score are defined in code not shown here.
qs5 = HintSVM(trn_ds5, cl=1.0, ch=1.0, p=0.5)
model = SVM(kernel='rbf', C=n_C, gamma=n_gamma, decision_function_shape='ovr')
E_in_5, E_out_5, score_train_5, score_test_5 = run(
    trn_ds5, tst_ds, idealLabels, model, qs5, quota_to_query)
results_out.append(E_out_5.tolist())
results_score.append(score_test_5.tolist())

Do you have any insight into this error?

Thank you.

yangarbiter commented 7 years ago

HintSVM handles binary active learning problems only.

I think I may need to add some warning for this.

Thanks for reporting.
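
A guard of the kind mentioned above could look like the following minimal sketch. It is not libact code: `check_binary_labels` is a hypothetical helper, and it assumes labels are passed as a plain sequence in which `None` marks an unlabeled sample, as in the `Dataset` construction in the report.

```python
import warnings


def check_binary_labels(labels):
    """Return True if the labeled entries contain at most two distinct classes.

    Hypothetical helper, not part of libact. `None` marks an unlabeled sample.
    """
    # Collect the distinct classes among the labeled entries only.
    classes = {lb for lb in labels if lb is not None}
    if len(classes) > 2:
        warnings.warn(
            "Found %d classes (%s); HintSVM handles binary problems only."
            % (len(classes), sorted(classes, key=str))
        )
        return False
    return True


# Toy input with four distinct classes: emits a warning and returns False.
multiclass_ok = check_binary_labels([0, 1, 2, 3, None, None])
# Toy input with two classes: passes silently and returns True.
binary_ok = check_binary_labels([0, 1, None, 1, None])
```

Calling such a check on the labels before constructing the query strategy would flag a multiclass dataset early, instead of failing later inside the compiled routine.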

lironesamoun commented 7 years ago

Indeed, I read the paper afterward and they specify that it works only on binary problems.

Thanks.

adithram commented 7 years ago

I am currently having a similar issue with a binary classification problem. I have a dataset that I would like to label as either anomalous or non-anomalous using active learning, starting from a small set of labelled data.

Is there a specific format that the features passed to Dataset() must follow? Or perhaps my understanding of a binary active learning problem is incorrect, or my implementation has a significant programming flaw? Any help is appreciated.

Relevant code (attached as a screenshot due to issues with markdown):

[Image: screen shot 2017-07-25 at 3 32 10 pm]

Error message:

[Image: screen shot 2017-07-25 at 3 33 16 pm]
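
Judging only from the traceback in the original report, the compiled `hintsvm_query` routine expects `float64` buffers and received an integer one. In that report the features came out of `StandardScaler`, so they were presumably already floating point, which leaves the integer label array as a candidate. Whether the same applies to the binary case here is an assumption, since the error is only visible in the screenshot. The sketch below uses made-up toy data and plain NumPy to show how the dtypes can be inspected and cast before the arrays are handed to `Dataset`.

```python
import numpy as np

# Hypothetical toy data: integer-valued features and binary 0/1 labels.
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])

# dtype.kind is 'i' for signed integers and 'f' for floating point.
features_are_int = X.dtype.kind == "i"
labels_are_int = y.dtype.kind == "i"

# Cast explicitly to float64, the dtype named in the traceback.
X64 = np.asarray(X, dtype=np.float64)
y64 = np.asarray(y, dtype=np.float64)

# Sanity check that the problem really is binary.
is_binary = np.unique(y).size == 2
```

If casting makes the error disappear, the dtype was the problem; if not, the cause lies elsewhere and the sketch does not apply. Note that float labels may interact differently with other models, so this is a diagnostic step rather than a recommended fix.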