openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Feature Request: Active Classification Task #1179

Closed gkrempl closed 1 year ago

gkrempl commented 1 year ago

Task Type

Active Learning

This task addresses active (machine) learning problems. In conventional, passive supervised learning of prediction models, a given set of labelled instances is available for learning a model. Importantly, the algorithm passively processes this given training set, without any choice in extending it. In active learning [Settles, 2012], labelled data is scarce, and supervised learning of prediction models (for classification or regression) typically has to start with no or very few labelled instances. In contrast to passive learning, however, there is typically also a large pool of unlabelled instances, whose labels can be acquired from an oracle. Such an oracle might, for example, be a human domain expert, an expensive-to-perform experiment, or a costly external data source. Since querying a label from such an oracle is expensive, the aim of active learning techniques is to optimise the selection of instances for labelling: minimising labelling costs while simultaneously maximising the subsequent prediction performance of the model.

Interaction Scenario(s)

Several interaction scenarios between an active learning algorithm and an oracle exist. The most common is pool-based active learning, with a pool of unlabelled candidate instances. The active learning algorithm, denoted as query strategy, then iteratively selects instances for labelling from this unlabelled candidate pool. Upon labelling, these instances are removed from the candidate pool and added to the prediction model's training set. Another interaction scenario is stream-based selective sampling, where the unlabelled instances arrive sequentially as a data stream and are only available for labelling and training at that point in time. This scenario requires an immediate, once-and-for-all decision whether to request a label for the current instance or to leave it unlabelled. A third interaction scenario is query synthesis, where no candidate instances are given. Instead, the active learning algorithm iteratively creates instances (feature vectors) de novo and requests labels for them.

These interaction scenarios differ considerably in the interaction protocol, evaluation methodology, and task data they require. The pool-based scenario is the most common, and also the one that overlaps best with the current supervised learning task in OpenML. Therefore, this task will focus solely on the pool-based active learning scenario and leave stream-based selective sampling and query synthesis for potential future, separate tasks.

In pool-based active learning, training of a prediction model might exclusively use supervised learning on the already labelled data. Alternatively, supervised learning can be combined with semi-supervised learning techniques, by simultaneously using the unlabelled candidate pool for training as well. While the current focus is on supervised techniques, the task will in principle be designed to be usable with semi-supervised techniques as well. While the interaction protocols for classification and regression are similar, each will be covered by a separate active learning task, mirroring the existing separation between (passive) classification and (passive) regression tasks in OpenML.

Evaluation

In practice, active learning is used to selectively acquire labels for only a small subset of the instances. However, for simulating and evaluating an active learning task in OpenML, a fully labelled data set is required. Initially, the labels of all instances in the training set are hidden (potentially except for a small initial labelled training set). In each active learning step, the active learning algorithm determines the most useful instance to label. Subsequently, the label of that instance is revealed and the classifier is retrained on the updated training set. At each active learning step, the current performance of the classifier is also evaluated on the test set, using the same measures as in passive classification tasks (e.g., accuracy, ACC, or the area under the receiver operating characteristic curve, AUC). This creates a so-called learning curve, which visualises the classification performance in relation to the number of label queries. A dominating query strategy achieves a higher classification performance than its dominated counterpart at every learning step.

One might be interested in the performance at a specific learning step X, e.g., when a fixed budget specifies the number of label acquisitions in advance. Alternatively, one might be interested in the development of the performance across the active learning steps. For the latter, a common aggregated performance measure is the area under this active learning curve (AULC), or its average. For example, for an experiment over 100 label acquisitions, this might be the average AUC over the corresponding 100 learning steps. An alternative aggregated metric is the data utilisation rate, which corresponds to the number of label acquisitions needed to achieve a target classification performance (the target might be the performance of random sampling). For more details on evaluating active learning, see [KottkeEtal2017].
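As an illustration of these aggregated measures, the following sketch computes the average performance over the learning steps and the data utilisation rate from a purely hypothetical learning curve (the curve values and the target of 0.85 are made up for the example):

import numpy as np

# Hypothetical learning curve: test AUC after each of 100 label acquisitions.
learning_curve = np.linspace(0.6, 0.9, num=100)

# Average performance over all learning steps (the average of the learning
# curve, closely related to the area under the learning curve, AULC).
mean_auc = learning_curve.mean()

# Data utilisation rate: number of label acquisitions needed to reach a target
# performance, e.g., the performance achieved by random sampling.
target = 0.85
reached = learning_curve >= target
data_utilisation_rate = int(np.argmax(reached)) + 1 if reached.any() else None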

Exemplary Use Case: Classification for Fraud Detection in Credit Card Transactions

An exemplary use case is the supervised classification of fraud in credit card transactions. For such a supervised classification problem, an exemplary data set has been collected by the Machine Learning Group at ULB in collaboration with Worldline. This data set is available on OpenML as the CreditCardFraudDetection data set (ID 42175), however only with a subset of features and without labels. The data set with labels for all 284807 transactions is hosted by Kaggle [1]. Due to confidentiality issues, this data set only contains PCA-transformed features and does not state the nature of the original features. However, [Alazizi et al., 2019] list as typical features for such tasks properties associated with the cardholder (e.g., age, city, country), the card (e.g., type and limit), the merchant (e.g., terminal country), and the transaction (e.g., date, time, amount). While such features are automatically recorded for a huge number of transaction instances, obtaining labels as fraudulent or legitimate transactions is very time-consuming, costly, and also error-prone [Alazizi et al., 2019]. Active learning is therefore potentially of great practical use in this application and has been used, for example, by [Carcillo et al., 2018], who report that in their application labels can be queried for only about 0.2% of the transactions.

An active learning task for this use case starts from a classification task with a fully labelled data set. The specification of the test set, the estimation procedure, the evaluation measure, and the target are kept unchanged. In contrast, the training set is split into an optional small initial training set (for which labels are kept) and a larger candidate pool. For this candidate pool, the labels are hidden by creating a copy of this subset with missing values for the labels. The initial training set is used to initialise a classifier (alternatively, a pre-trained model to start with might be given as a parameter). Subsequently, each active learning step starts by selecting an instance from the unlabelled candidate pool for labelling. The unlabelled instance is removed from the candidate pool, and its labelled original counterpart is added to the training set. The classifier is updated on the expanded training set and evaluated on the test set.

The stored information on each learning step must allow reproducing the classification model and its performance at this stage. This comprises the instance selected for labelling at this step, as well as the predictions for each instance in the test set, and the performance metrics computed on the test set. This allows visualising and comparing the learning curves of different active learning strategies ex post, and computing aggregated measures. In addition, the computation time and memory requirements for instance selection, classifier training, and testing should be captured and stored for each active learning step. As for passive classification, different classifier technologies might be evaluated and compared on the same active learning classification task. In addition, different active learning strategies might be used and compared as well. Thus, the aim of such an evaluation is to provide a realistic estimate of the out-of-sample classification performance when deploying such a combination of classifier technology and query strategy on a related, novel, and yet unlabelled data set. The front end should allow visualisation and comparison of performance by classifier or by query strategy.
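A minimal sketch of the label-hiding step described above, with made-up sizes and placeholder labels (scikit-activeml's MISSING_LABEL, i.e., NaN, is used here as the marker for hidden labels):

import numpy as np
from skactiveml.utils import MISSING_LABEL

rng = np.random.default_rng(0)
y_train_true = rng.integers(0, 2, size=200)  # placeholder labels of the training split

# Candidate pool: a copy of the training labels with all values hidden ...
y_train = np.full(y_train_true.shape, MISSING_LABEL, dtype=float)

# ... except for a small, optional initial labelled training set.
n_initial = 10
initial_idx = rng.choice(len(y_train_true), size=n_initial, replace=False)
y_train[initial_idx] = y_train_true[initial_idx]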

Potential Extension: Enabling Experimentation with Hybrid / Combined Approaches

For some classifier technologies such as SVMs, the literature indicates that it might be beneficial to use different classifier technologies during instance selection [TomanekMorik2011]. That is, a selector model, such as a naive Bayes classifier, is used during the active learning steps, for example to calculate uncertainty estimates for uncertainty sampling as query strategy, while an SVM is used as prediction model once the active learning process has completed. Apart from improving classification performance, another motivation might be training time: using a fast-to-train model as selector and a complex, slow-to-train model as predictor. In such cases, in addition to the prediction model, an optional selector model can be specified. The classification performance of the prediction model (e.g., the SVM) is used in the performance evaluation, while the selector model (e.g., NB) is only used internally (and for selection time calculation).
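A sketch of such a combination with scikit-activeml, assuming a naive Bayes selector and an SVM predictor (the synthetic data set and the loop length of 20 queries are arbitrary placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL

X, y_true = make_classification(n_samples=200, random_state=0)
y = np.full(y_true.shape, MISSING_LABEL, dtype=float)
y[:10] = y_true[:10]  # small initial labelled set

# Fast-to-train selector model, used only inside the query strategy ...
selector = SklearnClassifier(GaussianNB(), classes=np.unique(y_true), random_state=0)
# ... while a (slower) SVM serves as prediction model for the evaluation.
predictor = SklearnClassifier(SVC(probability=True), classes=np.unique(y_true), random_state=0)

qs = UncertaintySampling(method='entropy', random_state=0)
for _ in range(20):
    selector.fit(X, y)
    query_idx = qs.query(X=X, y=y, clf=selector)  # selection is based on the selector model
    y[query_idx] = y_true[query_idx]

# Only the prediction model is evaluated on the test set.
predictor.fit(X, y)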

Bibliography:

[1] https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

@inproceedings{Alazizi2019,
  title        = {Anomaly detection, consider your dataset first an illustration on fraud detection},
  author       = {Alazizi, Ayman and Habrard, Amaury and Jacquenet, Fran{\c{c}}ois and He-Guelton, Liyun and Obl{\'e}, Fr{\'e}d{\'e}ric and Siblini, Wissam},
  booktitle    = {2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI)},
  pages        = {1351--1355},
  year         = {2019},
  organization = {IEEE}
}

@article{CarcilloEtal2018,
  title     = {Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization},
  author    = {Carcillo, Fabrizio and Le Borgne, Yann-A{\"e}l and Caelen, Olivier and Bontempi, Gianluca},
  journal   = {International Journal of Data Science and Analytics},
  volume    = {5},
  number    = {4},
  pages     = {285--300},
  year      = {2018},
  publisher = {Springer}
}

@article{Mohammad2014,
  title     = {Predicting phishing websites based on self-structuring neural network},
  author    = {Mohammad, Rami M and Thabtah, Fadi and McCluskey, Lee},
  journal   = {Neural Computing and Applications},
  volume    = {25},
  number    = {2},
  pages     = {443--458},
  year      = {2014},
  publisher = {Springer}
}

@book{Settles2012,
  author    = {Settles, Burr},
  title     = {Active Learning},
  series    = {Synthesis Lectures on Artificial Intelligence and Machine Learning},
  number    = {18},
  publisher = {Morgan and Claypool Publishers},
  year      = {2012}
}

@inproceedings{KottkeEtal2017,
  author    = {Kottke, Daniel and Huseljic, Denis and Calma, Adrian and Krempl, Georg and Sick, Bernhard},
  title     = {Challenges of reliable, realistic and comparable active learning evaluation},
  booktitle = {Proc. of the Workshop and Tutorial on Interactive Adaptive Learning},
  series    = {Workshop Proceedings},
  volume    = {1924},
  issn      = {1613-0073},
  year      = {2017},
  publisher = {CEUR}
}

Required Task Data

The active learning task inherits all task data from the corresponding (passive) classification task.

In addition, the following active learning specific task data is required as pre-defined experiment parameters:

- the annotation budget, i.e., the number of label acquisitions (cf. the budget parameter in the prototype below), and
- the specification of the optional small initial labelled training set.

Task Evaluation

For a single step in the active learning process, the active learning task inherits all metrics from the corresponding (passive) classification task, e.g., accuracy, ROC/AUC, and macro/balanced accuracy.

Active learning specific metrics, depending on the metric used in the passive classification task, are:

- the area under the active learning curve (AULC), or its average over the learning steps (e.g., the average accuracy or AUC over all label acquisitions), and
- the data utilisation rate, i.e., the number of label acquisitions needed to reach a target performance.

Experiment Data to Store

The metrics discussed above, i.e., the per-step (passive) metrics as well as the aggregated active learning metrics, are stored for each run.

In addition, analysing query strategy behaviour is facilitated by storing, for each active learning step:

- the instance selected for labelling at this step,
- the predictions for each instance in the test set,
- the performance metrics computed on the test set, and
- the computation time and memory requirements for instance selection, classifier training, and testing.
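As an illustration only (no fixed schema is implied), a per-step record could look like the following; all field names and values are hypothetical:

y_test_pred = [0, 1, 1, 0]  # placeholder predictions on the test set

step_record = {
    "step": 17,                          # index of the active learning step
    "queried_instance_index": 4211,      # instance selected for labelling at this step
    "test_predictions": y_test_pred,     # predictions for each instance in the test set
    "test_metrics": {"accuracy": 0.91, "auc": 0.95},
    "selection_time_s": 0.04,            # computation time for instance selection
    "training_time_s": 0.31,             # classifier (re)training time
    "testing_time_s": 0.02,              # evaluation time on the test set
    "peak_memory_mb": 128.0,             # memory requirement at this step
}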

Minimal Experimental Setup

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.datasets import make_blobs
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import unlabeled_indices, MISSING_LABEL
from skactiveml.classifier import SklearnClassifier
from skactiveml.visualization import plot_decision_boundary, plot_utilities
from sklearn.model_selection import train_test_split

# Generate data set.
X, y_true = make_blobs(n_samples=400, centers=4, random_state=0)

X_train, X_test, y_train_true, y_test = train_test_split(X, y_true, test_size=0.2, train_size=0.8, random_state=0)

# GaussianProcessClassifier needs initial training data, otherwise a warning will
# be raised by SklearnClassifier. Therefore, the first two instances are used as
# initial training data.
y_train = np.full(shape=y_train_true.shape, fill_value=MISSING_LABEL)
y_train[:2] = y_train_true[:2]

# Create classifier and query strategy.
clf = SklearnClassifier(GaussianProcessClassifier(random_state=0), classes=np.unique(y_true), random_state=0)
qs = UncertaintySampling(method='entropy')

# Execute active learning cycle.
n_cycles = 20
learning_curve = np.full(shape=n_cycles+1, fill_value=np.nan)
for c in range(n_cycles):
    clf.fit(X_train, y_train)
    learning_curve[c] = clf.score(X_test, y_test)
    query_idx = qs.query(X=X_train, y=y_train, clf=clf)
    y_train[query_idx] = y_train_true[query_idx]

# Fit final classifier.
clf.fit(X_train, y_train)
learning_curve[n_cycles] = clf.score(X_test, y_test)
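Continuing the snippet above (reusing its plt, np, and learning_curve), the learning curve could, for example, be plotted as follows:

# Plot test accuracy against the number of label acquisitions.
plt.plot(np.arange(len(learning_curve)), learning_curve, marker='o')
plt.xlabel('Number of label acquisitions')
plt.ylabel('Test accuracy')
plt.title('Active learning curve (uncertainty sampling)')
plt.show()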

The resulting learning curve looks like this: https://user-images.githubusercontent.com/4313789/216088473-852ae95a-8d6f-412f-8ed6-62b4907a4e5d.png

An example visualization of the query strategy's usefulness scores for a single time step looks like this: https://user-images.githubusercontent.com/4313789/216088499-afd2d162-21ff-4f6d-975c-f7963a59f4c0.png

Relation to Existing Task Types

The prototype can be found here, which also includes a notebook for testing. For convenience, we integrated a SkactivemlExtension into openml-python, analogous to the SklearnExtension.

Here is an example of its use:

from openml.tasks import OpenMLActiveClassificationTask, TaskType
from openml import tasks, runs
from skactiveml.classifier import ParzenWindowClassifier
from skactiveml.pool import RandomSampling

# Get a preexisting OpenMLSupervisedClassificationTask and
# convert it into an OpenMLActiveClassificationTask
task_id = 7555
task = tasks.get_task(task_id)

# Create the active learning task, which is a subclass of OpenMLSupervisedClassification
# with additional parameters (for now, only the annotation budget).
task = OpenMLActiveClassificationTask(
    task_type_id=TaskType.ACTIVE_CLASSIFICATION,
    task_type="ACTIVE_CLASSIFICATION",
    data_set_id=task.dataset_id,
    target_name=task.target_name,
    budget=100,
    task_id=task_id,
    class_labels=task.class_labels,
)

# Create a model consisting of:
# - a query strategy selecting samples for annotation,
# - a prediction model used for evaluation,
# and (optional) a selection model used as part of the query strategy.
model = {
    'query_strategy': RandomSampling(missing_label=None),
    'prediction_model': ParzenWindowClassifier(missing_label=None),
    'selector_model': ParzenWindowClassifier(missing_label=None),
}

# Automatically evaluate your model on the task
run = runs.run_model_on_task(model, task, upload_flow=False, n_jobs=-1, seed=0)

Open Questions / Comments

[proponents of this feature request: Marek Herde, Georg Krempl, Tuan Pham Minh]