Active Learning Yields Poor Results in Multi-Label Task

I am using modAL for an active learning project in multi-label classification. My implementation is in PyTorch, and I use DinoV2 as the backbone model. For the same dataset, I apply both active learning (using minimum confidence and average confidence strategies) and random sampling. I select the same number of samples in both strategies, but the results from random sampling are significantly better than those from the active learning approach. I would like to know if this discrepancy might be due to an issue with my code or the modAL library's handling of multi-label classification. Below is my active learning loop:

for i in range(n_queries):
    if i == 12:
        n_instances = X_pool.shape[0]
    else:
        n_instances = batch(int(np.ceil(np.power(10, POWER))), BATCH_SIZE)

    print(f"\nQuery {i + 1}: Requesting {n_instances} samples from a pool of size {X_pool.shape[0]}")

    if X_pool.shape[0] < n_instances:
        print("Not enough samples left in the pool to query the desired number of instances.")
        break

    query_idx, _ = learner.query(X_pool, n_instances=n_instances)
    query_idx = np.unique(query_idx)

    if len(query_idx) == 0:
        print("No indices were selected, which may indicate an issue with the query function or pool.")
        continue

    # Add the newly selected samples to the cumulative training set
    cumulative_X_train.append(X_pool[query_idx])
    cumulative_y_train.append(y_pool[query_idx])

    # Concatenate all the samples to form the cumulative training data
    X_train_cumulative = np.concatenate(cumulative_X_train, axis=0)
    y_train_cumulative = np.concatenate(cumulative_y_train, axis=0)

    learner.teach(X_train_cumulative, y_train_cumulative)

    # Log the selected sample names
    selected_sample_names = train_df.loc[query_idx, "image"].tolist()
    print(f"Selected samples in Query {i + 1}: {selected_sample_names}")
    with open(samples_log_file, mode='a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([i + 1] + selected_sample_names)

    # Remove the selected samples from the pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

    # Evaluate the model
    y_pred = learner.predict(X_test_np)
    accuracy = accuracy_score(y_test_np, y_pred)
    f1 = f1_score(y_test_np, y_pred, average='macro')
    acc_test_data.append(accuracy)
    f1_test_data.append(f1)
    print(f"Accuracy after query {i + 1}: {accuracy}")
    print(f"F1 Score after query {i + 1}: {f1}")

    # Early stopping logic
    if f1 > best_f1_score:
        best_f1_score = f1
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f"Stopping early after {i + 1} queries due to no improvement in F1 score.")
            break

    total_samples += len(query_idx)
    print(f"Total samples used for training after query {i + 1}: {total_samples}")
    POWER += 0.25
    torch.cuda.empty_cache()

modAL-python / modAL

Active Learning Yields Poor Results in Multi-Label Task #191