ntucllab / libact

Pool-based active learning in Python
http://libact.readthedocs.org/
BSD 2-Clause "Simplified" License

Allow make_query to return multiple items (or the entire scored set) #57

Open ecstar opened 8 years ago

ecstar commented 8 years ago

In certain applications, you might want to know what the top N unlabelled entities are so that a human can go through and do batch labeling offline. Right now I have a particularly hacky way of getting multiple results out, which just assumes the majority class on each update, but it would be great to tweak make_query to return an arbitrary number of ordered results for batch label processing.

```python
for i in range(20):
    item_to_investigate = qs.make_query()
    libact_ds.update(item_to_investigate, 0)  # hack: assume the majority class (0) for now
    print(item_to_investigate)
```

Happy to contribute code to try to help this happen!

yangarbiter commented 8 years ago

Sure, it would be a nice enhancement to add in the future. The reason we didn't design an interface for batch labeling in the first place is that the algorithms we are implementing are not designed for that kind of setting. Maybe in the future we can start to include some batch-mode active learning algorithms.

Thanks.

wadkar commented 7 years ago

+1 for batch query - this could be a game-changer in cases where crowdworkers are involved in labeling. Many, if not most, MTurk tasks are batch-oriented.

However, after an initial reading of the AAAI15 ALBL paper, I see what @yangarbiter meant by "algorithms [are not] designed under [batch labeling] setting." If I understand the paper correctly, the underlying algorithms model the multi-armed bandit problem (actually a contextual bandit), and this formulation restricts the sampling step to choosing a single unlabeled instance (i.e. the gambler chooses one arm per round).
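To make the single-arm restriction concrete, here is a tiny self-contained toy (the names and the EXP3-style weight update are made up for illustration; this is not libact's ALBL implementation): each round the gambler pulls exactly one arm, and that arm would in turn select exactly one instance to query.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_rounds = 3, 10          # arms = query strategies
weights = np.ones(n_arms)         # EXP3-style weights (illustrative only)

for t in range(n_rounds):
    probs = weights / weights.sum()
    arm = rng.choice(n_arms, p=probs)   # pull exactly ONE arm this round...
    # ...and that strategy would return exactly ONE instance to label here.
    reward = rng.random()               # stand-in for the observed reward
    weights[arm] *= np.exp(0.1 * reward / probs[arm])  # importance-weighted update
```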

This raises the question of other settings for the multi-armed bandit problem in which the bandit can choose multiple arms at a time. Given this framing of the problem, I wonder if anyone has considered such a variation of the formulation. If not, it would be very interesting to explore one, paired with a simple crowdworker-backed labeling task.

@ecstar: if it's not too much trouble, would you mind sharing your hack?

yangarbiter commented 7 years ago

Most of the query strategies are based on calculating a score for each instance and querying the instance with the largest score, so in theory a batch query could be done by selecting the top n scored instances. However, I'm not sure that would be a good way to do batch querying.
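As a rough illustration of what that could look like (a minimal sketch only: the helper name top_n_uncertain and the least-confidence scoring are my own choices, not part of libact's API; the probabilities would come from whatever model backs the strategy, e.g. a probabilistic model's predicted class probabilities over the unlabeled pool):

```python
import numpy as np

def top_n_uncertain(pool_ids, probs, n=20):
    """Return the n entry ids the model is least confident about.

    `pool_ids` are the dataset entry ids of the unlabeled pool and `probs`
    the corresponding predicted class probabilities (one row per instance).
    Both the function name and the scoring rule are illustrative only.
    """
    scores = 1.0 - probs.max(axis=1)            # least-confidence score
    order = np.argsort(scores)[::-1]            # most uncertain first
    return [pool_ids[i] for i in order[:n]]     # ordered batch to hand to labelers
```

One common concern with this naive approach is that the scores are not recomputed between the n picks, so the selected batch can be quite redundant - which is presumably part of why it may not be a good way to do batch querying.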