This repository contains data gathered with the NEXT active machine learning system for the New Yorker caption contest. Users see a question similar to the ones below, and we provide the details of each question, the user's answer, and related information:
New Yorker hosting | "How funny is this caption?" | "Which caption is funnier?" |
---|---|---|
Cardinal bandits | Cardinal bandits | Dueling bandits |
As of contest 652, published on 2019-03-03, we provide over 89 million ratings of over 750,000 unique captions across 155 contests.
For each response, we record
The data for an individual experiment is in the `contests/` directory. For example, caption contest 505 lives in `contests/random+adaptive/505/`. More description and detail is given in the individual folder.
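As a minimal sketch of the layout described above, the helpers below build the path to one contest folder and list the contests found on disk. Only `contests/random+adaptive/505/` comes from this README; the function names, the `collection` parameter, and the assumption that every other contest sits at `contests/<collection>/<number>/` are hypothetical.

```python
from pathlib import Path


def contest_dir(contest, collection="random+adaptive", root="contests"):
    """Build the path to one contest folder, e.g. contests/random+adaptive/505."""
    return Path(root) / collection / str(contest)


def list_contests(root="contests"):
    """Return the sorted contest numbers of all numeric folders under root/*/.

    Assumes (hypothetically) that every contest lives two levels below root.
    Returns an empty list if root does not exist.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        int(p.name)
        for p in root.glob("*/*")
        if p.is_dir() and p.name.isdigit()
    )


if __name__ == "__main__":
    print(contest_dir(505).as_posix())
    print(list_contests())
```

Run from the repository root, this prints the path for contest 505 followed by whatever contest numbers are present locally.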
These datasets are part of the "cartoon caption contest", where users are given a cartoon and asked to write a funny caption. Our algorithms help determine the funniest captions (and we also provide unbiased data).
We ask users to rate caption funniness in some way (more detail is provided in each folder). This could be questions like "Is the funnier caption on the left or right?" or "How funny is this comic -- 'unfunny', 'somewhat funny' or 'funny'?"
More detail is given in each folder. Each folder is named after the caption contest its data comes from.
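To illustrate how ratings of the "unfunny / somewhat funny / funny" kind could be turned into a ranking, here is a small sketch that averages a numeric score per caption. The 1/2/3 encoding, the `(caption, label)` input format, and the function names are assumptions made for illustration; the data in each folder may encode ratings differently.

```python
from collections import defaultdict

# Assumed numeric encoding of the three rating labels (not taken from the data).
SCORES = {"unfunny": 1, "somewhat funny": 2, "funny": 3}


def mean_scores(ratings):
    """Map each caption to its mean score.

    ratings: iterable of (caption, label) pairs, where label is a key of SCORES.
    """
    totals = defaultdict(lambda: [0, 0])  # caption -> [score sum, rating count]
    for caption, label in ratings:
        totals[caption][0] += SCORES[label]
        totals[caption][1] += 1
    return {caption: s / n for caption, (s, n) in totals.items()}


def rank_captions(ratings):
    """Return captions ordered from highest to lowest mean score."""
    means = mean_scores(ratings)
    return sorted(means, key=means.get, reverse=True)


if __name__ == "__main__":
    example = [
        ("caption A", "funny"),
        ("caption A", "somewhat funny"),
        ("caption B", "unfunny"),
    ]
    print(mean_scores(example))
    print(rank_captions(example))
```

A plain mean is the simplest summary; it is only meaningful as an unbiased estimate when the ratings were collected without adapting to earlier responses.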
We provide response data for each user: the details of what each query consisted of, the user's answer, and other related information.
We provide the script `get_responses_from_next/json_parse.py` and responses to two kinds of queries:

- Queries of the form "which caption is funnier?"
- Queries of the form "how funny is this caption?"

The responses are stored in a `.json` file. The script to parse this file is included in this repo (`json_parse.py`).

We generate queries in two ways, adaptively or randomly. Random sampling does not use previous responses to choose which queries to ask; adaptive sampling does.
We consider the randomly collected data to be unbiased and the adaptively collected data to be biased. The random data is collected by the "RandomSampling" and "RoundRobin" algorithms, while the adaptive data is collected by the "LilUCB" algorithm.
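The split between unbiased and adaptive data can be sketched as a filter on the algorithm name attached to each response. The algorithm names come from the paragraph above; the `alg_label` key and the list-of-dicts response format are hypothetical, so check the parsed output in each contest folder for the actual field name.

```python
# Algorithm names as given in this README.
UNBIASED_ALGORITHMS = {"RandomSampling", "RoundRobin"}
ADAPTIVE_ALGORITHMS = {"LilUCB"}


def split_by_sampling(responses, key="alg_label"):
    """Split responses into (unbiased, adaptive) lists by algorithm name.

    responses: iterable of dicts; `key` (a hypothetical field name) holds the
    name of the algorithm that generated the query.
    Raises ValueError for an algorithm name not listed in this README.
    """
    unbiased, adaptive = [], []
    for response in responses:
        alg = response[key]
        if alg in UNBIASED_ALGORITHMS:
            unbiased.append(response)
        elif alg in ADAPTIVE_ALGORITHMS:
            adaptive.append(response)
        else:
            raise ValueError(f"unknown algorithm: {alg!r}")
    return unbiased, adaptive


if __name__ == "__main__":
    example = [
        {"alg_label": "RandomSampling", "caption": "A"},
        {"alg_label": "LilUCB", "caption": "B"},
        {"alg_label": "RoundRobin", "caption": "C"},
    ]
    unbiased, adaptive = split_by_sampling(example)
    print(len(unbiased), len(adaptive))
```

Raising on unknown names, rather than silently dropping them, keeps an unexpected algorithm label from being counted as unbiased by mistake.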
This dataset would not be possible without The New Yorker and their Caption Contest, as well as The New Yorker cartoonists. More of their work is available at Condé Nast: https://condenaststore.com/conde-nast-brand/cartoonbank
`caption_contest_data` is licensed by `SOFTWARE_LICENSE.txt`; see also `LICENSE.txt`.

In earlier experiments, sample queries are also shown (except for contest 509). Other experiments specify whether the experiment setup was "Cardinal" or "Dueling". Examples from these two setups are below:
The user is given a URL and, after visiting it, is presented with a series of queries similar to the one above. The query may vary between experiments; the variations are described in detail for each experiment.