nextml / caption-contest-data-api

An API to access data from The New Yorker Caption Contest
https://nextml.github.io/caption-contest-data-api/

This repository contains data gathered by the NEXT active machine learning system for the New Yorker caption contest. Users see questions similar to the ones below, and we provide the details of each question, the user's answer, and related information:

New Yorker hosting: Cardinal bandits
"How funny is this caption?": Cardinal bandits
"Which caption is funnier?": Dueling bandits

As of contest 652, published on 2019-03-03, we provide over 89 million ratings of more than 750,000 unique captions across 155 contests.

For each response, we record

The data for each individual experiment is in the contests/ directory. For example, caption contest 505 lives in contests/random+adaptive/505/. More description and detail are given in each contest's folder.
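As a minimal sketch of how the directory layout above could be navigated, the following assumes a local clone of the repository. The function names and the `repo_root` and `scheme` parameters are hypothetical and used only for illustration; the files found inside each folder are described in that folder, not here.

```python
from pathlib import Path


def contest_dir(repo_root, contest, scheme="random+adaptive"):
    """Build the path where one contest's data is expected to live.

    `scheme` defaults to the subdirectory used in the contest 505 example;
    other contests may sit under a different subdirectory.
    """
    return Path(repo_root) / "contests" / scheme / str(contest)


def list_contest_files(repo_root, contest, scheme="random+adaptive"):
    """Return the sorted file names in a contest folder, or [] if it is missing."""
    folder = contest_dir(repo_root, contest, scheme)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())
```

For example, `list_contest_files(".", 505)` run from the root of a clone would list whatever files that contest folder contains.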

Dataset high-level description

These datasets are part of the "cartoon caption contest", in which users are given a cartoon and asked to write a funny caption. Our algorithms help determine the funniest captions (and we also provide unbiased data).

We ask users to rate caption funniness in some way (more detail is provided in each folder). This could be questions like "Is the funnier caption on the left or right?" or "How funny is this comic -- 'unfunny', 'somewhat funny' or 'funny'?"

Each folder is named after the caption contest its data comes from.
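To illustrate how the three-level ratings described above might be summarized, here is a small sketch that averages ratings per caption. The numeric scale (1 to 3) is an assumption made for this example, as is the `(caption, label)` input format; check each contest folder for the actual encoding before relying on it.

```python
from collections import defaultdict

# Assumed numeric scale: the rating labels come from the description above,
# but the scores assigned to them here are illustrative only.
RATING_SCORES = {"unfunny": 1, "somewhat funny": 2, "funny": 3}


def mean_scores(responses):
    """Average the assumed score for each caption.

    `responses` is an iterable of (caption, rating_label) pairs,
    a hypothetical format chosen for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # caption -> [score sum, rating count]
    for caption, label in responses:
        totals[caption][0] += RATING_SCORES[label]
        totals[caption][1] += 1
    return {caption: s / n for caption, (s, n) in totals.items()}
```

Because some data is collected adaptively, captions can receive very different numbers of ratings, so the count behind each mean is worth keeping alongside it.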

Data provided

We provide response data for each user: the details of what each query consisted of, the user's answer, and other related information.

We provide

Response data, CSV headers

Dueling bandits headers

Queries of the form "which caption is funnier?"

Cardinal bandits headers

Queries of the form "how funny is this caption?"

Datasets provided

Collection schemes

We generate queries in one of two ways: randomly or adaptively. The random method does not use previous responses when choosing which query to ask next. The adaptive method does use previous responses to decide which queries to ask.

We consider the randomly collected data to be unbiased and the adaptively collected data to be biased. The random data is collected with the "RandomSampling" and "RoundRobin" algorithms, while the adaptive data is collected with the "LilUCB" algorithm.
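The distinction above matters when analyzing responses, so a sketch of separating the two kinds of data may help. The algorithm names are the ones listed above; the `algorithm` key and the dictionary row format are hypothetical, since the actual CSV headers are documented in each contest folder.

```python
# Algorithm names as given in the collection scheme description above.
UNBIASED_ALGORITHMS = {"RandomSampling", "RoundRobin"}
ADAPTIVE_ALGORITHMS = {"LilUCB"}


def collection_scheme(algorithm):
    """Classify an algorithm name as 'random', 'adaptive', or 'unknown'."""
    if algorithm in UNBIASED_ALGORITHMS:
        return "random"
    if algorithm in ADAPTIVE_ALGORITHMS:
        return "adaptive"
    return "unknown"


def unbiased_only(rows, key="algorithm"):
    """Keep only rows collected by a random (unbiased) algorithm.

    `rows` is a list of dicts, for example as produced by csv.DictReader;
    `key` is a hypothetical column name holding the algorithm label.
    """
    return [row for row in rows if collection_scheme(row.get(key)) == "random"]
```

Filtering this way is one option when an analysis assumes that every caption had the same chance of being shown.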

Credit

This dataset would not be possible without The New Yorker and their Caption Contest, as well as The New Yorker cartoonists. More of their work is available at Condé Nast: https://condenaststore.com/conde-nast-brand/cartoonbank

Licenses

Example queries

In earlier experiments, sample queries are also shown (except for contest 509). Other experiments specify whether the experiment setup was "Cardinal" or "Dueling". Example queries for these two setups are below:

The user is given a URL and, after visiting it, is presented with a series of queries similar to the one above. The query may vary between experiments; the variations are described in detail for each experiment.