Jeopardy
https://www.kaggle.com/tunguz/200000-jeopardy-questions
Interesting dataset because real Jeopardy contestants have used domain-switching as a strategy to outmaneuver other players.
Does represent something of a natural (although admittedly niche) distribution of trivia questions, since it was not specifically created for an NLP task.
Data needs to be cleaned up a little though (remove HTML tags around some questions)
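A minimal sketch of the cleanup step, assuming each question is available as a plain Python string. The function name `clean_question` is made up for this note, and the regex only removes things that look like tags (a `<` followed by a letter), so stray comparison signs in a question are left alone. Tags are deleted rather than replaced with a space, which keeps linked words intact but will join words separated only by a tag.

```python
import html
import re

# Matches opening and closing tags such as <a href="..."> or </i>.
# Requires a letter right after "<" (or "</") so that text like "x < y" is not treated as a tag.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

def clean_question(text: str) -> str:
    """Remove HTML tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub("", text)
    unescaped = html.unescape(no_tags)
    return " ".join(unescaped.split())
```

For example, `clean_question('<i>Some</i> text')` returns `'Some text'`.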
X: question (string)
Y: answer (string)
Domains:
Category
Value ($200, $400, …, $2000)
Domain Shifts:
Covariate Shift: P(answer|question) doesn’t change but P(question) does
Higher-value questions tend to be more difficult
Different categories have different questions
Label Shift: P(question|answer) doesn’t change but P(answer) does
N/A here (since P(question|answer) changes as well)
Concept Shift: P(answer|question) changes
Many of the questions have factual answers that change over time (e.g., who is the US president)
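The domain splits described above could be sketched as follows. This is an illustration under assumptions, not a fixed recipe: the field names `category`, `value`, `question`, `answer`, the helper names `parse_value` and `split_by_domain`, and the $1000 threshold are all hypothetical, and the toy records are invented rather than taken from the dataset.

```python
def parse_value(raw):
    """Turn a string such as "$1,200" into 1200; return None if it contains no digits."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return int(digits) if digits else None

def split_by_domain(records, domain_of, target_domains):
    """Send records whose domain is in target_domains to the target split, the rest to source."""
    source, target = [], []
    for rec in records:
        (target if domain_of(rec) in target_domains else source).append(rec)
    return source, target

# Toy records invented for illustration; they are not rows from the dataset.
records = [
    {"category": "SCIENCE", "value": "$200", "question": "q1", "answer": "a1"},
    {"category": "SCIENCE", "value": "$1,000", "question": "q2", "answer": "a2"},
    {"category": "HISTORY", "value": "$400", "question": "q3", "answer": "a3"},
    {"category": "HISTORY", "value": "$2,000", "question": "q4", "answer": "a4"},
]

# Shift by category: train on SCIENCE, evaluate on HISTORY.
src_cat, tgt_cat = split_by_domain(records, lambda r: r["category"], {"HISTORY"})

# Shift by value: train on lower-value questions, evaluate on those worth $1000 or more.
parsed = (parse_value(r["value"]) for r in records)
high_values = {v for v in parsed if v is not None and v >= 1000}
src_val, tgt_val = split_by_domain(
    records, lambda r: parse_value(r["value"]), high_values
)
```

Training on the source split and evaluating on the target split then gives a rough measure of how much a model degrades under each kind of shift.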