harvardnlp / boxscore-data

Data used in "Challenges in Data-to-Document Generation" (Wiseman, Shieber, Rush; EMNLP 2017). If you use this data, please cite the above paper.

Update (9/3/20): Please consider using the SportSett:Basketball dataset rather than the standard Rotowire dataset described below. Among other things, SportSett:Basketball corrects some dataset contamination issues, where box- and line-scores appear in multiple splits.

Update (1/22/18): Thanks to @janenie for pointing out that some of the line-scores in the data (which report team-level stats) had the team names flipped. Player-level information was not affected. These examples have now been unflipped.

Data

This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores. Summaries taken from rotowire.com are referred to as the "rotowire" data, and summaries taken from sbnation.com (and associated team-specific sites) are referred to as the "sbnation" data; we treat these sub-datasets separately, since they are quite different.

To extract the data, run tar -jxvf rotowire.tar.bz2 to form a rotowire/ directory (and similarly for sbnation.tar.bz2).
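If you would rather extract the archives from Python than from the shell, the following is a minimal sketch using the standard-library tarfile module. The helper name extract_archive is hypothetical and not part of this repository; it assumes the archive sits in the current directory.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive into dest_dir and return its top-level entry names.

    Hypothetical helper for illustration; not part of this repository.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:bz2" reads a bzip2-compressed tar, like the -j flag of tar.
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    # Collect the first path component of every archive member.
    top_level = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if parts:
            top_level.add(parts[0])
    return sorted(top_level)
```

For example, extract_archive("rotowire.tar.bz2") should leave a rotowire/ directory next to the archive, and the same call with sbnation.tar.bz2 handles the other sub-dataset.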

Rotowire Data

The rotowire data can be found in rotowire/[train|valid|test].json. There are 4853 distinct rotowire summaries, covering NBA games played between 1/1/2014 and 3/29/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 3398, 727, and 728 summaries, respectively.

SBNation Data

The sbnation data can be found in sbnation/[train|valid|test].json. There are 10903 distinct sbnation summaries, covering NBA games played between 11/3/2006 and 3/26/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 7633, 1635, and 1635 summaries, respectively.

Data Format

Each file is UTF-8 encoded JSON and contains a list of JSON objects, one for each aligned summary/data pair. These JSON objects have the following fields:
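A quick way to see which fields a file actually contains is to load it and inspect the keys of the first object. The sketch below assumes only what is stated above (a UTF-8 JSON file holding a list of objects, named train.json, valid.json, or test.json inside the dataset directory). The function names load_split and field_names are hypothetical.

```python
import json
from pathlib import Path


def load_split(dataset_dir, split):
    """Load one split ("train", "valid", or "test") as a list of dicts.

    Hypothetical helper; dataset_dir is e.g. "rotowire" or "sbnation".
    """
    path = Path(dataset_dir) / f"{split}.json"
    with open(path, encoding="utf-8") as f:
        examples = json.load(f)
    # Each file is expected to hold a JSON list of summary/data pairs.
    if not isinstance(examples, list):
        raise ValueError(f"expected a JSON list in {path}")
    return examples


def field_names(examples):
    """Return the sorted keys of the first example, or [] for an empty list."""
    return sorted(examples[0].keys()) if examples else []
```

Calling field_names(load_split("rotowire", "valid")) lists the top-level fields, and len(load_split("rotowire", "train")) can be compared against the split sizes given above.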

Line-score Objects

Line-score objects have the following fields:

Box-score Objects

Box-score objects contain (column) objects mapping row numbers to values. Rows are numbered from 0 to at most 25, and each row corresponds to a player in the game. In particular, a box-score object contains the following column objects:
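Because the box-score is stored column-wise, it is often convenient to pivot it into one record per player. The following is a sketch under two assumptions: row numbers arrive as string keys (JSON object keys are always strings), and a column may lack an entry for some rows. The function name box_score_rows and the column names in the usage note are illustrative only.

```python
def box_score_rows(box_score):
    """Pivot {column: {row_id: value}} into a list of per-player dicts.

    Rows are returned in numeric row order. A column with no entry for a
    given row yields None for that row. Hypothetical helper for illustration.
    """
    # Gather every row id that appears in any column.
    row_ids = set()
    for column in box_score.values():
        row_ids.update(column.keys())
    rows = []
    # Sort numerically so that "10" comes after "2".
    for row_id in sorted(row_ids, key=int):
        rows.append({name: column.get(row_id) for name, column in box_score.items()})
    return rows
```

For instance, with a made-up input such as {"NAME": {"0": "A", "1": "B"}, "PTS": {"0": 12, "1": 7}}, the function returns one dict per player, each holding that player's NAME and PTS values.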

Preprocessing Details

Box- and Line-scores

All numeric values in the box- and line-scores have been converted to integers, by rounding if necessary. (So percentages are given as integers between 0 and 100.)
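As an illustration of this kind of conversion, here is a small sketch. The exact rounding rule used to build the dataset is not specified here, so the round-half-up behavior below is an assumption, as is the helper name to_int_stat.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_int_stat(value):
    """Convert a numeric value (int, float, or numeric string) to an int.

    Assumption: ties round away from zero (ROUND_HALF_UP). The dataset's
    actual rounding rule is not documented here.
    """
    # Going through str() avoids binary floating-point artifacts in Decimal.
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Under this rule a shooting percentage of 44.5 becomes 45, whereas Python's built-in round would give 44 because it rounds ties to the nearest even integer.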

Summaries

Summaries are tokenized using nltk, and hyphenated phrases are separated. Tweets and photos were removed from the sbnation summaries, as were any paragraphs that did not contain at least 2 numbers (in either numeric or verbal form).
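The paragraph filter can be approximated as follows. This is a rough sketch, not the script used to build the dataset: it uses a regular expression rather than nltk, and the NUMBER_WORDS list is a hypothetical stand-in, since the set of verbal number forms actually recognized is not given.

```python
import re

# Hypothetical, deliberately small list of verbal number forms.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety", "hundred",
}

# Matches either an alphabetic word or a (possibly decimal) number.
TOKEN_RE = re.compile(r"[a-z]+|\d+(?:\.\d+)?")


def count_numbers(paragraph):
    """Count tokens that are digit strings or known number words.

    Note that a compound such as "twenty-five" counts as two numbers here.
    """
    tokens = TOKEN_RE.findall(paragraph.lower())
    return sum(1 for tok in tokens if tok[0].isdigit() or tok in NUMBER_WORDS)


def keep_paragraph(paragraph, min_numbers=2):
    """Return True if the paragraph mentions at least min_numbers numbers."""
    return count_numbers(paragraph) >= min_numbers
```

With this sketch, a paragraph such as "The Hawks won 105 - 98 ." is kept, while "He scored two points ." is dropped because it contains only one number.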