huangmeng123 / lit_char_data_wayback

LiSCU Dataset

LiSCU is a dataset of literary pieces and their summaries, paired with descriptions of the characters that appear in them. It was introduced in the paper "Let Your Characters Tell Their Story": A Dataset for Character-Centric Narrative Understanding. The data were collected from online study guides such as Shmoop, SparkNotes, CliffsNotes, and LitCharts.

This repo provides a way to reproduce the dataset from the Wayback Machine, a time-stamped digital archive of the web.
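To illustrate what "time-stamped" means here: the Wayback Machine addresses each capture of a page by a 14-digit timestamp followed by the original URL. The snippet below is a minimal sketch of that addressing scheme only. It is not taken from this repo's scraper, and the function name wayback_snapshot_url and the example page are hypothetical.

```python
from datetime import datetime


def wayback_snapshot_url(original_url: str, captured_at: datetime) -> str:
    """Build the public Wayback Machine address for one capture of a page.

    Hypothetical helper for illustration; it is not part of this repo.
    """
    # Wayback timestamps are formatted as YYYYMMDDhhmmss.
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


# The page and the date are made up for the illustration.
print(wayback_snapshot_url("https://www.example.com/", datetime(2020, 1, 15, 12, 0, 0)))
```

Pinning a timestamp is what lets a scrape be repeated later against the same version of a page, rather than against whatever the live site serves today.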

Requirements

Generating the dataset

First, generate the run script with the following command.

python generate_run_script.py -o <output_dir> -i <database_name> --user <database_username> --password <database_user_password> [--skip_scraping]

output_dir is the path to the directory the dataset will be exported to. database_name is the database that will be created to store the data scraped from the Wayback Machine. database_username and database_user_password are the username and password of the database user you prepared for the reproduction process.

Note: If --skip_scraping is set, the generated script runs only the steps that come after scraping. Use this flag only when the scraping process has already run to completion.
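As a sketch of how the arguments fit together, the snippet below assembles the command from Python. It assumes that -o takes the output directory and -i takes the database name, following the order of the parameter descriptions above. The helper name build_generate_command and every value passed to it are placeholders, not part of this repo.

```python
import shlex


def build_generate_command(output_dir, database_name, user, password, skip_scraping=False):
    """Assemble the argument list for generate_run_script.py.

    Assumption: -o takes the output directory and -i the database name,
    following the order of the parameter descriptions in this README.
    """
    command = [
        "python", "generate_run_script.py",
        "-o", output_dir,
        "-i", database_name,
        "--user", user,
        "--password", password,
    ]
    if skip_scraping:
        # Only appropriate once a full scraping run has completed.
        command.append("--skip_scraping")
    return command


# Placeholder values; substitute your own before running anything.
command = build_generate_command(
    "./liscu_out", "liscu_db", "liscu_user", "change-me", skip_scraping=True
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) from the repo root. Passing a list rather than a single string avoids shell quoting problems if the password contains special characters.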

Then, run the generated script to start the reproduction process.

./run.sh

The process may take a few hours to finish. Once it does, all recreated datasets are in output_dir. Specifically, there will be 4 jsonl files in output_dir: