michaelmilleryoder / fanfiction-nlp

An NLP pipeline for characters in fanfiction. Developed by students at Carnegie Mellon University from 2019 to 2021.
GNU General Public License v3.0

fanfiction-nlp

An NLP pipeline for extracting information related to characters in fanfiction in English. For each input fanfiction story (or other document), the pipeline produces a list of characters. For each character, the pipeline produces:

More information on the pipeline is available in the paper here. If you use it academically, please cite this work:

Michael Miller Yoder, Sopan Khosla, Qinlan Shen, Aakanksha Naik, Huiming Jin, Hariharan Muralidharan, and Carolyn P Rosé. 2021. FanfictionNLP: A Text Processing Pipeline for Fanfiction. In Proceedings of the 3rd Workshop on Narrative Understanding, pages 13–23.

Contact Michael Miller Yoder <mmyoder [at] pitt.edu> with any questions.

Running the pipeline

This pipeline processes a directory of fanfiction files and extracts text that is relevant for each character.

The pipeline does:

Requirements

The pipeline is written in Python 3. Dependencies are listed below. Sorry about there being so many! We are planning on trimming this down.

A conda environment file that lists these dependencies with tested version numbers is at environment.yml. A new environment with these dependencies can be created with conda env create -n fanfiction-nlp --file environment.yml and then activated with conda activate fanfiction-nlp.

Some additional data and model files are also required:

To run the SpanBERT-based coreference, a model file is required that is 534 MB, unfortunately too big for GitHub's file size limit. That file is available from https://cmu.box.com/s/leg9pkato6gtv9afg6e7tz9auwya2h3n. Please download it and place it in a new directory called model in the spanbert_coref directory.

Run a test

To test that everything is set up properly, run python run.py example.cfg, which by default will run the pipeline on a test story in the example_fandom directory. This will take ~2 GB of RAM to run. The output should be placed in a new directory, output/example_fandom. This output should be the same as that provided in output_test/example_fandom.
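One way to check that the test output matches the reference output is to compare the two directories file by file. The following is a minimal sketch, not part of the pipeline: the function name differing_files is hypothetical, and it assumes the output files are expected to be byte-for-byte identical, which may be too strict if any output contains timestamps or floating-point values.

```python
import filecmp
from pathlib import Path
from typing import List


def differing_files(produced: str, expected: str) -> List[str]:
    """Return a description of every file that differs or exists on one side only."""
    problems: List[str] = []

    def walk(cmp: filecmp.dircmp, prefix: Path) -> None:
        for name in cmp.left_only:
            problems.append(f"only in produced: {prefix / name}")
        for name in cmp.right_only:
            problems.append(f"only in expected: {prefix / name}")
        # shallow=False forces a comparison of file contents, not just os.stat data
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            problems.append(f"differs: {prefix / name}")
        # Recurse into subdirectories present on both sides
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(produced, expected), Path())
    return problems


if __name__ == "__main__":
    # Directory names taken from the test instructions above
    produced, expected = "output/example_fandom", "output_test/example_fandom"
    if Path(produced).is_dir() and Path(expected).is_dir():
        for line in differing_files(produced, expected) or ["outputs match"]:
            print(line)
    else:
        print("Run the test first: one of the two directories is missing.")
```

An empty result means every file in output/example_fandom has an identical counterpart in output_test/example_fandom.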

Input

The path to a directory of fanfiction story CSV files.

If your input is raw text, you'll need to format it like the examples in the example_fandom directory. Eventually we'll support raw text file input. The columns needed in the input are: fic_id, chapter_id, para_id, text, text_tokenized

Please tokenize the text (split it into words) before running it through the pipeline, and include the result as the final column, text_tokenized. We are working on adding tokenization as a pipeline option. A script, tokenize_fics.py, is included for convenience, though it will require modification to work with your input.
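To illustrate the expected column layout, here is a rough sketch of turning one raw-text story into a CSV with those five columns. It is not the included tokenize_fics.py script. Several things here are assumptions: paragraphs are separated by blank lines, IDs are numbered from 1, text_tokenized holds tokens joined by single spaces, and the regex tokenizer is only a stand-in for a real tokenizer. Check the files in example_fandom for the exact conventions.

```python
import csv
import re
from typing import Dict, List

# Column names as listed above
COLUMNS = ["fic_id", "chapter_id", "para_id", "text", "text_tokenized"]


def simple_tokenize(text: str) -> str:
    """Stand-in tokenizer (assumption): split off punctuation, join tokens with spaces."""
    return " ".join(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))


def raw_text_to_rows(raw: str, fic_id: str, chapter_id: int = 1) -> List[Dict[str, object]]:
    """Split raw text into paragraphs on blank lines and build one row per paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw) if p.strip()]
    rows: List[Dict[str, object]] = []
    for para_id, para in enumerate(paragraphs, start=1):
        text = " ".join(para.split())  # collapse line wraps inside a paragraph
        rows.append({
            "fic_id": fic_id,
            "chapter_id": chapter_id,
            "para_id": para_id,
            "text": text,
            "text_tokenized": simple_tokenize(text),
        })
    return rows


def write_fic_csv(rows: List[Dict[str, object]], out_path: str) -> None:
    """Write the rows to a CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_fic_csv(raw_text_to_rows(open("story.txt", encoding="utf-8").read(), fic_id="1"), "my_fandom/1.csv") would write one story, where story.txt and my_fandom/1.csv are placeholder paths.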

The pipeline uses quite a bit of RAM, mostly depending on the length of the input. Running it on stories longer than 5,000 words is not recommended: a 5,000-word story can use ~20 GB of RAM.
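Before a long run, it may help to find stories that exceed that length. The sketch below counts words per input file. It assumes one story per CSV file and that text_tokenized holds whitespace-separated tokens. The function names and the WORD_LIMIT constant are hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path
from typing import Dict, List

WORD_LIMIT = 5000  # the recommended maximum story length mentioned above


def story_word_counts(input_dir: str) -> Dict[str, int]:
    """Map each CSV file name in input_dir to its total token count."""
    counts: Dict[str, int] = {}
    for csv_path in sorted(Path(input_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            # Assumption: tokens in text_tokenized are separated by whitespace
            counts[csv_path.name] = sum(
                len(row["text_tokenized"].split()) for row in csv.DictReader(f)
            )
    return counts


def over_limit(counts: Dict[str, int], limit: int = WORD_LIMIT) -> List[str]:
    """Return the names of stories with more tokens than the limit."""
    return sorted(name for name, n in counts.items() if n > limit)
```

Calling over_limit(story_word_counts("example_fandom")) lists the files that are likely to need a lot of memory.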

Output

Settings

The pipeline takes settings and input/output filepaths in a configuration file. An example config file is example.cfg. Descriptions of each configuration setting by section are as follows:

[Input/output]

collection_name: the name of the dataset (user-defined)

input_path: path to the directory of input files

output_path: path to the directory where processed files will be stored

[Character coreference]

run_coref: (True or False) Whether to run character coreference.

n_threads: (integer) The number of threads (actually processes) used to run coreference

[Quote attribution]

run_quote_attribution: (True or False) Whether to run quote attribution.

n_threads: (integer) The number of threads (actually processes) used to run quote attribution

[Assertion extraction]

run_assertion_extraction: (True or False) Whether to run assertion extraction.

n_threads: (integer) The number of threads (actually processes) used to run assertion extraction
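As an illustration of how these sections fit together, the sketch below builds a configuration in the layout described above and reads it with Python's standard configparser module. The section and setting names come from the descriptions above. The values, the use of = as the delimiter, and the load_settings helper are assumptions for illustration. This is not necessarily how run.py reads the file, and example.cfg remains the authoritative template.

```python
import configparser
from typing import Dict

# Illustrative values only; see example.cfg for the real defaults
EXAMPLE_CFG = """
[Input/output]
collection_name = example_fandom
input_path = example_fandom
output_path = output/example_fandom

[Character coreference]
run_coref = True
n_threads = 1

[Quote attribution]
run_quote_attribution = True
n_threads = 1

[Assertion extraction]
run_assertion_extraction = False
n_threads = 1
"""


def load_settings(cfg_text: str) -> Dict[str, object]:
    """Parse the config text and return typed settings (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "collection_name": parser.get("Input/output", "collection_name"),
        "input_path": parser.get("Input/output", "input_path"),
        "output_path": parser.get("Input/output", "output_path"),
        # n_threads is set once per section, so each stage reads its own value
        "run_coref": parser.getboolean("Character coreference", "run_coref"),
        "coref_processes": parser.getint("Character coreference", "n_threads"),
        "run_quotes": parser.getboolean("Quote attribution", "run_quote_attribution"),
        "quote_processes": parser.getint("Quote attribution", "n_threads"),
        "run_assertions": parser.getboolean("Assertion extraction", "run_assertion_extraction"),
        "assertion_processes": parser.getint("Assertion extraction", "n_threads"),
    }


if __name__ == "__main__":
    for key, value in load_settings(EXAMPLE_CFG).items():
        print(f"{key}: {value}")
```

Because n_threads appears in three sections, each stage can be given a different number of processes.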

Command

python run.py <config_file_path>

Notes

This pipeline was inspired by David Bamman's BookNLP.