google-research-datasets / vrdu

We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schemas with diverse data types, complex templates, and diversity of layouts within a single document type.

Benchmark for extraction tasks on visually rich documents.

Data and Tasks

Our paper A Benchmark for Structured Extractions from Complex Documents can be found at https://arxiv.org/abs/2211.15421.

The dataset consists of two corpora: VRDU-Registration Forms (aka FARA) and VRDU-Ad-buy Forms (aka DeepForm). VRDU-Registration Forms consists of public documents downloaded from the US Department of Justice, while VRDU-Ad-buy Forms consists of public documents from FCC PublicFiles. VRDU-Registration Forms is the simpler of the two, containing fewer fields, only three distinct templates, and only simple fields. VRDU-Ad-buy Forms, on the other hand, has more than a dozen fields, dozens of templates (distinct layouts), and more complex fields (nested and repeated fields).

For each corpus, we provide:

`dataset.jsonl` contains the following attributes:
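Since `dataset.jsonl` is in the JSON Lines format (one JSON record per line), it can be loaded with the standard library alone. This is an illustrative sketch, not part of the released tooling; the helper name is ours, and the only record attribute assumed here is `filename`, which the split files reference.

```python
import json

def load_dataset(path):
    """Read a JSON Lines file and return one dict per document record.

    Assumes each non-empty line of the file is a complete JSON object;
    the exact set of attributes is documented with each corpus.
    """
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```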

Tasks

There are three kinds of tasks: Single Template Learning (STL), Mixed Template Learning (MTL), and Unseen Template Learning (UTL), indicated by `lv1`, `lv2`, or `lv3` in the name of the split file provided in `few_shot-splits`:

Each split file contains three list-valued fields (train, valid, and test), each holding a list of filenames that match the `filename` attribute in `dataset.jsonl`. Split files are provided with 10, 50, 100, and 200 training instances each. Multiple splits are provided to mitigate the large variance that can result from sampling different training documents in the few-shot setting.
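The split-file layout described above can be sketched as follows. This is an assumption-laden illustration: the helper name is ours, and we assume the three list-valued fields are named `train`, `valid`, and `test` and that split files are plain JSON.

```python
import json

def load_split(split_path, dataset_records):
    """Partition dataset records according to one split file.

    Assumes the split file is a JSON object with list-valued fields
    train/valid/test, each listing values of the `filename` attribute
    found in dataset.jsonl records.
    """
    with open(split_path, "r", encoding="utf-8") as f:
        split = json.load(f)
    by_name = {r["filename"]: r for r in dataset_records}
    return {
        part: [by_name[name] for name in split[part] if name in by_name]
        for part in ("train", "valid", "test")
    }
```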

Evaluation Tools

The evaluation tool is maintained at https://github.com/google-research/google-research/tree/master/vrdu.

The `python -m` invocation below assumes you are in the `google_research/` directory.

Sample invocation of the evaluation binary (on one dataset):

```shell
python -m vrdu.evaluate \
  --base_dirpath='/path/to/vrdu/registration-form/' \
  --extraction_path='/path/to/results/fara-modelFoo/' \
  --eval_output_path='/path/to/results/fara-modelFoo-results.csv'
```

Note that `extraction_path` contains model outputs in JSON format. Each JSON file corresponds to a task (split), meaning the file name starts with the split name and ends with `-test_predictions.json`.
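A minimal sketch of producing a correctly named prediction file under that convention. Only the file-naming rule is taken from the text above; the helper name is ours, and the JSON schema expected inside each file is defined by the evaluation tool and not shown here.

```python
import json
import os

def write_predictions(extraction_path, split_name, predictions):
    """Serialize model outputs for one split into extraction_path.

    Produces a file named <split_name>-test_predictions.json, the
    naming convention the evaluation binary expects. The structure of
    `predictions` must match the evaluation tool's schema (not shown).
    """
    os.makedirs(extraction_path, exist_ok=True)
    out_path = os.path.join(
        extraction_path, split_name + "-test_predictions.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f)
    return out_path
```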