paperswithcode / sota-extractor

The SOTA extractor pipeline
Apache License 2.0

Automatic SOTA (state-of-the-art) extraction

Aggregate public SOTA tables that are shared under free licences.

Download the scraped data or run the scrapers yourself to get the latest data.

We plan to automate the extraction of tasks, datasets and results from papers.

Getting the data

The data is kept in the data directory. All data is shared under the CC-BY-SA-4 licence.

The data has been parsed into a consistent JSON format, described below.

JSON format description

The format consists of five primary data types: Task, Dataset, Sota, SotaRow and Link.

A valid JSON file is a list of Task objects. You can see examples in the data/tasks folder.
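As a quick sanity check, the top-level rule above (a file is a list of Task objects) can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `load_tasks` is hypothetical, and it checks only the outer shape, not the individual fields.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Load one JSON file and check that its top level is a list of objects.

    Hypothetical helper for illustration; it does not validate Task fields.
    """
    with Path(path).open(encoding="utf-8") as f:
        tasks = json.load(f)
    if not isinstance(tasks, list):
        raise ValueError(
            f"{path}: expected a list of Task objects, got {type(tasks).__name__}"
        )
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            raise ValueError(f"{path}: item {i} is not a JSON object")
    return tasks
```

Calling `load_tasks` on any file under data/tasks should return a Python list of dictionaries, one per Task, or raise `ValueError` if the file does not have the expected shape.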

Task

A Task consists of the following fields:

Dataset

A Dataset consists of the following fields:

Link

A Link object describes a URL, and has these two fields:

Sota

A Sota object represents one state-of-the-art table, with these fields:

SotaRow

A SotaRow object represents one row of the SOTA table and has these fields:

Running the scrapers

Installation

Requires Python 3.6+.

pip install -r requirements.txt

NLP-progress

NLP-progress is a hand-annotated collection of SOTA results from NLP tasks.

The scraper is part of the NLP-progress project.

Licence: MIT

EFF

EFF has annotated a set of SOTA results on a small number of tasks, and produced this great report.

To convert the current content run:

python -m scrapers.eff

Licence: CC-BY-SA-4

SQuAD

The Stanford Question Answering Dataset is an active project for evaluating the question answering task using a hidden test set.

To scrape the current content run:

python -m scrapers.squad

Licence: CC-BY-SA-4

RedditSota

The RedditSota repository lists the best method for a variety of tasks across all of ML.

To scrape the current content run:

python -m scrapers.redditsota

Licence: Apache-2

SNLI

The Stanford Natural Language Inference (SNLI) Corpus is an active project for Natural Language Inference.

To scrape the current content run:

python -m scrapers.snli

Licence: CC-BY-SA

Cityscapes

Cityscapes is a benchmark for semantic segmentation.

To scrape the current content run:

python -m scrapers.cityscapes
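The individual scraper commands in this section can also be run in sequence from one script. The following is a rough sketch under a few assumptions: it is run from the repository root with the requirements installed, the names `SCRAPER_MODULES`, `build_command` and `run_all` are hypothetical, and the NLP-progress scraper is left out because it lives in the NLP-progress project.

```python
import subprocess
import sys

# Module names copied from the commands in this README.
SCRAPER_MODULES = [
    "scrapers.eff",
    "scrapers.squad",
    "scrapers.redditsota",
    "scrapers.snli",
    "scrapers.cityscapes",
]


def build_command(module):
    """Return the argv list equivalent to `python -m <module>`."""
    # sys.executable keeps the scrapers on the same interpreter as this script.
    return [sys.executable, "-m", module]


def run_all(modules=SCRAPER_MODULES):
    """Run each scraper in turn and return the modules that exited with an error."""
    failed = []
    for module in modules:
        result = subprocess.run(build_command(module))
        if result.returncode != 0:
            failed.append(module)
    return failed
```

Calling `run_all()` returns an empty list when every scraper exits cleanly. One failing scraper does not stop the rest, which is convenient when a single upstream page has changed its layout.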

Evaluating the SOTA extraction performance

In the future, this repository will also contain the automatic SOTA extraction pipeline. The aim is to automatically extract tasks, datasets and results from papers.

To evaluate the current prediction performance for all tasks:

python -m extractor.eval_all

The most recent report is in eval_all_report.csv.
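To inspect the report programmatically, the CSV can be read with the standard library. This is only a sketch: it assumes the file has a header row and that the path passed in points to wherever eval_all_report.csv lives in your checkout. The column names are not documented here, so the code treats them generically, and `read_report` is a hypothetical name.

```python
import csv


def read_report(path):
    """Read a CSV report into a list of dicts keyed by the header row.

    Assumes the first line of the file is a header; no column names are assumed.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Each returned row maps column name to the cell's string value, so `rows[0].keys()` shows which columns the report actually contains.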