Description
Design and implement a syntactic parser that generates unlabelled, undirected dependency parses for a given corpus of sentences. The parser must be unsupervised, meaning that it can only be trained using unannotated text corpora and that it may not contain hard-coded language rules.
Syntactic parsing of text is a crucial component of many Natural Language Processing tasks, as it helps to determine the precise meaning of a given word or constituent in a sentence. Current state-of-the-art parsers rely either on human-created rules (e.g. the Link Grammar Parser) or on training on large annotated treebanks (e.g. Parsey McParseface), meaning that they require a significant effort from specialized humans to produce. Consequently, only the most popular human languages have reliable syntactic parsers.
This RFAI is part of an ongoing effort to learn a grammar from a corpus of text without any annotations in an unsupervised manner (see this paper and this repository), which would allow for more powerful NLP tools for understanding any language, or even variations of a language (e.g. chatspeak, baby language, etc.).
The goal of the challenge is to produce an unsupervised, undirected, unlabelled dependency parser capable of reproducing the evaluation treebanks as closely as possible.
Acceptance Criteria
The parser should produce unlabelled, undirected dependency parses, following the same format as the provided treebanks (ULL format). Example:
Examples are silly
0 ###LEFT-WALL### 2 are
1 Examples 2 are
2 are 3 silly
Where “###LEFT-WALL###” is equivalent to the root node in other dependency parsing formalisms (this token is not strictly required, see below). The first line contains the parsed sentence; the following lines contain links between word-pairs, in the format: left-word-index left-word right-word-index right-word, where the “###LEFT-WALL###” token takes index 0 and the sentence words are numbered from 1.
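For concreteness, here is a minimal Python sketch of a writer for this format; the function name is illustrative, and the blank line separating consecutive parses is an assumption based on the provided treebanks, not a required API:

def write_ull_parse(sentence_tokens, links, out_file):
    """Write one parse in ULL format: the sentence line, then one link
    per line as 'index1 word1 index2 word2'. A blank line separating
    consecutive parses is assumed, as in the provided treebanks."""
    # Index 0 is the wall; sentence words are numbered from 1.
    tokens = ["###LEFT-WALL###"] + list(sentence_tokens)
    out_file.write(" ".join(sentence_tokens) + "\n")
    for i, j in sorted(links):  # each link is an index pair with i < j
        out_file.write(f"{i} {tokens[i]} {j} {tokens[j]}\n")
    out_file.write("\n")

# Reproducing the example above:
# with open("parses.txt", "w") as f:
#     write_ull_parse(["Examples", "are", "silly"], [(0, 2), (1, 2), (2, 3)], f)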
The parser should be trained on the provided training data, and should produce parses with the same tokenization as the evaluation datasets (identical to that of the training data), with space as the token separator.
After training on the entire corpus, the parser should be able to provide parses for any subset of sentences in the corpus.
The output parses should be stored in files with the same names as the corresponding input text files.
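Putting the three requirements above together, a hypothetical driver could look like the following sketch (reusing write_ull_parse from the sketch above; parser.parse is an assumed method returning undirected links as index pairs, not part of any required interface):

import os

def parse_corpus(parser, in_dir, out_dir):
    """Hypothetical driver: parse each input file sentence by sentence
    and write ULL parses to an identically named file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(in_dir)):
        with open(os.path.join(in_dir, name)) as fin, \
             open(os.path.join(out_dir, name), "w") as fout:
            for line in fin:
                tokens = line.split()         # space is the token separator
                if not tokens:
                    continue
                links = parser.parse(tokens)  # assumed: undirected index pairs
                write_ull_parse(tokens, links, fout)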
Useful information
The letter case of the produced parses will be ignored during evaluation.
We note that the provided treebanks contain only planar parses, so adding planarity restrictions to the parsing algorithm may be beneficial (see the sketch after this list).
The “###LEFT-WALL###” token and the final dot in a sentence (if it exists) are ignored during evaluation, so in principle it is not required to include them in the output parses. However, given the planarity of the provided parses, they do affect the structure of the given treebanks.
The sentences in the provided data sets are not contiguous in their original sources, so it is most likely convenient to train the parser on a sentence-by-sentence basis, without reference to neighbouring sentences.
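As an illustration of the planarity restriction mentioned above, here is a small, purely illustrative check (not part of the required deliverable) that a set of undirected links has no crossings when the tokens are laid out on a line:

from itertools import combinations

def is_planar(links):
    """Two links (a, b) and (c, d), each normalized so that the left
    index comes first, cross iff a < c < b < d (or symmetrically)."""
    norm = [tuple(sorted(link)) for link in links]
    for (a, b), (c, d) in combinations(norm, 2):
        if a < c < b < d or c < a < d < b:
            return False
    return True

# The example parse above is planar:
# is_planar([(0, 2), (1, 2), (2, 3)])  # -> True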
Training dataset
The training dataset to use is a pre-cleaned version of a collection of books for children in English, obtained from the Gutenberg Project, and hence referred to as “Gutenberg Children”: Gutenberg Children Corpus
Evaluation treebanks
For the benefit of the participants, we also provide three different treebanks used for evaluation. Note that the requested parser must be unsupervised, so the parses in these treebanks should NOT be used during training. They can, however, be used to guide the design of your parser.
A parse corpus of the complete training dataset, referred to as “Bronze Standard”. These parses are obtained using the Link Grammar (LG) parser in English, version 5.5.1. As can be seen, these parses are not perfect, as they are obtained automatically: the LG parser was not able to include the tokens surrounded by brackets in a complete parse tree. Use this treebank only as a reference: Bronze Standard (http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/LG5.5.1/capital/parses/)
A subset of the above corpus, a “Silver Standard”, which gathers in a single file only those sentences that were parsed completely by the LG parser and that contain no direct-speech fragments: Silver Standard
A “Gold Standard”, which is a subset of 200+ sentences from the Silver Standard, whose parses were manually corrected by a human: Gold Standard
Acceptance Criteria
The algorithm should be able to produce dependency parses for the above Gutenberg Children corpus or any subset of it.
The parser should achieve an F1-score of at least 0.7 on the Silver Standard and at least 0.75 on the Gold Standard (see Metrics below).
Metrics
F1-score comparing the produced parses against reference parses. The treebank score is calculated as the average of the per-sentence F1-scores of the parses it contains. As mentioned above, the “###LEFT-WALL###” token and the final dot in a sentence (if it exists) are ignored during evaluation.
The parse-evaluator code of the ULL pipeline will be used for evaluation (using option “-i”). Specifically, the entry point for the evaluator code is here.
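For intuition only, the per-sentence F1 described above might be computed as in the following sketch; the authoritative implementation is the linked parse-evaluator, and details such as the handling of empty link sets are assumptions here:

def sentence_f1(predicted, reference, ignored=frozenset({0})):
    """F1 between two sets of undirected links for one sentence.
    Links touching ignored token indices (e.g. index 0, the
    ###LEFT-WALL###, or a final dot) are dropped before comparing."""
    def clean(links):
        return {tuple(sorted(l)) for l in links if not set(l) & ignored}
    pred, ref = clean(predicted), clean(reference)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Treebank score: the average of sentence_f1 over all the parses it contains.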
NON-FUNCTIONAL REQUIREMENTS
The solution should be implemented in the Python, Java or Scheme programming languages, or have an interface to any of these languages.
The solution should be available under the MIT open source license, optionally using any other libraries under MIT, BSD or Apache licenses or similar licensing terms (rights for binary and/or source re-distribution and modification).
Authors: Anton Kolonin; Andres Suarez Madrigal
Expiration Date: 20 June 2020