grammar-pattern

Extract and align grammar patterns from English sentences.

This repo offers several Python (3.x) modules for grammatical analysis:

  1. Extracting grammar patterns from sentences. For example, the grammar pattern for "discuss" in the sentence "He likes to discuss the issues ." would be "V n".
  2. Aligning grammar patterns between parallel sentences. For example, given the grammatically erroneous source sentence "He likes to discuss about the issues ." and its grammatically correct target sentence "He likes to discuss the issues .", the aligned grammar pattern for "discuss" would be "V about n" → "V n".

We currently support grammar patterns for verb, noun and adjective headwords. See the Wikipedia article on grammar patterns for background.

Setup

Before using the modules, please install the Python dependencies (mainly spaCy and NLTK):

$ pip install -r requirements.txt

$ python -m spacy download en_core_web_lg 

You can simply run test.py to check that no required modules or data are missing:

$ python test.py

Example Usages

Here we demonstrate how to test our shallow parser, extract grammar patterns from a sentence, and align grammar patterns for parallel sentences.

0. Preprocess the sentences (See How to use AllenNLP Constituency Tree Parser)

Run an existing constituency tree parser to get a linearized constituency tree string for every sentence as a preprocessing step. The constituency tree parser we use is AllenNLP's; they also have an online demo.
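The parser outputs one bracketed (linearized) tree string per sentence. As a quick sanity check, such a string can be loaded with NLTK, which is already one of this repo's dependencies:

```python
from nltk import Tree

# Load a linearized constituency tree string (the parser's output
# format) back into a tree object and recover its tokens.
tree = Tree.fromstring(
    "(S (NP (PRP He)) (VP (VBZ likes) (S (VP (TO to) (VP (VB discuss) "
    "(NP (DT the) (NNS issues)))))) (. .))"
)
print(tree.leaves())  # the original tokens
print(tree.label())   # root label: 'S'
```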


1. Import modules

from modules.shallow_parser import shallow_parse
from modules.grampat import sent_to_pats, align_parallel_pats

2. Get shallow parsed results from sentences

# source sentence: "He liked to discuss about the issues ."
# target sentence: "He likes to discuss the issues ."
# Note that we parse sentences in advance using AllenNLP's constituency tree parser.

src_parsed = shallow_parse("(S (NP (PRP He)) (VP (VBD liked) (S (VP (TO to) (VP (VB discuss) (PP (IN about) (NP (DT the) (NNS issues))))))) (. .))")
tgt_parsed = shallow_parse("(S (NP (PRP He)) (VP (VBZ likes) (S (VP (TO to) (VP (VB discuss) (NP (DT the) (NNS issues)))))) (. .))")
print(src_parsed)

[[['He'], ['liked'], ['to'], ['discuss'], ['about'], ['the', 'issues'], ['.']],
 [['he'], ['like'], ['to'], ['discuss'], ['about'], ['the', 'issue'], ['.']],
 [['PRP'], ['VBD'], ['TO'], ['VB'], ['IN'], ['DT', 'NNS'], ['.']],
 [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['H-PP'], ['I-NP', 'H-NP'], ['O']]]
print(tgt_parsed)

[[['He'], ['likes'], ['to'], ['discuss'], ['the', 'issues'], ['.']],
 [['he'], ['like'], ['to'], ['discuss'], ['the', 'issue'], ['.']],
 [['PRP'], ['VBZ'], ['TO'], ['VB'], ['DT', 'NNS'], ['.']],
 [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['I-NP', 'H-NP'], ['O']]]

shallow_parse() returns a list of four parallel lists of chunked elements: the word tokens, their lemmas, their POS tags, and their chunk tags (one inner list per chunk).

Note that the prefix H/I/O of the chunk tags represents: H marks the head word of a chunk, I marks a non-head word inside a chunk, and O marks a token outside any chunk (e.g. punctuation).
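The four parallel lists can be unpacked and iterated together, which is a quick way to inspect each token's lemma, POS tag and chunk assignment. This sketch reuses the example output of shallow_parse() shown above:

```python
# Example output of shallow_parse() for the source sentence:
# words, lemmas, POS tags and chunk tags as parallel lists.
src_parsed = [
    [['He'], ['liked'], ['to'], ['discuss'], ['about'], ['the', 'issues'], ['.']],
    [['he'], ['like'], ['to'], ['discuss'], ['about'], ['the', 'issue'], ['.']],
    [['PRP'], ['VBD'], ['TO'], ['VB'], ['IN'], ['DT', 'NNS'], ['.']],
    [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['H-PP'], ['I-NP', 'H-NP'], ['O']],
]

words, lemmas, pos_tags, chunk_tags = src_parsed

# Flatten the chunks into one (word, lemma, pos, chunk_tag) row per token.
rows = [
    (word, lemma, pos, chunk)
    for ws, ls, ps, cs in zip(words, lemmas, pos_tags, chunk_tags)
    for word, lemma, pos, chunk in zip(ws, ls, ps, cs)
]
for row in rows:
    print(row)
```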

3. Extract grammar patterns from sentences

src_pats = sent_to_pats(src_parsed)
tgt_pats = sent_to_pats(tgt_parsed)
print(src_pats)

[('LIKE', 'V to v', 'liked to discuss', (1, 3)),
 ('DISCUSS', 'V about n', 'discuss about the issues', (3, 5))]
print(tgt_pats)

[('LIKE', 'V to v', 'likes to discuss', (1, 3)),
 ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]

sent_to_pats() returns a list of tuples, each containing: the uppercased lemma of the headword, its grammar pattern, the matched text span, and the (start, end) chunk indices of that span.

How sent_to_pats() works: it locates each verb, noun or adjective headword in the chunked sentence, then maps the headword and the chunks that follow it to pattern elements (e.g. the headword verb to "V", a noun phrase to "n", an infinitive verb to "v", and function words such as "to" or "about" to themselves).
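The chunk-to-pattern mapping can be illustrated with a toy function. This is a simplified sketch, NOT the repo's implementation: it assumes NP chunks map to "n", bare-infinitive verbs to "v", and other chunk heads to their own lowercased word, and it does not handle pattern boundaries or non-verb headwords the way the real module does.

```python
# Toy sketch (NOT the repo's implementation): build a pattern string
# for a verb headword by walking the chunks that follow it.
def toy_pattern(words, pos_tags, chunk_tags, head_idx):
    elems = ["V"]
    for i in range(head_idx + 1, len(chunk_tags)):
        if chunk_tags[i][-1] == "O":           # stop outside any chunk
            break
        if chunk_tags[i][-1].endswith("NP"):   # noun phrase -> "n"
            elems.append("n")
        elif pos_tags[i][-1] == "VB":          # bare infinitive -> "v"
            elems.append("v")
        else:                                  # function word, e.g. "about"
            elems.append(words[i][-1].lower())
    return " ".join(elems)

# Chunked source sentence from the shallow_parse() example above.
words      = [['He'], ['liked'], ['to'], ['discuss'], ['about'], ['the', 'issues'], ['.']]
pos_tags   = [['PRP'], ['VBD'], ['TO'], ['VB'], ['IN'], ['DT', 'NNS'], ['.']]
chunk_tags = [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['H-PP'], ['I-NP', 'H-NP'], ['O']]

print(toy_pattern(words, pos_tags, chunk_tags, head_idx=3))  # V about n
```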

4. Align grammar patterns for parallel sentences

parallel_pats = align_parallel_pats(src_pats, tgt_pats)
print(parallel_pats)

[[('LIKE', 'V to v', 'liked to discuss', (1, 3)),
  ('LIKE', 'V to v', 'likes to discuss', (1, 3))],
 [('DISCUSS', 'V about n', 'discuss about the issues', (3, 5)),
  ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]]

align_parallel_pats() returns a list of aligned grammar pattern pairs, one [source pattern, target pattern] pair per aligned headword.
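The alignment can be sketched with a toy function that pairs source and target pattern tuples sharing the same headword. This is a simplified sketch, not the repo's implementation (it assumes each headword appears at most once per sentence), using the example output of sent_to_pats() above:

```python
# Toy sketch (not the repo's implementation): pair source and target
# pattern tuples that share the same headword lemma.
def toy_align(src_pats, tgt_pats):
    tgt_by_head = {pat[0]: pat for pat in tgt_pats}
    return [
        [src, tgt_by_head[src[0]]]
        for src in src_pats
        if src[0] in tgt_by_head
    ]

src_pats = [('LIKE', 'V to v', 'liked to discuss', (1, 3)),
            ('DISCUSS', 'V about n', 'discuss about the issues', (3, 5))]
tgt_pats = [('LIKE', 'V to v', 'likes to discuss', (1, 3)),
            ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]

for src, tgt in toy_align(src_pats, tgt_pats):
    print(src[1], '->', tgt[1])
```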

What's Next?

Now that you've completed the Example Usages guide, you can use these modules to count grammar patterns over a large English monolingual corpus (BNC) and parallel grammatical error correction corpora (EFCAMDAT, LANG-8, CLC-FCE). We release a Python script for doing this (with multiprocessing support):

$ python compute_grampat.py \
-in_src_path data/src.tree.txt \
-in_tgt_path data/tgt.tree.txt \
-out_path data \
-out_prefix dataset_name \
-n_jobs 4 \
-batch_size 1024

The output file data/dataset_name.grampat.dill stores the results as a Python dictionary with two keys.

We released grammar pattern results for BNC, EFCAMDAT, LANG-8 and CLC-FCE. These can be used for grammatical analysis (see query_grampat.py for example usage).

Citation

If you find this repo helpful for your research, you can cite it with the following BibTeX:

@software{yi_chen_howard_lo_2020_3611412,
  author       = {Yi-Chen Howard Lo},
  title        = {howardyclo/grammar-pattern},
  month        = jan,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v1.0.0},
  doi          = {10.5281/zenodo.3611412},
  url          = {https://doi.org/10.5281/zenodo.3611412}
}

Alternatively, click the DOI badge (on the right-hand side of the Zenodo page) to export the citation in any format you like.