REGEL: Regular Expression Generation from Examples and Language

This is the code repository for the paper "Multi-modal Synthesis of Regular Expressions".

Prerequisites

Before runing the code for parsing your own language or reproducing experimental results, please first set up the Sempre tool following the instructions below:

cd sempre
./pull-dependencies core
./pull-dependencies corenlp
./pull-dependencies freebase
./pull-dependencies tables

This repository also requires the following:

Z3. Make sure you have Z3 installed with the Java binding.
ant to compile the java files.
python3 3.7
java 1.8.0

Alternately you can use the included Dockerfile to build a Docker image:

docker build -t regel:v1 .

Benchmarks

Existing Benchmarks

This repository includes two benchmarks domain:

StackOverflow ($benchmark_domain = "so")
DeepRegex ($benchmark_domain = "deepregex").

The benchmarks (including natural language, examples and ground truth) are under exp/$benchmark_domain/benchmark.

We include the set of sketches we used under exp/$benchmark_domain/sketch

Generate your own benchmarks

To generate your own benchmarks, you need to do the following:

Prepare a benchmark file

Create a new folder exp/$your_benchmark_domain

Inside exp/$your_benchmark_domain/benchmark, create benchmark files look like the following:

// natural language
# natural language description goes here

// examples
# write example in the format "$example_string$",$sign$ where sign can be
#   1) + to indicate it's a positive example
#   2) - to indicate it's a negative example

// gt
# ground truth in the dsl

The sample benchmarks can be found under the so and deepregex dataset.

Prepare a test/train set file for parser

The procedure to create such a file is under "Train Sketch Parser" -> "Prepare Train Set File". The procedure creating a test set file is same as the one creating a train set file except you fill the Sketch field with null.

Sketch Generation

Note: if you only runs the so or deepregex benchmarks, you don't need to generate sketch unless you trained a new model.

After having a set of benchmarks to generate sketch, run the following script:

python parse_benchmark.py --benchmark $your_benchmark_domain --model_dir $trained_model --max_sketch $number_of_sketch_generated_per_benchmark

The generated sketches will be put under exp/$your_benchmark_domain/sketch

`$trained_model`

We provided the pre-trained model for the two benchmarks datasets:

so: pretrained_models/pretrained_so

deepregex:pretrained_models/pretrained_turk

Sketch Completion

To get the instantiations of the sketches that satisfy the given examples, invoke the following command:

python exp.py --benchmark $your_benchmark_domain --log_name $log_folder_name --sketch $sketch_folder_name --mem_max $max_memeory_allowed --synth_mode $synthesis_mode --processnum $number_of_process_allowed   --timeout $timeout_for_each_benchmark

`$synthesis_mode`

The synthesizer can be runned in the following mode:

1: enables all Regel Functionalities

2: enables Regel with pruning using over and under-approxmiation only

4: enumeration with no pruning techniques

5: run Regel with no sketches

`$number_of_process_allowed`

Regel tries to instantiate multiple sketches at parallel. Let number_of_process_allowed = 1 if you wants to disable the parallel functionality. Otherwise, instantiate this parameter with an argument greater than 1. The default value is 5.

Output Processing

The script outputs to the $log_folder_name where a single file in the folder corresponds to the output of a single benchmark. To process the output in batch and generate a csv output, invoke the following script:

python process_output.py --log_folder $log_folder_name --log_path $path_to_log_folder --output_name $output_file_name

The output file will be inside the log folder as $output_file_name.csv

Interactive Mode

We also provide a way to run Regel interactively (i.e. allowing users to interact with Regel by providing examples to refine the synthesis results).

Run Interactive Mode with Benchmark Set

python interactive.py --run_mode 1 --benchmark $your_benchmark_domain --synth_mode $synthesis_mode --process_num $number_of_process_allowed --mem_max $max_memory_allowed --top $top_k_results_allowed --timeout $timeout_for_each_benchmark --max_iter $max_iter --save_history $save_history

`$top_k_results_allowed`

In interactive Regel, we only show user the first k finished sketch results.

`$save_history`

The interactive Regel allows you to stop at any point working and continue from where you left last time. Set this to True if you wants to enable this functionality.

`$max_iter`

The maximum of interaction allowed for each benchmark.

Additional examples

For benchmark domains so and deepregex, we provide a set of additional examples to further refine the results automatically. These additional examples are stored in interactive/$your_benchmark_domain/examples_cache. All the furture addtional examples you entered will also stored inside this directory.

The workflow of the interactive script:

For each benchmark:

Regel reads the benchmarks and sketches and invoke the synthesizer
Rank the outputs and get the $top_k_results_allowed results
Check if any of the results returned matches the ground truth
If matches ground truth, the script automatically goes to the next benchmark
If does not match the ground truth, the script first find examples in the examples_cache that matches the synthesized regex but not the ground truth regex (this will be a negative example) or matches the ground truth regex but not the synthesized regex (this will be a positive example) and uses this example as the one to refine the synthesizer
If there does not exist a example in the examples_cache that matches the criteria, the script will ask the user to enter two additional examples and indicate whether they are positive and negative examples
Regel will run again using the updated examples

Run Interactive Mode with Arbituary Natural Language and Examples ("Customize" Mode)

python interactive.py --run_mode 0 --synth_mode $synthesis_mode --process_num $number_of_process_allowed --mem_max $max_memory_allowed --top $top_k_results_allowed --timeout $timeout_for_each_benchmark --max_iter $max_iter --skecth_num $number_of_sketch_per_benchmark --save_history $save_history

The workflow of the interactive script (customize mode):

Enter a file name for your benchmark
Enter the natural language
Enter the examples
Indicating whether the example entered is a positive or negative example: use + to indicate it is a positive example, and - to indicate it is a negative example

The benchmark file created will be saved at exp/customize/benchmark/$benchmark_file_name

Regel will generate $number_of_sketch_per_benchmark number of sketches.

The sketch file created will be saved at exp/customize/sketch/$benchmark_file_name

Regel will run the synthesizer, and return either $top_k_results_allowed number of results or indicate it times out.
Regel will ask you if there is any correct regex returned. Enter y if there is and enter the correct regex index.
If you enter n for the last question, Regel will ask you to enter two additional examples to disambiguate the regexes
Enter the examples in the same way as you enter the initial examples. After getting the additional examples, Regel will rerun the synthesizer and return the updated synthesis result.

Output

The output of execution is saved at interactive/$your_benchmark_domain/logs/$synthesis_mode/raw_output.csv.

If you uses customize mode, $your_benchmark_domain = 'customize'

Train Sketch Parser

We have provided pretrained model that is ready to use in sempre/pretrained. We as well provide a examples procedure of how to train a model on StackOverflow dataset.

Prepare Train Set File

We show a example train set file in data/so.raw.txt. It is a TSV file with three fields (ID, NL, and Sketch). Please prepare your train set file in this format, and put it in the sempre/dataset/ directory with the name *dataset-name*.raw.txt.

Train Parse Script

To train a model, call

python py_scripts/train.py *dataset-name* *model-dir*

E.g. python py_scripts/train.py so models/so

This command will preproces the training data (to fit the form required by Sempre. The processed data form will be stored in regex/data/so) and train a model that will be saved in the *model-dir* directory.

Parse Script

To parse a set of language descriptions using trained model, call

python pyscripts/parse.py *dataset-name* *model-dir* *topk*, where *topk* is the desired number of sketches.

The parsed sketches will be in ouputs/*dataset-name*, where each file contains the sketches for a single benchmark.

utopia-group / regel

readme