Overnight parsing - question about input files

codeashcode commented 8 years ago

I understand that @mode "genovernight" can be used to dump a set of (z,c) pairs (logical form, canonical utterance), and that by paraphrasing we can get (z,c,x) triples (logical form, canonical utterance, paraphrase utterance). This set of (z,c,x) can be divided into two files: .paraphrases.train.examples and .paraphrases.test.examples.

But more inputs are needed to train the semantic parser, namely the following:

a. .train.superlatives.example - superlative training (and test file too)
b. .phrase_alignments - phrase alignment file
c. .word_alignments.berkeley - word alignment file
d. -ppdb.txt - ppdb model

How can I generate these files for my domain? Details about the steps to produce these files would be really helpful.

Zhenshan-Jin commented 6 years ago

I have the same problem with it. Thanks!

BrijeshKaria commented 6 years ago

I am also stuck with the same question.

mmarinated commented 5 years ago

I have run into this issue this year as well. Time didn't help to resolve it :) Does anyone have any suggestions?

ppasupat commented 5 years ago

Hmm.. I'm not familiar with the overnight package, but there are a few classes in edu.stanford.nlp.sempre.overnight that can be directly invoked (i.e., have the public static void main method).

Here are some guesses based on reading the code + looking at the files in lib/data/overnight/ (retrieved by calling ./pull-dependencies overnight):

The examples LispTree file (e.g., basketball.paraphrases.train.examples) is probably manually created based on the Turked paraphrases. The Turkers would be shown the canonical utterances from utterances_formula.tsv, which is produced by the genovernight mode.
CreateBerkeleyAlignerInputFromLispTree.java takes the LispTree file and outputs two files, .e and .f.
The Berkeley unsupervised aligner takes these .e and .f files and produces training.e-f.align (among other things). This file contains the predicted word alignments (e.g., "he ran" and "he locomoted swiftly" would produce "0-0 1-1 1-2").
Aligner.java takes 4 arguments: example file, output file, alignment algorithm name ("heuristic" or "berkeley"), and score threshold.
- For the "heuristic" algorithm, the input file should be a LispTree file (maybe like lib/data/overnight/basketball.paraphrases.train.examples). The output is probably basketball.phrase_alignments.
- For the "berkeley" algorithm, the input file is a TSV with 3 columns: two for the utterances, and one for the alignment from training.e-f.align. That is, each line should look something like "he ran [tab] he locomoted swiftly [tab] 0-0 1-1 1-2". The output is probably basketball.word_alignments.berkeley.
basketball-ppdb.txt seems to be lines from the PPDB database. These only contain word-to-word paraphrases, so my guess is that they just filter PPDB to the (word1, word2) combinations that appear in the dataset. (This is perhaps cutting corners. In reality, I imagine that a whole PPDB querying system has to be implemented so that any test-time word pairs can be queried.)
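
For the "berkeley" input described above, here is a rough Python sketch of how the 3-column TSV could be assembled from the .e file, the .f file, and training.e-f.align. This is a guess rather than anything taken from the SEMPRE code: the function names are made up for illustration, and it assumes that line i of each file refers to the same utterance pair and that the column order is (.e utterance, .f utterance, alignment).

```python
from typing import Iterable, List, Tuple


def parse_alignment(align: str) -> List[Tuple[int, int]]:
    """Turn an alignment string such as "0-0 1-1 1-2" into index pairs."""
    pairs = []
    for token in align.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def build_tsv_lines(e_lines: Iterable[str],
                    f_lines: Iterable[str],
                    align_lines: Iterable[str]) -> List[str]:
    """Join three parallel line sequences into tab-separated rows.

    Assumption: the i-th line of each input describes the same pair.
    """
    e_list, f_list, a_list = list(e_lines), list(f_lines), list(align_lines)
    if not (len(e_list) == len(f_list) == len(a_list)):
        raise ValueError("the three inputs must have the same number of lines")
    return ["\t".join((e.strip(), f.strip(), a.strip()))
            for e, f, a in zip(e_list, f_list, a_list)]


def aligned_words(e: str, f: str, align: str) -> List[Tuple[str, str]]:
    """Map index pairs back to word pairs, as a sanity check on a row."""
    e_tokens, f_tokens = e.split(), f.split()
    return [(e_tokens[i], f_tokens[j]) for i, j in parse_alignment(align)]


def write_tsv(e_path: str, f_path: str, align_path: str, out_path: str) -> None:
    """Read the three files (paths are supplied by the caller) and write the TSV."""
    with open(e_path, encoding="utf-8") as e_file, \
         open(f_path, encoding="utf-8") as f_file, \
         open(align_path, encoding="utf-8") as a_file:
        rows = build_tsv_lines(e_file, f_file, a_file)
    with open(out_path, "w", encoding="utf-8") as out_file:
        for row in rows:
            out_file.write(row + "\n")
```

Checking a few rows with aligned_words before running Aligner.java should reveal whether the .e and .f columns are in the order the aligner used; if the word pairs look wrong, the two utterance columns may need to be swapped.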

percyliang / sempre

Overnight parsing - question about input files #109