Open codeashcode opened 8 years ago
I have the same problem with it. Thanks!
I am also stuck with the same question.
This year I also have this issue. Time didn't help to resolve it :) Do you have any suggestions?
Hmm.. I'm not familiar with the overnight package, but there are a few classes in edu.stanford.nlp.sempre.overnight that can be directly invoked (i.e., they have a public static void main method).
Here are some guesses based on reading the code and looking at the files in lib/data/overnight/ (retrieved by calling ./pull-dependencies overnight):
1. The LispTree example file (e.g., basketball.paraphrases.train.examples) is probably manually created based on the Turked paraphrases. The things to show the Turkers are the canonical utterances from utterances_formula.tsv, obtained with the genovernight mode.
2. CreateBerkeleyAlignerInputFromLispTree.java takes the LispTree file and outputs two files, .e and .f.
3. The Berkeley aligner takes the .e and .f files and produces training.e-f.align (among other things). This file contains the predicted word alignments (e.g., "he ran" and "he locomoted swiftly" would produce "0-0 1-1 1-2").
4. Aligner.java takes 4 arguments: example file, output file, alignment algorithm name ("heuristic" or "berkeley"), and score threshold.
   a. For "heuristic", the example file is presumably the LispTree file (e.g., lib/data/overnight/basketball.paraphrases.train.examples). The output is probably basketball.phrase_alignments.
   b. For "berkeley", the example file is presumably built from the sentence pairs and training.e-f.align. That is, each line should look something like "he ran [tab] he locomoted swiftly [tab] 0-0 1-1 1-2". The output is probably basketball.word_alignments.berkeley.
5. basketball-ppdb.txt seems to be lines from the PPDB database. These only contain word-to-word paraphrases, so my guess is that they just filter PPDB to the (word1, word2) combinations that appear in the dataset. (This is perhaps cutting corners. In reality, I imagine that a whole PPDB querying system would have to be implemented so that any test-time word pairs can be queried.)
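To make the guessed word-alignment format concrete, here is a minimal Python sketch. It is not taken from SEMPRE; the function names are hypothetical, and it assumes that each line of the .e, .f, and training.e-f.align files corresponds to the same sentence pair, and that in a token like "1-2" the first index points into the first sentence and the second index into the second sentence (which is what the "he ran" example suggests).

```python
from typing import List, Tuple


def parse_alignment(align: str) -> List[Tuple[int, int]]:
    """Parse an alignment string such as '0-0 1-1 1-2' into index pairs."""
    pairs = []
    for token in align.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def merge_line(e_line: str, f_line: str, align_line: str) -> str:
    """Join one sentence pair and its alignment into a tab-separated line.

    Assumption: this three-column layout is the guessed input format,
    not a format confirmed by the SEMPRE code.
    """
    return "\t".join([e_line.strip(), f_line.strip(), align_line.strip()])


def aligned_word_pairs(merged: str) -> List[Tuple[str, str]]:
    """Return the (word1, word2) pairs that the alignment links together."""
    e_sent, f_sent, align = merged.split("\t")
    e_words, f_words = e_sent.split(), f_sent.split()
    return [(e_words[i], f_words[j]) for i, j in parse_alignment(align)]


merged = merge_line("he ran\n", "he locomoted swiftly\n", "0-0 1-1 1-2\n")
print(aligned_word_pairs(merged))
# [('he', 'he'), ('ran', 'locomoted'), ('ran', 'swiftly')]
```

Word pairs collected this way are also the kind of (word1, word2) combinations one could use to filter a larger paraphrase table down to what appears in the dataset, if the guess about the PPDB file is right.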
I understand that the @mode "genovernight" can be used to dump a set of (z, c) pairs (logical form, canonical utterance), and that by paraphrasing we can get (z, c, x) triples (logical form, canonical utterance, paraphrase utterance). This set of (z, c, x) triples can be divided into two files: .paraphrases.train.examples and .paraphrases.test.examples.
But more inputs are needed to train the semantic parser, namely the following:
a. .train.superlatives.example - superlative training file (and a test file too)
b. .phrase_alignments - phrase alignment file
c. .word_alignments.berkeley - word alignment file
d. -ppdb.txt - PPDB model
How can I generate these files for my domain? Details about the steps to produce these files would be really helpful.
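For the train/test division mentioned above, a minimal Python sketch could look like the following. Everything here is an assumption made for illustration: the Triple type, the split_triples helper, the 20% test fraction, and the placeholder data are hypothetical, and writing the triples out in the LispTree example format that SEMPRE reads is not covered.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class Triple(NamedTuple):
    logical_form: str  # z
    canonical: str     # c
    paraphrase: str    # x


def split_triples(
    triples: Sequence[Triple], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Triple], List[Triple]]:
    """Shuffle deterministically, then split into (train, test)."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data only; real triples would come from genovernight plus paraphrasing.
triples = [Triple(f"z{i}", f"canonical {i}", f"paraphrase {i}") for i in range(10)]
train, test = split_triples(triples)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible, so the train and test files stay stable across runs.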