tagoyal / sow-reap-paraphrasing

Contains data/code for the paper "Neural Syntactic Preordering for Controlled Paraphrase Generation" (ACL 2020).

how to generate these files which shared in resources folder? #14

Closed TITC closed 3 years ago

TITC commented 3 years ago

Dear author, the Google folder you shared contains the following files, which are used in convert_hdf5_sow.py.

-r-------- 1 root root 351K May  6  2020 bpe.codes
-r-------- 1 root root 865K May  6  2020 parse_vocab.pkl
-r-------- 1 root root 865K May  6  2020 parse_vocab_rules.pkl
-r-------- 1 root root 1.2K May  6  2020 pos_vocab.pkl
-r-------- 1 root root 382K May  6  2020 vocab.txt

Unfortunately, I can't find any description of these files, so I'm trying to figure out their purpose.


Here is what I know so far.


I think the third .pkl file, pos_vocab.pkl, can be reused in the Chinese version, but how can I generate the first two .pkl files, and what is their purpose?

tagoyal commented 3 years ago

Hi, those files are just indexers mapping word -> index and index -> word. For our work, we used the BPE tokens from the vocab.txt file to generate them.
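
As a rough illustration of what such a two-way indexer does, here is a minimal sketch. The tokens (including the `@@` subword marker) and the `encode`/`decode` helpers are made up for this example and are not taken from the repository.

```python
# Toy two-way indexer. The tokens below are hypothetical and only
# stand in for real BPE tokens from vocab.txt.
word_to_index = {"PAD": 0, "BOS": 1, "EOS": 2, "the": 3, "c@@": 4, "at": 5}
index_to_word = {index: word for word, index in word_to_index.items()}


def encode(tokens):
    """Map a list of tokens to their integer ids."""
    return [word_to_index[token] for token in tokens]


def decode(ids):
    """Map a list of integer ids back to tokens."""
    return [index_to_word[i] for i in ids]


ids = encode(["the", "c@@", "at"])
print(ids)
print(decode(ids))
```

The two dictionaries are inverses of each other, so decoding an encoded sequence returns the original tokens.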

For the code to work:

For parse_vocab: Take your vocabulary of words, add PAD, BOS, and EOS to it, and create this indexer.

For parse_vocab_rules: Take your vocabulary of words, add PAD, BOS, EOS, X, and Y to it, and create this indexer.

Sample code:

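import pickle as pk

# Special tokens take the first indices. X and Y are the extra symbols
# for parse_vocab_rules (leave them out, and change the output path,
# to build parse_vocab).
special_tokens = ["PAD", "BOS", "EOS", "X", "Y"]

pp_vocab = {}      # word -> index
pp_rev_vocab = {}  # index -> word
for index, token in enumerate(special_tokens):
    pp_vocab[token] = index
    pp_rev_vocab[index] = token

next_id = len(special_tokens)
with open("data/vocab.txt", "r") as input_file:
    for line in input_file:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]  # each line is "<word> <count>"; only the word is used
        if word in pp_vocab:
            continue
        pp_vocab[word] = next_id
        pp_rev_vocab[next_id] = word
        next_id += 1

# Both dictionaries are stored together as one tuple.
with open("data/parse_vocab_rules.pkl", "wb") as output_file:
    pk.dump((pp_vocab, pp_rev_vocab), output_file)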
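
The two variants described above differ only in their special tokens, so they can be sketched with one helper. This is a hedged sketch, not the authors' code: `build_indexer`, the toy word list, and the temporary output path are assumptions made for illustration. The only thing taken from the sample code is the storage layout, a pickled `(word -> index, index -> word)` tuple.

```python
import os
import pickle
import tempfile


def build_indexer(words, special_tokens):
    """Return (word -> index, index -> word), special tokens first.

    Hypothetical helper: duplicates keep the index of their first occurrence.
    """
    word_to_index, index_to_word = {}, {}
    for token in list(special_tokens) + list(words):
        if token in word_to_index:
            continue
        index = len(word_to_index)
        word_to_index[token] = index
        index_to_word[index] = token
    return word_to_index, index_to_word


# Made-up tokens standing in for the first column of vocab.txt.
words = ["the", "c@@", "at", "the"]

parse_vocab = build_indexer(words, ["PAD", "BOS", "EOS"])
parse_vocab_rules = build_indexer(words, ["PAD", "BOS", "EOS", "X", "Y"])

# Round trip through pickle, using a temporary directory so that
# nothing under data/ is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "parse_vocab_rules.pkl")
    with open(path, "wb") as f:
        pickle.dump(parse_vocab_rules, f)
    with open(path, "rb") as f:
        loaded_vocab, loaded_rev_vocab = pickle.load(f)

print(loaded_vocab["the"], loaded_rev_vocab[loaded_vocab["the"]])
```

For a Chinese vocabulary, the same helper would presumably work unchanged once `words` holds the tokens of the new vocabulary file, though the thread does not confirm this.
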
TITC commented 3 years ago

Thanks, I got your point.