How to generate the files shared in the resources folder?

TITC commented 3 years ago

Dear author, the Google folder you shared contains the following files, which are used in convert_hdf5_sow.py:

-r-------- 1 root root 351K May  6  2020 bpe.codes
-r-------- 1 root root 865K May  6  2020 parse_vocab.pkl
-r-------- 1 root root 865K May  6  2020 parse_vocab_rules.pkl
-r-------- 1 root root 1.2K May  6  2020 pos_vocab.pkl
-r-------- 1 root root 382K May  6  2020 vocab.txt

Unfortunately, I can't find any description of these files, so I'm trying to figure out their purpose.

Here is what I know so far:

parse_vocab_rules.pkl: a tuple whose elements are dicts, each with 32133 entries
parse_vocab.pkl: the dict in this pickle has 32132 entries
pos_vocab.pkl: a dict whose elements are part-of-speech (POS) tags

I think the third file, pos_vocab.pkl, can be reused for a Chinese version, but how can I generate the first two files, and what is their purpose?
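Observations like the ones above can be gathered by loading each pickle and printing its type and size. The following is a minimal sketch, not code from the repository: the `resources/` path and the helper names `describe` and `describe_pickle` are assumptions made for illustration.

```python
import pickle


def describe(obj):
    """Return a short summary of a loaded pickle object (hypothetical helper)."""
    if isinstance(obj, tuple):
        # e.g. a (word -> index, index -> word) pair
        parts = ", ".join(describe(item) for item in obj)
        return f"tuple({parts})"
    if isinstance(obj, dict):
        return f"dict[{len(obj)}]"
    return type(obj).__name__


def describe_pickle(path):
    """Load a pickle file and summarise its top-level structure."""
    with open(path, "rb") as handle:
        return describe(pickle.load(handle))


if __name__ == "__main__":
    # Assumed location of the downloaded files; adjust to your setup.
    for name in ("parse_vocab_rules.pkl", "parse_vocab.pkl", "pos_vocab.pkl"):
        try:
            print(name, describe_pickle(f"resources/{name}"))
        except FileNotFoundError:
            print(name, "not found")
```

Unpickling executes code embedded in the file, so this should only be run on files from a source you trust.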

tagoyal commented 3 years ago

Hi, those files are just indexers mapping word -> index and index -> word. For our work, we used the BPE tokens from the vocab.txt file to generate them.

For the code to work:

parse_vocab: Get your vocabulary of words. Add PAD, BOS, EOS to the vocab, and create this indexer.

parse_vocab_rules: Get your vocabulary of words. Add PAD, BOS, EOS, X, Y to the vocab, and create this indexer.

Sample code:

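import pickle as pk

# Special tokens come first. X and Y are used for this file;
# parse_vocab uses only PAD, BOS, EOS.
special_tokens = ["PAD", "BOS", "EOS", "X", "Y"]

pp_vocab = {}      # word -> index
pp_rev_vocab = {}  # index -> word
for token in special_tokens:
    idx = len(pp_vocab)
    pp_vocab[token] = idx
    pp_rev_vocab[idx] = token

# Each line of vocab.txt is expected to hold a BPE token followed by its count.
with open("data/vocab.txt", "r", encoding="utf-8") as input_file:
    for line in input_file:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]
        if word in pp_vocab:
            continue  # keep the first index assigned to a token
        idx = len(pp_vocab)
        pp_vocab[word] = idx
        pp_rev_vocab[idx] = word

# Store both directions of the mapping as a single tuple.
with open("data/parse_vocab_rules.pkl", "wb") as output_file:
    pk.dump((pp_vocab, pp_rev_vocab), output_file)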
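
The two indexers described in this reply differ only in their special tokens, so the script could be folded into one helper. The following is a sketch rather than the repository's code: `build_indexer`, `read_tokens`, and `write_indexer` are hypothetical names, and storing parse_vocab.pkl as the same (vocab, rev_vocab) tuple is an assumption the thread does not confirm.

```python
import pickle


def build_indexer(tokens, special_tokens):
    """Return (word -> index, index -> word) with special tokens first."""
    vocab, rev_vocab = {}, {}
    for token in list(special_tokens) + list(tokens):
        if token in vocab:
            continue  # duplicates keep their first index
        index = len(vocab)
        vocab[token] = index
        rev_vocab[index] = token
    return vocab, rev_vocab


def read_tokens(path):
    """Yield the first field of each non-empty line of a vocab file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0]


def write_indexer(vocab_path, out_path, special_tokens):
    """Build an indexer from vocab_path and pickle it to out_path."""
    indexer = build_indexer(read_tokens(vocab_path), special_tokens)
    with open(out_path, "wb") as handle:
        pickle.dump(indexer, handle)
    return indexer


# Tiny in-memory check with made-up BPE tokens.
vocab, rev_vocab = build_indexer(["th@@", "e", "th@@"], ["PAD", "BOS", "EOS"])
assert vocab["th@@"] == 3 and rev_vocab[4] == "e"
assert all(rev_vocab[i] == w for w, i in vocab.items())
```

Under these assumptions, calling `write_indexer("data/vocab.txt", "data/parse_vocab.pkl", ["PAD", "BOS", "EOS"])`, and then calling it again with X and Y appended to the special tokens and the other output path, would produce both files.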

TITC commented 3 years ago

Thanks, I got your point.

tagoyal / sow-reap-paraphrasing

How to generate the files shared in the resources folder? #14