Closed TITC closed 3 years ago
Hi, those files are just indexers mapping word -> index and index -> word. For the words, we used the BPE tokens from the vocab.txt files to generate these.
For the code to work:
parse_vocab: get your vocabulary of words, add PAD, BOS, and EOS to it, and create the indexer.
For parse_vocab_pos: get your vocabulary of words, add PAD, BOS, EOS, X, and Y to it, and create the indexer.
Sample code:
import pickle as pk

# Reserve fixed ids 0-4 for the special tokens.
pp_vocab = {"PAD": 0, "BOS": 1, "EOS": 2, "X": 3, "Y": 4}
pp_rev_vocab = {idx: tok for tok, idx in pp_vocab.items()}

# Each line of vocab.txt is "word count"; assign ids in file order.
with open("data/vocab.txt", "r") as input_file:
    idx = 5
    for line in input_file:
        word, count = line.strip().split()
        if word in pp_vocab:
            continue
        pp_vocab[word] = idx
        pp_rev_vocab[idx] = word
        idx += 1

with open("data/parse_vocab_rules.pkl", "wb") as output_file:
    pk.dump((pp_vocab, pp_rev_vocab), output_file)
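To sanity-check the resulting pickle, you can round-trip a small in-memory vocabulary the same way; the toy tokens below are made up for illustration, standing in for BPE tokens from vocab.txt:

```python
import io
import pickle as pk

# Build an indexer for a toy vocabulary, mirroring the script above.
specials = ["PAD", "BOS", "EOS", "X", "Y"]
tokens = ["the", "quick", "fox"]  # hypothetical stand-ins for BPE tokens

vocab = {}
rev_vocab = {}
for idx, tok in enumerate(specials + tokens):
    vocab[tok] = idx
    rev_vocab[idx] = tok

# Dump and reload through an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
pk.dump((vocab, rev_vocab), buf)
buf.seek(0)
loaded_vocab, loaded_rev = pk.load(buf)

# The two dicts are inverses of each other, and the first
# non-special token gets id 5.
assert all(loaded_rev[i] == w for w, i in loaded_vocab.items())
print(loaded_vocab["the"])  # → 5
```

Loading a real parse_vocab_rules.pkl works the same way: `pk.load(f)` returns the `(pp_vocab, pp_rev_vocab)` tuple.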
Thanks, I got your point.
Dear author, the Google folder you shared contains the following files, which are used in convert_hdf5_sow.py. Unfortunately, I can't find any description of these files, so I'm trying to figure out their purpose.
Here is what I know so far:
parse_vocab_rules.pkl: a tuple whose elements are dicts; each dict has length 32133.
parse_vocab.pkl: the dict in this pkl has length 32132.
pos_vocab.pkl: a dict whose elements are part-of-speech (POS) tags.
I think the third file, pos_vocab.pkl, can be reused in the Chinese version, but how can I generate the first two files, and what is their purpose?