In model.py we have the following method, which loads the static embeddings for the built datasets:
```python
import numpy as np
from gensim.models import KeyedVectors


def get_init_embedding(int2word_dict, embedding_dim, word2vec_file):
    word_vectors = KeyedVectors.load_word2vec_format(word2vec_file)
    print(f"Shape of word_vectors {word_vectors.vectors.shape}")
    word_vec_list = list()
    for _, word in sorted(int2word_dict.items()):
        try:
            # strip the index suffix from NER placeholders (e.g. "LOCATION_1" -> "LOCATION")
            word = word.split(sep="_")[0]
            # map NER placeholder tags to generic Bengali words that exist in the vocabulary
            if word in ['LOCATION']:
                word = 'জায়গা'       # "place"
            elif word in ['PERSON']:
                word = 'লোক'         # "person"
            elif word in ['ORGANIZATION']:
                word = 'প্রতিষ্ঠান'    # "organization"
            word_vec = word_vectors.word_vec(word)
        except KeyError:
            # out-of-vocabulary words fall back to a zero vector
            word_vec = np.zeros([embedding_dim], dtype=np.float32)
        word_vec_list.append(word_vec)
    # random vectors for the <s> and </s> sentence markers
    word_vec_list[2] = np.random.normal(0, 1, embedding_dim)
    word_vec_list[3] = np.random.normal(0, 1, embedding_dim)
    return np.array(word_vec_list)
```
I was wondering whether we can replace it with transformer-based pretrained weights. Since the internal architecture is basically an encoder-decoder structure, I presume this should be possible. I am just a bit confused about the places where I would have to make changes.
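To make the question concrete, here is a rough sketch of what I imagine the replacement could look like: pulling the input-embedding matrix out of a pretrained multilingual BERT and averaging the WordPiece vectors for each vocabulary entry. The function name `get_bert_init_embedding`, the `bert-base-multilingual-cased` checkpoint, and the averaging strategy are all my assumptions, not anything the repo ships:

```python
# Minimal sketch (not the repo's code), assuming the Hugging Face
# transformers library and a multilingual BERT checkpoint.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel


def get_bert_init_embedding(int2word_dict, model_name="bert-base-multilingual-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    # BERT's input embedding table, shape (bert_vocab_size, hidden_size)
    embedding_matrix = model.get_input_embeddings().weight.detach()

    word_vec_list = []
    for _, word in sorted(int2word_dict.items()):
        # a single vocabulary word may split into several WordPiece tokens;
        # average their vectors to get one static vector per word
        token_ids = tokenizer.encode(word, add_special_tokens=False)
        if token_ids:
            word_vec = embedding_matrix[token_ids].mean(dim=0).numpy()
        else:
            word_vec = np.zeros(model.config.hidden_size, dtype=np.float32)
        word_vec_list.append(word_vec)
    return np.array(word_vec_list)
```

If something like this is the right direction, one immediate consequence I can see is that `embedding_dim` would become BERT's hidden size (768 for the base checkpoints), so every place in model.py that consumes the embedding matrix would need that dimension updated.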
TL;DR: I want to feed the output of build_dataset.py to a BERT module. Thanks :)
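P.S. In case the fully contextual route is the better fit, this is roughly what I mean by feeding the built dataset to a BERT module (again just a sketch; the checkpoint name and the use of the last hidden state are my assumptions):

```python
# Rough sketch: encode one line from the built dataset with BERT to get
# contextual token vectors instead of a static lookup table.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

text = "..."  # placeholder: one line of text produced by build_dataset.py
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = bert(**inputs)
# shape (1, seq_len, 768): one contextual vector per WordPiece token,
# which would replace the output of the encoder's embedding lookup
contextual_vectors = outputs.last_hidden_state
```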