pkouris / abtextsum

Abstractive text summarization based on deep learning and semantic content generalization

Using BERT instead of the word_2_vec embedding #6

Open appledora opened 3 years ago

appledora commented 3 years ago

In model.py, we have the following method, which loads the static embeddings for the built datasets:

import numpy as np
from gensim.models import KeyedVectors


def get_init_embedding(int2word_dict, embedding_dim, word2vec_file):
    word_vectors = KeyedVectors.load_word2vec_format(word2vec_file)
    print(f"Shape of word_vectors {word_vectors.vectors.shape}")
    word_vec_list = list()
    for _, word in sorted(int2word_dict.items()):
        try:
            word = word.split(sep="_")[0]
            # Map named-entity placeholders to ordinary words before the lookup.
            if word in ['LOCATION']:
                word = 'জায়গা'
            elif word in ['PERSON']:
                word = 'লোক'
            elif word in ['ORGANIZATION']:
                word = 'প্রতিষ্ঠান'
            word_vec = word_vectors.word_vec(word)
        except KeyError:
            # Out-of-vocabulary words get a zero vector.
            word_vec = np.zeros([embedding_dim], dtype=np.float32)
        word_vec_list.append(word_vec)
    # Random vectors for <s> and </s>
    word_vec_list[2] = np.random.normal(0, 1, embedding_dim)
    word_vec_list[3] = np.random.normal(0, 1, embedding_dim)
    return np.array(word_vec_list)

I was wondering whether we could replace it with transformer-based pretrained weights. Since the internal architecture is basically an encoder-decoder structure, I presume this should be possible. I am just a bit confused about the places where I would have to make changes.

TL;DR: I want to feed the output of build_dataset.py to a BERT module. Thanks :)
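
For reference, a minimal sketch of one way to build static BERT embeddings for the same vocabulary, as a drop-in alternative to get_init_embedding. This assumes the HuggingFace transformers package and a multilingual checkpoint (bert-base-multilingual-cased, since the vocabulary here is Bengali); the function name get_init_bert_embedding and the mean-pooling of subword vectors are my own choices, not part of this repository:

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer


def get_init_bert_embedding(int2word_dict, model_name="bert-base-multilingual-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    embedding_layer = model.get_input_embeddings()  # WordPiece embedding matrix
    embedding_dim = embedding_layer.embedding_dim   # 768 for BERT-base

    word_vec_list = []
    with torch.no_grad():
        for _, word in sorted(int2word_dict.items()):
            word = word.split(sep="_")[0]
            # Map named-entity placeholders to ordinary words, as in the original.
            if word == 'LOCATION':
                word = 'জায়গা'
            elif word == 'PERSON':
                word = 'লোক'
            elif word == 'ORGANIZATION':
                word = 'প্রতিষ্ঠান'
            ids = tokenizer(word, add_special_tokens=False)["input_ids"]
            if ids:
                # Average the subword embeddings to get one vector per vocabulary word.
                vec = embedding_layer(torch.tensor(ids)).mean(dim=0).numpy()
            else:
                vec = np.zeros([embedding_dim], dtype=np.float32)
            word_vec_list.append(vec)

    # Random vectors for <s> and </s>, as in the original function.
    word_vec_list[2] = np.random.normal(0, 1, embedding_dim)
    word_vec_list[3] = np.random.normal(0, 1, embedding_dim)
    return np.array(word_vec_list, dtype=np.float32)

Note that BERT-base vectors are 768-dimensional, so the embedding_dim used elsewhere in the model would have to match (or the vectors would need to be projected). Also, this only swaps the initial embedding matrix; actually feeding the output of build_dataset.py through a full BERT encoder would mean replacing the encoder itself, which is a larger change.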