
BERTSimilar

Get Similar Words and Embeddings using BERT Models

BERTSimilar is used to get similar words and embeddings using BERT models. It uses the bert-base-cased model by default and cosine similarity to find the words closest to the given words.

BERT generates contextual word embeddings, so the embedding for the same word differs based on its context. For example, the word Apple has different embeddings in "Apple is a good fruit" and "Apple is a good phone". Generating contextual word embeddings for the entire English vocabulary is time-consuming and resource-intensive, so this library requires the vocabulary for generating word embeddings to be provided beforehand.
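To see this effect concretely, here is a minimal sketch, independent of BERTSimilar, that computes the two contextual embeddings of Apple using the Hugging Face transformers library and the same bert-base-cased model:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')

def word_embedding(sentence, word):
    # Return the hidden state of the first sub-token of `word` in `sentence`
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    return hidden[tokens.index(word)]

fruit = word_embedding('Apple is a good fruit', 'Apple')
phone = word_embedding('Apple is a good phone', 'Apple')
print(torch.cosine_similarity(fruit, phone, dim=0))  # noticeably below 1.0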

The vocabulary used to generate word embeddings can be provided in two ways: as Wikipedia pages or as text files (see Providing the Vocabulary below).

Install and Import

Install the Python package from PyPI (https://pypi.org/project/BERTSimilar/) using

pip install BERTSimilar

Import the module using

>>> from BERTSimilar import SimilarWords

Providing the Vocabulary

Provide the text (as paragraphs) so that the BERT model can generate word embeddings for all the words present in it.

Using Wikipedia Pages

1) Using Wikipedia page names as a list (the content of the pages will be taken as input and processed)

>>> wikipedia_pages = ['Apple', 'Apple Inc.']
>>> similar = SimilarWords().load_dataset(wikipedia_page_list=wikipedia_pages)

# To get the Wikipedia pages used,
>>> similar.wikipedia_dataset_info
{'Apple': 'https://en.wikipedia.org/wiki/Apple',
 'Apple Inc.': 'https://en.wikipedia.org/wiki/Apple_Inc.'}

2) Using a Wikipedia search query as a string (the content of the pages related to the query will be taken as input and processed)

# Get 5 Wikipedia pages based on the query
>>> similar = SimilarWords().load_dataset(wikipedia_query='Apple', wikipedia_query_limit=5)

# To get the Wikipedia pages used (duplicate pages are ignored),
>>> similar.wikipedia_dataset_info
{'Apple': 'https://en.wikipedia.org/wiki/Apple',
 'Apple Inc.': 'https://en.wikipedia.org/wiki/Apple_Inc.',
 'Apples to Apples': 'https://en.wikipedia.org/wiki/Apples_to_Apples',
 'MacOS': 'https://en.wikipedia.org/wiki/MacOS'}

3) Using Wikipedia search queries as a list (the content of the pages related to each query will be taken as input and processed)

# Get 5 Wikipedia pages based on each query
>>> similar = SimilarWords().load_dataset(wikipedia_query=['Apple', 'Banana'], wikipedia_query_limit=5)

# To get the Wikipedia pages used (duplicate pages are ignored),
>>> similar.wikipedia_dataset_info
{'Apple': 'https://en.wikipedia.org/wiki/Apple',
 'Apple Inc.': 'https://en.wikipedia.org/wiki/Apple_Inc.',
 'Apples to Apples': 'https://en.wikipedia.org/wiki/Apples_to_Apples',
 'MacOS': 'https://en.wikipedia.org/wiki/MacOS',
 'Banana': 'https://en.wikipedia.org/wiki/Banana',
 'Cooking banana': 'https://en.wikipedia.org/wiki/Cooking_banana',
 'Banana republic': 'https://en.wikipedia.org/wiki/Banana_republic',
 'Banana ketchup': 'https://en.wikipedia.org/wiki/Banana_ketchup'}

Using Text Files

Supported file extensions are .docx and .txt (for other file types, please convert them to a supported format first; a conversion sketch follows the examples below).

1) Using a single text file (the content of the file will be taken as input and processed)

>>> similar = SimilarWords().load_dataset(dataset_path='Book_1.docx')

2) Using multiple text files (the contents of each file will be taken as input and processed)

>>> similar = SimilarWords().load_dataset(dataset_path=['Book_1.docx','Book_1.txt'])
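For other formats, a small conversion step works. For example, here is a hypothetical helper (not part of BERTSimilar) that extracts text from a PDF with the pypdf library, assuming a file named Book_2.pdf:

from pypdf import PdfReader

def pdf_to_txt(pdf_path, txt_path):
    # Extract the text of every page and write it to a plain .txt file
    reader = PdfReader(pdf_path)
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(text)

pdf_to_txt('Book_2.pdf', 'Book_2.txt')  # then: load_dataset(dataset_path='Book_2.txt')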

SimilarWords() Parameters

You can pass parameters to SimilarWords() to customize the initialization, such as the BERT model to use (bert-base-cased by default) and defaults like max_ngram.
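For example, an initialization choosing a different model might look like the following; note that the parameter name model_name is illustrative only and not confirmed by the project documentation:

>>> from BERTSimilar import SimilarWords
>>> # `model_name` is a hypothetical parameter name used for illustration
>>> similar = SimilarWords(model_name='bert-large-cased')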

Find Similar Words

Similar words are generated using the find_similar_words method. It computes the cosine similarity between the average embedding of the input words (conditioned on the given context) and the embeddings of all words in the given vocabulary. The method returns the similar words along with the embedding used to select them; this embedding is the representation of the input words and context. Its parameters include input_words, input_context, output_words_ngram, and max_output_words (see the demos below).
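The following is a schematic of that computation, a sketch rather than the library's actual code: average the input-word embeddings into a query vector, then rank every vocabulary word by cosine similarity to it.

import numpy as np

def rank_by_cosine(input_vectors, vocab_words, vocab_vectors, top_k=10):
    # Average the input-word embeddings into a single query vector
    query = np.mean(input_vectors, axis=0)
    # Cosine similarity between the query and every vocabulary embedding
    sims = {
        word: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for word, vec in zip(vocab_words, vocab_vectors)
    }
    # Highest-scoring words first, like the dictionaries shown in the demos below
    return dict(sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_k])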

Demo

Example 1

>>> from BERTSimilar import SimilarWords
>>> similar = SimilarWords().load_dataset(wikipedia_query='Apple', wikipedia_query_limit=5)

>>> words, embedding = similar.find_similar_words(input_context='company',input_words=['Apple'])
>>> words
{'iPhone': 0.7655301993367924,
 'Microsoft': 0.7644559773925612,
 'Samsung': 0.7483747939272186,
 'Nokia': 0.7418908483628721,
 'Macintosh': 0.7415292245659537,
 'iOS': 0.7409453358937249,
 'AppleCare': 0.7381210698272941,
 'iPadOS': 0.7112217377139232,
 'iTunes': 0.7007508157223745,
 'macOS': 0.69984740983893}

>>> words, embedding = similar.find_similar_words(input_context='fruit',input_words=['Apple'])
>>> words
{'applejack': 0.8045216200651304,
 'Trees': 0.7926505935113519,
 'trees': 0.7806807879003239,
 'berries': 0.7689437435792672,
 'seeds': 0.7540070238557037,
 'peaches': 0.7381803534675645,
 'Orange': 0.733131237417253,
 'orchards': 0.7296196594053761,
 'juice': 0.7247635163014543,
 'nuts': 0.724424004884171}

Example 2

>>> from BERTSimilar import SimilarWords
>>> similar = SimilarWords().load_dataset(wikipedia_query='Tesla', wikipedia_query_limit=10)

>>> words, embedding = similar.find_similar_words(input_context='Tesla Motors', input_words=['CEO'], output_words_ngram=5, max_output_words=5)
>>> words
{'Chief Executive Elon Musk handing': 0.7596588355056113,
 '2018 CEO Elon Musk briefly': 0.751011374230985,
 'August 2018 CEO Elon Musk': 0.7492089016517951,
 '2021 CEO Elon Musk revealed': 0.7470401856896459,
 'SEC questioned Tesla CFO Zach': 0.738144930474394}

>>> words, embedding = similar.find_similar_words(input_words=['Nikola Tesla'], output_words_ngram=0, max_output_words=5)
>>> words
{'Tesla Nikola Tesla Corner': 0.9203870154998232,
 'IEEEThe Nikola Tesla Memorial': 0.8932847992637643,
 'electrical engineer Nikola Tesla': 0.8811208719958945,
 'Serbian American inventor Nikola Tesla': 0.8766566716046287,
 'Nikola Tesla Technical Museum': 0.8759513407776292}

SimilarWords() Attributes

These attributes can be used to read or modify the default values of the SimilarWords class.

To get the values of the attributes,

>>> similar = SimilarWords().load_dataset(dataset_path='Book_1.docx')

# This will return all the words
>>> similar.bert_words

# This will return the embeddings for all the words
>>> similar.bert_vectors
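Assuming bert_words and bert_vectors are parallel sequences (as the comments above suggest), the embedding of a particular occurrence of a word can be looked up by index:

>>> # 'apple' is a hypothetical word assumed to occur in Book_1.docx
>>> index = similar.bert_words.index('apple')
>>> embedding = similar.bert_vectors[index]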

To change the values of the attributes,

>>> similar = SimilarWords()
>>> similar.max_ngram
10

>>> similar.max_ngram = 12
>>> similar = similar.load_dataset(dataset_path='Book_1.docx')
>>> similar.max_ngram
12