josh-ashkinaze / Normalized-Google-Distance

A Python script to calculate the Normalized Google Distance (NGD), a semantic similarity metric based on Google search result counts.

Normalized Google Distance

Read the formal write-up in the original paper: https://arxiv.org/pdf/cs/0412098.pdf

About

The Normalized Google Distance (NGD) is a semantic similarity measure, calculated based on the number of hits returned by Google for a set of keywords. If keywords have many pages in common relative to their respective, independent frequencies, then these keywords are thought to be semantically similar.

If two search terms w1 and w2 never occur together on the same web page, but do occur separately, the NGD between them is infinite.

Conversely, if both terms always occur together, and only occur together, their NGD is zero.
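The two limiting cases above follow from the formula in the original paper: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts matching pages and N is the total number of indexed pages. Here is a minimal sketch of that formula, working from hit counts you already have rather than querying Google. The name `ngd_from_counts` and its parameters are hypothetical and are not part of this script, and the value of N is an assumption you have to supply.

```python
import math


def ngd_from_counts(fx, fy, fxy, n):
    """NGD from raw hit counts (hypothetical helper, not this script's API).

    fx, fy -- number of pages containing each term on its own
    fxy    -- number of pages containing both terms
    n      -- assumed total number of indexed pages
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of the two hit counts")
    if fxy <= 0:
        # The terms never co-occur, so the distance is infinite.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

With `fx == fy == fxy` the numerator vanishes and the function returns zero. With `fxy == 0` it returns `math.inf`.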

This script provides some useful functions for working with NGD.

Methods

Simple NGD

To compute the NGD between two words:

ngd = calculate_NGD("w1", "w2")

Pairwise NGD

As a dictionary

To compute pairwise NGDs (for example, a distance matrix over a list of political candidates):

L = ["w1", "w2", "w3"]
distances = pairwise_NGD(L)

This will return a nested dictionary, where distances[i][j] = NGD(L_i, L_j)
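To illustrate the shape of such a nested dictionary, here is a rough sketch that builds one from any two-argument distance function. `pairwise_distances` is a hypothetical helper, not this script's implementation. It assumes the dictionary is keyed by the words themselves, that the distance is symmetric, and that the distance from a word to itself is zero. A toy distance stands in for NGD so the sketch runs without any web requests.

```python
from itertools import combinations


def pairwise_distances(words, dist):
    """Build a symmetric nested dict of distances (hypothetical helper)."""
    # Assumption: the distance from a word to itself is zero.
    result = {w: {w: 0.0} for w in words}
    for a, b in combinations(words, 2):
        d = dist(a, b)  # one call per unordered pair
        result[a][b] = d
        result[b][a] = d
    return result


# Toy distance so the sketch runs offline: difference in word length.
toy = pairwise_distances(
    ["cat", "horse", "ox"],
    lambda a, b: float(abs(len(a) - len(b))),
)
```

Calling the distance function once per unordered pair matters when each call triggers search queries, since it roughly halves the number of lookups.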

As a dataframe

To return the above matrix as a dataframe:

distances = pairwise_NGD(L)
matrix_df = pairwise_NGD_to_df(distances)
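If you only have a nested dictionary and want the same kind of matrix, pandas can build it directly. This is a sketch of one way to do it, not necessarily how `pairwise_NGD_to_df` works internally, and the distance values below are placeholders rather than real NGD results.

```python
import pandas as pd

# Placeholder values for illustration only; not real NGD results.
nested = {
    "w1": {"w1": 0.0, "w2": 0.3, "w3": 0.7},
    "w2": {"w1": 0.3, "w2": 0.0, "w3": 0.5},
    "w3": {"w1": 0.7, "w2": 0.5, "w3": 0.0},
}

# Outer keys become the row index, inner keys become the columns.
df = pd.DataFrame.from_dict(nested, orient="index")
```

A single distance can then be looked up with `df.loc["w1", "w2"]`.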

Warnings

Case Study

Here's a case study I did, looking at how the media talks about political candidates.