Read the formal writeup in the original paper: https://arxiv.org/pdf/cs/0412098.pdf
The Normalized Google Distance (NGD) is a semantic similarity measure, calculated based on the number of hits returned by Google for a set of keywords. If keywords have many pages in common relative to their respective, independent frequencies, then these keywords are thought to be semantically similar.
If two search terms w1 and w2 never occur together on the same web page, but do occur separately, the NGD between them is infinite.
Conversely, if both terms always occur together, and only occur together, their NGD is zero.
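Both limiting cases follow from the formula in the original paper, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f is a hit count and N is the number of pages indexed. Here is a minimal sketch of that formula, assuming the hit counts are already known. The name `ngd_from_counts` and its parameters are illustrative and are not part of this script.

```python
import math

def ngd_from_counts(fx, fy, fxy, n):
    """Illustrative NGD from raw hit counts (hypothetical helper, not this script's API).

    fx, fy: hits for each term alone
    fxy:    hits for pages containing both terms
    n:      total number of indexed pages (an assumed constant)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if fxy == 0:
        # Terms occur separately but never together: distance is infinite.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Terms that only ever occur together have distance zero.
same = ngd_from_counts(100, 100, 100, 10**6)
# Terms that never co-occur have infinite distance.
never = ngd_from_counts(100, 200, 0, 10**6)
```

Taking logarithms of the counts, rather than using the counts directly, is what makes the measure relatively insensitive to the exact value chosen for N.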
This script provides some useful functions for working with NGD.
To compute the NGD between two words:
ngd = calculate_NGD("w1", "w2")
To compute pairwise NGDs (e.g., computing the NGD for a matrix of political candidates):
L = ["w1", "w2", "w3"]
distances = pairwise_NGD(L)
This will return a nested dictionary, where distances[i][j] = NGD(L_i, L_j)
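The nested-dictionary shape can be sketched as follows. This is not the script's actual implementation: `pairwise_sketch` and `toy_distance` are hypothetical names, the keys are assumed to be the keywords themselves, and the distance is assumed to be symmetric (as the NGD formula is), so each unordered pair is computed only once.

```python
from itertools import combinations

def pairwise_sketch(words, distance):
    """Build a nested dict d[a][b] = distance(a, b) for every distinct pair."""
    result = {w: {} for w in words}
    for a, b in combinations(words, 2):
        d = distance(a, b)
        result[a][b] = d
        result[b][a] = d  # reuse the value; assumes distance(a, b) == distance(b, a)
    return result

def toy_distance(a, b):
    # Hypothetical stand-in for an NGD lookup, so the sketch runs offline.
    return abs(len(a) - len(b))

words = ["w1", "w22", "w333"]
table = pairwise_sketch(words, toy_distance)

# Number of distinct comparisons actually computed for this list.
n_pairs = len(list(combinations(words, 2)))
```

Computing each unordered pair once matters here because every comparison would otherwise cost additional search queries.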
To return the above matrix as a dataframe:
distances = pairwise_NGD(L)
matrix_df = pairwise_NGD_to_df(distances)
Note that for a list of n keywords, there are n(n-1)/2 distinct comparisons.
Here's a case study I did, looking at how the media talks about political candidates.