nicoDs96 / Document-Similarity-using-Python-and-PySpark

Document Similarity with Apache Spark using Locality Sensitive Hashing and Python
https://nicods96.github.io/hi//fast-document-similarity-in-python/

Document Similarity using Spark, Python and Web Scraping

In this repository we check the similarity between Kijiji ads. The data are first processed using classical programming, and then the code is re-implemented using the MapReduce framework with PySpark (Apache Spark for Python).

You can read more about it here.

Web Scraping

This script creates a .csv dataset with the fields: title, short description, location, price, timestamp, url.
The code is quite general, but the function def adv_extractor_to_dataset(adv_container, dest_file) is highly specific to the website we are scraping. It takes as input the HTML containing all the data and the file to write the extracted data to. This function can be considered a sample upon which you can build your own data extractor.
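As a rough sketch of what such an extractor can look like, here is a minimal version using Beautiful Soup 4 from the dependency list. The tag names, class names, and sample HTML below are hypothetical placeholders and must be adapted to the markup of the site you are scraping:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical sample of an ad container; real markup will differ.
SAMPLE_HTML = """
<div class="ad-item">
  <a class="title" href="/ad/1">Bike for sale</a>
  <p class="description">Nearly new mountain bike</p>
  <span class="location">Rome</span>
  <span class="price">120</span>
  <span class="timestamp">2019-01-01</span>
</div>
"""

def adv_extractor_to_dataset(adv_container, dest_file):
    # Parse the HTML block that contains all the ads and append one CSV
    # row per ad, with the fields: title, short description, location,
    # price, timestamp, url.
    soup = BeautifulSoup(adv_container, "html.parser")
    with open(dest_file, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for ad in soup.find_all("div", class_="ad-item"):
            title = ad.find("a", class_="title")
            writer.writerow([
                title.get_text(strip=True),
                ad.find("p", class_="description").get_text(strip=True),
                ad.find("span", class_="location").get_text(strip=True),
                ad.find("span", class_="price").get_text(strip=True),
                ad.find("span", class_="timestamp").get_text(strip=True),
                title["href"],
            ])

adv_extractor_to_dataset(SAMPLE_HTML, "ads.csv")
with open("ads.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
```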

Dependencies

  1. Beautiful Soup 4
  2. Pandas

Preprocessing data with NLTK

For the linguistic analysis we used the NLTK Python library.

  1. First we have to preprocess the data. For each announcement (downloaded during web scraping) we build a textual field (e.g. title + full description). Preprocessing consists of stopword removal, normalization, and stemming.
  2. The next step is to build a search-engine index. First we build an inverted index and store it in a file. To build an index that allows proximity queries using the cosine-similarity measure, we chose the fast cosine similarity algorithm (check the .ipynb for details). Then we have some query-processing code which, given some terms, returns the most related announcements.

    Code is provided as a notebook, so we have both text and code with output.
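The two steps above can be sketched as follows. This is a self-contained toy version: the tiny stopword set and the trivial suffix-stripping "stemmer" stand in for NLTK's stopword corpus and stemmer used in the notebook, and the scoring follows the term-at-a-time "fast cosine" idea of accumulating scores only over the postings of the query terms:

```python
import math
import re
from collections import Counter, defaultdict

# Stand-in for NLTK's stopword list.
STOPWORDS = {"the", "a", "an", "and", "for", "in", "of", "to"}

def preprocess(text):
    # Normalization, tokenization, stopword removal, toy stemming.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t.rstrip("s") for t in tokens if t not in STOPWORDS]

def build_index(docs):
    # Inverted index: term -> {doc_id: tf-idf weight}, plus per-doc norms.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    n = len(docs)
    norms = defaultdict(float)
    for term, postings in index.items():
        idf = math.log(n / len(postings))
        for doc_id, tf in postings.items():
            w = tf * idf
            postings[doc_id] = w
            norms[doc_id] += w * w
    return index, {d: math.sqrt(v) for d, v in norms.items()}

def query(q, index, norms, k=3):
    # Fast cosine: score only the docs that appear in the postings
    # of the query terms, then normalize by document length.
    scores = defaultdict(float)
    for term in preprocess(q):
        for doc_id, w in index.get(term, {}).items():
            scores[doc_id] += w
    for doc_id in scores:
        if norms[doc_id]:
            scores[doc_id] /= norms[doc_id]
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = {1: "Bike for sale in Rome",
        2: "Apartment for rent in Milan",
        3: "Mountain bike, cheap"}
index, norms = build_index(docs)
```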

Dependencies

  1. Pandas
  2. NLTK

Shingling, Minwise Hashing, and Locality-Sensitive Hashing

Here we implement nearest-neighbor search for text documents. The implementation comprises shingling, minwise hashing, and locality-sensitive hashing. We split it into several parts.
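A minimal sketch of the first two building blocks, character k-shingling and minwise hashing. The hash family and parameters below are illustrative, not the notebook's exact choices:

```python
import random
import zlib

def shingles(text, k=5):
    # k-shingles: the set of all length-k character substrings.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100, seed=0):
    # Approximate the Jaccard similarity between shingle sets with
    # num_hashes random linear hash functions (a*x + b) mod P.
    rng = random.Random(seed)
    P = 2**31 - 1  # a Mersenne prime
    params = [(rng.randrange(1, P), rng.randrange(P))
              for _ in range(num_hashes)]
    return [min((a * zlib.crc32(s.encode()) + b) % P for s in shingle_set)
            for a, b in params]

def estimated_jaccard(sig1, sig2):
    # Fraction of hash functions whose minimums agree.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over a lazy dog"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
```

Documents sharing many shingles agree on many of the per-hash minimums, so the agreement rate estimates the true Jaccard similarity of the shingle sets.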

PySpark Implementation of LSH

Same principle, but using MapReduce functions and a key-value programming style to process the data.
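To illustrate the key-value style without requiring a Spark installation, here is the LSH banding step written as a plain-Python map stage followed by a group-by-key stage; in a PySpark notebook each stage would be an RDD transformation such as flatMap and groupByKey. The function names and the band layout are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def group_by_key(pairs):
    # Emulates Spark's groupByKey on a list of (key, value) pairs.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped.items()

def lsh_candidates(signatures, bands, rows):
    # signatures: {doc_id: minhash signature of length bands * rows}.
    # Map stage: emit ((band_index, band_content), doc_id) per band,
    # so documents identical on a band land in the same bucket.
    pairs = []
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            pairs.append(((b, band), doc_id))
    # Reduce stage: documents sharing a bucket in any band become
    # candidate pairs for an exact similarity check.
    candidates = set()
    for _, doc_ids in group_by_key(pairs):
        for pair in combinations(sorted(doc_ids), 2):
            candidates.add(pair)
    return candidates

sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 8, 9, 9]}
cands = lsh_candidates(sigs, bands=2, rows=2)
```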

Dependencies

  1. PySpark
  2. findspark