neoVincent / Document_Similarity

Large-scale document similarity analysis on Spark

Document Similarity

Objective: get familiar with Big Data technology (Spark) and NLP

Workflow

NLP tasks

note:

  • the pre-trained GloVe model is used as the vector space
  • NLP-related functions are located in core.py
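The GloVe lookup amounts to mapping each token to a fixed vector. A minimal sketch with a tiny in-memory table standing in for the real pre-trained GloVe file (the table contents and the OOV handling here are assumptions, not the project's exact core.py code):

```python
import numpy as np

# Hypothetical stand-in for the pre-trained GloVe table; the real project
# would load vectors from a GloVe text file instead.
GLOVE = {
    "big":  np.array([0.1, 0.3]),
    "data": np.array([0.2, 0.1]),
}
UNK = np.zeros(2)  # fallback vector for out-of-vocabulary tokens

def token_vectors(tokens):
    """Map each (lower-cased) token to its GloVe vector; OOV -> zeros."""
    return [GLOVE.get(t.lower(), UNK) for t in tokens]

vecs = token_vectors(["Big", "data", "rocks"])
```

Real GloVe vectors have 50-300 dimensions; two are used here only to keep the example readable.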

Modules

core

responsible for the NLP tasks

mySpark

responsible for the Spark environment setup

doc2vec

calculates the document vector
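A common way to build a document vector from token vectors is mean-pooling; a sketch under that assumption (the doc2vec module's exact pooling may differ):

```python
import numpy as np

def doc_vector(word_vecs, dim=2):
    """Average the token vectors into one fixed-size document vector.
    Returns a zero vector for an empty document."""
    if not word_vecs:
        return np.zeros(dim)
    return np.mean(word_vecs, axis=0)

dv = doc_vector([np.array([1.0, 3.0]), np.array([3.0, 1.0])])
# dv == [2.0, 2.0]
```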

similarity

gets the top-k most similar documents in the database, as (cosine value, doc index) pairs
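The top-k step can be sketched as a cosine-similarity ranking over a matrix of document vectors (function and variable names here are illustrative, not the module's actual API):

```python
import numpy as np

def top_k_similar(query, doc_matrix, k=2):
    """Return the top-k (cosine value, doc index) pairs for a query vector.
    doc_matrix holds one document vector per row."""
    q = query / np.linalg.norm(query)
    m = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every document
    idx = np.argsort(sims)[::-1][:k]  # indices of the k highest scores
    return [(float(sims[i]), int(i)) for i in idx]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = top_k_similar(np.array([1.0, 0.0]), docs, k=2)
# result == [(1.0, 0), (0.707..., 2)]
```

On Spark the same computation would be distributed (e.g. mapping over an RDD of document vectors), but the per-document math is identical.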

db

database configuration

spaCy

an NLP package

Models & Languages

Load model

python3 -m spacy download en_core_web_sm 

Get model info

print(spacy.info("en_core_web_sm"))

  • en_core_web_sm: doesn't come with a word vectors table
  • en_core_web_md: has a relatively small number of vectors (between 10k and 20k)

Spark

Set up

Troubleshooting

Spark unable to load the conf

try setting SPARK_HOME='/usr/local/bin/***/libexec' (see https://spark.apache.org/docs/latest/configuration.html#environment-variables)

Spark uses a different Python version

Set the following environment variables in .bash_profile

note:

  • if using zsh instead of bash:
    • add source ~/.bash_profile to .zshrc
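The variables themselves aren't listed above; a typical .bash_profile setup that points both the Spark driver and the workers at the same Python 3 interpreter looks like this (PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are standard Spark variables; the SPARK_HOME path is a placeholder to adjust for your install):

```shell
# Make driver and workers use the same Python 3 interpreter
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Placeholder path: must point at your actual Spark installation
export SPARK_HOME=/path/to/spark/libexec
export PATH="$SPARK_HOME/bin:$PATH"
```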

load-spark-env.sh: Permission denied

set SPARK_HOME='/usr/local/bin/***/libexec' (see Stack Overflow)

objc_initializeAfterForkError

This is a fork-safety issue with multi-threaded Python on macOS. As a workaround, set the following environment variable: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES (see Stack Overflow)

unable to log in to the DB using a password

set auth_plugin='mysql_native_password' when connecting to the DB
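With mysql-connector-python the plugin is passed straight to connect(). A sketch (host, user, password, and database names are placeholders; the lazy import just keeps the settings inspectable without the driver installed):

```python
# Connection settings; auth_plugin forces the classic password scheme,
# which avoids login failures against MySQL 8's default
# caching_sha2_password plugin.
conn_kwargs = {
    "host": "localhost",       # placeholder
    "user": "app_user",        # placeholder
    "password": "secret",      # placeholder
    "database": "documents",   # placeholder
    "auth_plugin": "mysql_native_password",
}

def get_connection():
    # Imported lazily so the dict above can be used/tested without
    # mysql-connector-python installed.
    import mysql.connector
    return mysql.connector.connect(**conn_kwargs)
```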

read a BLOB into a NumPy array

remember to add these values when connecting to the DB
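A BLOB column comes back as raw bytes, and NumPy can reinterpret those bytes directly, provided the dtype used when the vector was stored is known (float32 here is an assumption; it must match whatever the writer used):

```python
import numpy as np

# Simulate a BLOB fetched from the database: the raw bytes of a stored vector.
original = np.array([0.5, 1.5, 2.5], dtype=np.float32)
blob = original.tobytes()  # what would live in the BLOB column

# Reconstruct the vector; the dtype must match what was written.
restored = np.frombuffer(blob, dtype=np.float32)
```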