Objective: Get familiar with Big data technology (Spark) and NLP
NLP task
note:
- the trained GloVe model is used as the vector space
- NLP-related functions are located in core.py, which is responsible for the NLP tasks; a separate module is responsible for the Spark environment setup

- calculate the document vector
- get the top k similar documents in the database, returned as (cosine value, doc index)
- database configuration
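The document-vector and top-k-similarity functions described above can be sketched as follows. This is a minimal stand-alone illustration, not the project's actual core.py: the function names, the averaging strategy, and the toy word-vector dictionary are all assumptions; a real implementation would look tokens up in the trained GloVe model.

```python
import math

def document_vector(tokens, word_vectors, dim):
    """Average the word vectors of in-vocabulary tokens
    (hypothetical helper; `word_vectors` stands in for a GloVe lookup)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(vals) / len(vecs) for vals in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_similar(query_vec, doc_vecs, k):
    """Return the k most similar documents as (cosine value, doc index) pairs,
    matching the return shape described in the notes."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]
```

For example, with 2-dimensional toy vectors, `top_k_similar(document_vector(["spark", "nlp"], vectors, 2), doc_vecs, 2)` returns the two closest documents by cosine similarity.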
NLP packages
load model
python3 -m spacy download en_core_web_sm
Get model info
print(spacy.info("en_core_web_sm"))
- en_core_web_sm: doesn't come with a word-vectors table
- en_core_web_md: has a relatively small number of vectors (between 10k and 20k)
/usr/local/bin/python3
maybe set SPARK_HOME='/usr/local/bin/***/libexec' (see https://spark.apache.org/docs/latest/configuration.html#environment-variables)
Set the following environment variables in .bash_profile
note:
- use zsh instead of bash
- add `source ~/.bash_profile` in ~/.zshrc
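The environment-variable setup above could look like the sketch below. The exact Spark path is an assumption (the `***` placeholder must be replaced with the actual install directory), and pointing both PySpark variables at `/usr/local/bin/python3` is inferred from the interpreter path noted earlier:

```shell
# Example ~/.bash_profile entries (paths are assumptions; adjust to your install)
export SPARK_HOME='/usr/local/bin/***/libexec'   # replace *** with your Spark install directory
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
```

Since zsh is the login shell, add `source ~/.bash_profile` to ~/.zshrc so these exports are picked up in new terminals.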
set SPARK_HOME='/usr/local/bin/***/libexec' (from Stack Overflow)
This is a fork-safety issue with multi-threaded Python on macOS. Set the following environment variable as a workaround (from Stack Overflow):
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
Set auth_plugin='mysql_native_password' when connecting to the database. Remember to pass these values when connecting:
import mysql.connector

conn = mysql.connector.connect(
    host=jdbcHostname,
    user=jdbcuser,
    passwd=jdbcpwd,
    database=jdbcDatabase,
    charset='utf8',
    use_pure=True,                           # boolean, not the string 'True'
    auth_plugin='mysql_native_password')     # avoids caching_sha2_password auth errors
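Once connected, queries go through the standard Python DB-API cursor interface that mysql.connector implements. The sketch below uses the stdlib sqlite3 module as a server-free stand-in (the table name and columns are made up for illustration); the cursor calls are the same, except that mysql.connector uses `%s` placeholders instead of sqlite3's `?`:

```python
import sqlite3

# In-memory stand-in for the MySQL database; mysql.connector's
# connection and cursor objects expose the same DB-API methods.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
# Parameterized insert; with mysql.connector this would be "VALUES (%s)"
cur.execute("INSERT INTO docs (body) VALUES (?)", ("hello spark",))
conn.commit()
cur.execute("SELECT id, body FROM docs")
rows = cur.fetchall()
conn.close()
```

Always use parameterized queries like this rather than string formatting, so user input cannot inject SQL.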