Implementation of Skip-gram Dimensionality Selection via information criteria (SNML, AIC, BIC).
Please make sure the following are installed on your machine:
python3.7
pip
pip install -r requirements.txt
We train models on multiple servers and save the results to a GCS bucket. Please create an env.ini file that stores the bucket access settings.
The env.ini file should be placed in the root directory and should contain the following:
[GCS]
sync = no
project_id = xxx
bucket = xxx
app_credential = xxx
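As a rough illustration of how such a file can be consumed, the sketch below reads the [GCS] section with Python's standard configparser module. The function name load_gcs_settings is hypothetical and is not taken from this repository; the actual code may read the file differently.

```python
import configparser


def load_gcs_settings(path="env.ini"):
    """Return the GCS settings as a dict, or None when syncing is disabled.

    Hypothetical helper for illustration only; not part of this repository.
    """
    parser = configparser.ConfigParser()
    # parser.read returns the list of files it managed to open
    if not parser.read(path) or not parser.has_section("GCS"):
        return None
    section = parser["GCS"]
    # getboolean understands yes/no, true/false, on/off and 1/0
    if not section.getboolean("sync", fallback=False):
        return None
    return {
        "project_id": section.get("project_id"),
        "bucket": section.get("bucket"),
        "app_credential": section.get("app_credential"),
    }


if __name__ == "__main__":
    settings = load_gcs_settings()
    print("GCS sync disabled" if settings is None else settings)
```

With sync = no, as in the example file above, this helper returns None and no bucket access would be attempted.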
Configs:
sync: set to no if you do not want to use GCS.
Artificial data is generated using Jupyter notebooks. Please refer to the notebooks below for details of the data generation process.
notebooks/Generate context distributions.jpynb
notebooks/Generate context distributions - SGNS.jpynb
Run preprocess.py to preprocess the data. The script takes a .txt file as input.
Please remove special characters such as . , : ? from the text file beforehand.
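One possible way to do this cleaning is sketched below. It assumes that only ASCII letters, digits and whitespace should be kept, which may be stricter or looser than what your corpus needs; the function name strip_special_characters is hypothetical and not part of this repository.

```python
import re


def strip_special_characters(text):
    """Drop punctuation and other special characters from raw text.

    Assumption: only ASCII letters, digits and whitespace are kept.
    """
    # Replace every character that is not a letter, digit or whitespace with a space
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_special_characters("Hello, world: is this ok?"))
```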
Parameters:
python preprocess.py --input text8 --output data/text8 --batch_size 1000 --window_size 5
Other preprocessing parameters, such as the subsampling threshold, can be set in config.ini.
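For orientation, the --window_size option presumably controls how many neighbouring words on each side of a target word count as its context. The sketch below shows the standard way Skip-gram (target, context) pairs are built from a token list with a fixed symmetric window. It is an assumption about what the preprocessing does, not a copy of preprocess.py, and the function name skipgram_pairs is hypothetical.

```python
def skipgram_pairs(tokens, window_size):
    """Build (target, context) pairs using a fixed symmetric window.

    Illustrative only; the repository's preprocessing may differ
    (for example, by subsampling frequent words first).
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the window to the boundaries of the token list
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


if __name__ == "__main__":
    print(skipgram_pairs(["a", "b", "c"], window_size=1))
```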
The preprocessed data can be used to train Skip-gram. The training commands are described below:
The original Skip-Gram model should be trained on GPUs; we use TensorFlow to train this model.
Run tf_based/train.py to train this model.
Example:
python tf_based/train.py --input_path data/text8/ --batch_size 10 --output_path output/text8/ --epochs 1 --n_embedding 5
See config.ini and tf_based/train.py for more parameter settings.
The Skip-Gram Negative Sampling model is trained with NumPy. The training process needs a context distribution from which to draw negative samples.
The context distribution can be obtained by running utils/context_distribution_from_raw.py.
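To make the role of the context distribution concrete, here is a minimal NumPy sketch that estimates it from context word counts and then draws negative samples from it. This is not the code in utils/context_distribution_from_raw.py. The function names and the optional power parameter are assumptions (word2vec commonly raises counts to the power 0.75; the default here is plain relative frequency).

```python
import numpy as np


def context_distribution(context_ids, vocab_size, power=1.0):
    """Estimate a probability distribution over context word ids.

    Assumption: probabilities are proportional to count ** power.
    """
    counts = np.bincount(context_ids, minlength=vocab_size).astype(np.float64)
    weights = counts ** power
    return weights / weights.sum()


def sample_negatives(distribution, n_samples, rng):
    """Draw negative sample ids according to the context distribution."""
    return rng.choice(len(distribution), size=n_samples, p=distribution)


if __name__ == "__main__":
    dist = context_distribution([0, 0, 1, 2], vocab_size=4)
    rng = np.random.default_rng(0)
    print(dist, sample_negatives(dist, 5, rng))
```

Words that never occur as a context receive probability zero and are therefore never drawn as negatives.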
Run np_based/train.py to train this model.
Example:
python np_based/train.py --input_path data/text8/ --batch_size 10 --output_path output/text8/ --epochs 1 --n_embedding 5
See config.ini and np_based/train.py for more parameter settings.
AIC and BIC for the original Skip-Gram and Skip-Gram Negative Sampling models can be estimated with the following programs:
Original Skip-Gram:
python tf_based/run_aic_bic.py
Skip-Gram Negative Sampling:
python np_based/run_aic_bic.py
See each Python file for its parameter settings.
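For reference, the two criteria have the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of training samples and ln L the maximized log-likelihood. The sketch below computes them directly. It is not the logic of run_aic_bic.py, and the parameter count of 2 * vocab_size * n_embedding (one target and one context embedding matrix) is an assumption about how the models are parameterized.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def skipgram_param_count(vocab_size, n_embedding):
    """Assumed parameter count: target and context embedding matrices."""
    return 2 * vocab_size * n_embedding


if __name__ == "__main__":
    k = skipgram_param_count(vocab_size=1000, n_embedding=5)
    # The log-likelihood and sample count below are made-up illustrative inputs
    print(aic(-50000.0, k), bic(-50000.0, k, n_samples=100000))
```

Under either criterion, the candidate dimensionality with the lowest score is the one selected.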
SNML for the original Skip-Gram and Skip-Gram Negative Sampling models can be estimated with the following programs:
Original Skip-Gram:
python tf_based/snml/tf_based/train_snml.py
Skip-Gram Negative Sampling:
python np_based/train_snml.py
Parameters: