truythu169 / snml-skip-gram

Applying Sequentially Normalized Maximum Likelihood in Skip-gram model
1 stars 0 forks source link

Applying information criteria in Skip-gram dimensionality selection

Implementation of Skip-gram Dimensionality Selection via information criteria (SNML, AIC, BIC).

Requirement

Please make sure your computer has installed these programs below:

python3.7
pip

Set up

Google cloud storage (GCS) set up

We train models on multiple servers but save the result on a GCS bucket. Please create an env.ini file to store access to the bucket. The env.ini file should be place in the root directory. The file should include content as following:

[GCS]
sync = no
project_id = xxx
bucket = xxx
app_credential = xxx

Configs:

1. Preprocess data

1.1. Artificial data

Artificial data is generated using jupyter notebooks. Please refer to to notebooks below for the details of the data generation process.

1.2. Text data

Run prepocess.py file to prepocess data. This file takes .txt file as input. Please remove special characters such as .,:? etc in the text file.
Parameters:

2. Train Skip-gram

Data after preprocess step can be use to train Skip-gram. Training commands are described as below:

2.1. Train original Skip-Gram model

Original Skip-Gram model should be trained using GPUs, we use tensorflow to train this model. Run tf_based/train.py to train this model.
Example:

python tf_based/train.py --input_path data/text8/ --batch_size 10 --output_path output/text8/ --epochs 1 --n_embedding 5

See config.ini and tf_based/train.py for more parameters settings.

2.2. Train Skip-Gram model with Negative Sampling

Skip-Gram Negative Sampling model is trained with numpy. Training process need context distribution to sample negative samples. Context distribution can be achieved by runing: utils/context_distribution_from_raw.py.
Run np_based/train.py to train this model.
Example:

python np_based/train.py --input_path data/text8/ --batch_size 10 --output_path output/text8/ --epochs 1 --n_embedding 5

See config.ini and np_based/train.py for more parameters settings.

3. Estimate AIC & BIC

Estimating AIC & BIC for original Skip-Gram and Skip-Gram Negative Sampling by following programs:

See each python file for parameters setting.

4. Estimate SNML codelength

Estimating SNML for original Skip-Gram and Skip-Gram Negative Sampling by following programs:

Parameters: