vgaraujov / CPC-NLP-PyTorch

Implementation of Contrastive Predictive Coding for Natural Language

Contrastive Predictive Coding for Natural Language

This repository contains a PyTorch implementation of CPC v1 for Natural Language (section 3.3) from the paper Representation Learning with Contrastive Predictive Coding.

Implementation Details

I followed the details described in section 3.3 and obtained the missing details directly from one of the paper's authors. The model consists of the following components; a rough code sketch follows the list.

- Embedding layer
- Encoder layer (g_enc)
- Recurrent layer (g_ar)
- Prediction layer {W_k}
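
As a quick orientation, here is a minimal PyTorch sketch of how these pieces might fit together following the description in section 3.3 (an embedding layer, a 1D-convolution + ReLU + mean-pooling encoder g_enc, a GRU as g_ar, and one linear matrix W_k per prediction step). The class name, dimensions, and kernel size are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class CPCTextModel(nn.Module):
    """Sketch of the CPC v1 text model; names and sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=620, z_dim=2400, c_dim=2400,
                 k_steps=3, kernel_size=3):
        super().__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Encoder layer (g_enc): 1D-convolution + ReLU + mean-pooling over tokens
        self.g_enc = nn.Conv1d(emb_dim, z_dim, kernel_size, padding=kernel_size // 2)
        # Recurrent layer (g_ar): GRU over the sequence of sentence vectors
        self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)
        # Prediction layer {W_k}: one linear map per future step k
        self.W_k = nn.ModuleList([nn.Linear(c_dim, z_dim, bias=False)
                                  for _ in range(k_steps)])

    def encode(self, tokens):
        # tokens: (batch, sent_len) token ids of one sentence per row
        e = self.embedding(tokens).transpose(1, 2)   # (batch, emb_dim, sent_len)
        z = torch.relu(self.g_enc(e)).mean(dim=2)    # mean-pool -> (batch, z_dim)
        return z

    def forward(self, tokens):
        # tokens: (batch, n_sentences, sent_len) consecutive sentences per example
        b, n, l = tokens.shape
        z = self.encode(tokens.reshape(b * n, l)).reshape(b, n, -1)
        c, _ = self.g_ar(z)                          # context vectors c_t
        return z, c
```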

Training details
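
Conceptually, training minimizes the InfoNCE loss: the context c_t predicts the encoding of the sentence k steps ahead through W_k, and the other sentences in the batch serve as negatives. Below is a hedged sketch, assuming the z and c tensors produced by the model sketched above; it is not the repository's actual training loop.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, c, W_k):
    # z: (batch, n_sentences, z_dim) sentence encodings from g_enc
    # c: (batch, n_sentences, c_dim) context vectors from g_ar
    batch, n, _ = z.shape
    loss, terms = 0.0, 0
    for k, W in enumerate(W_k, start=1):
        if n - k <= 0:
            continue
        pred = W(c[:, :n - k, :]).reshape(-1, z.size(-1))   # predictions W_k c_t
        target = z[:, k:, :].reshape(-1, z.size(-1))        # true futures z_{t+k}
        logits = pred @ target.t()   # positives on the diagonal; every other
                                     # sentence in the batch acts as a negative
        labels = torch.arange(logits.size(0), device=logits.device)
        loss = loss + F.cross_entropy(logits, labels)
        terms += 1
    return loss / max(terms, 1)
```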

Requirements

Usage Instructions

1. Pretraining

Configuration File

This implementation uses a configuration file for convenient setup of the model. The config_cpc.yaml file includes the original paper's parameters by default. You have to adjust the following parameters to get started:

Optionally, if you want to log your experiments with comet.ml, you just need to install the library and set your api_key.
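
For reference, a minimal comet.ml call looks like the snippet below; the project name and metric are placeholders, and in this repository the api_key comes from the configuration rather than being hard-coded.

```python
from comet_ml import Experiment

# Placeholder values; in this repository the api_key comes from the config file.
experiment = Experiment(api_key="YOUR_API_KEY", project_name="cpc-nlp")
experiment.log_metric("train_loss", 1.234, step=100)
```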

Dataset

This model uses the BookCorpus dataset for pretraining. You have to organize your data according to the following structure:

├── BookCorpus
│   └── data
│       ├── file_1.txt
│       ├── file_2.txt 

Then you have to set the path of your dataset in the books_path parameter of the config_cpc.yaml file.

Note: You could use publicly available files provided by Igor Brigadir at your own risk.
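
To double-check the setup, a short script like the one below should find your BookCorpus files; note that the placement of the books_path key inside config_cpc.yaml is an assumption here.

```python
import glob
import os
import yaml

with open("config_cpc.yaml") as f:
    config = yaml.safe_load(f)

# Assumption: books_path may be nested under a different section in the real file.
books_path = config["books_path"]
files = sorted(glob.glob(os.path.join(books_path, "data", "*.txt")))
print(f"Found {len(files)} BookCorpus text files")
```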

Training

When you have completed all the steps above, you can run:

python main.py

The implementation automatically saves a log of the experiment with the name cpc-date-hour and also saves the model checkpoints with the same name.

Resume Training

If you want to resume your model training, you just need to set the name of your experiment (cpc-date-hour) in the resume_name parameter of the config_cpc.yaml file and then run train.py.

2. Vocabulary Expansion

The CPC model employs vocabulary expansion in the same way as the Skip-Thought model. You just need to modify the run_name and word2vec_path parameters and then execute:

python vocab_expansion.py

The result is a numpy file of embeddings and a pickle file of the vocabulary. They will appear in a folder named vocab_expansion/.
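
For context, Skip-Thought-style vocabulary expansion fits a linear mapping from the word2vec space to the trained embedding space using the words shared by both vocabularies, then projects every word2vec vector through that mapping. The sketch below illustrates the idea; the file names, gensim/scikit-learn usage, and function signature are assumptions, not vocab_expansion.py's actual interface.

```python
import pickle
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LinearRegression

def expand_vocabulary(cpc_embeddings, cpc_vocab, word2vec_path):
    # cpc_embeddings: (len(cpc_vocab), dim) trained embedding matrix
    # cpc_vocab: dict mapping word -> row index in cpc_embeddings
    w2v = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
    shared = [w for w in cpc_vocab if w in w2v.key_to_index]

    # Fit a map W so that W(w2v[word]) approximates the trained embedding
    # of every word the two vocabularies share
    X = np.stack([w2v[w] for w in shared])
    Y = np.stack([cpc_embeddings[cpc_vocab[w]] for w in shared])
    mapping = LinearRegression().fit(X, Y)

    # Project the full word2vec vocabulary into the model's embedding space
    expanded_vocab = {w: i for i, w in enumerate(w2v.key_to_index)}
    expanded_embeddings = mapping.predict(w2v.vectors)

    # Illustrative output paths mirroring the files described above
    np.save("vocab_expansion/embeddings.npy", expanded_embeddings)
    with open("vocab_expansion/vocab.pkl", "wb") as f:
        pickle.dump(expanded_vocab, f)
```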

3. Training a Classifier

Configuration File

This implementation uses a configuration file to set up the classifier. You have to set the following parameters in the config_clf.yaml file:

Dataset

This classifier uses a common NLP benchmark. You have to organize your data according to the following structure:

├── dataset_name
│   └── data
│       └── task_name
│           ├── task_name.train.txt
│           ├── task_name.dev.txt 

Then you have to set the path of your data (dataset_path) and the task name (dataset_name) in the config_clf.yaml file.

Note: You could use publicly available files provided by zenRRan.

Training

When you have completed the steps above, you can run:

python main_clf.py

The implementation automatically saves a log of the experiment with the name cpc-clf-date-hour and also saves the model checkpoints with the same name.
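
For reference, the usual protocol on these benchmarks is to freeze the pretrained encoder and train a lightweight classifier on the frozen sentence features. The sketch below shows that idea; the class and method names are illustrative (they reuse the encode() helper from the architecture sketch above) and are not main_clf.py's actual code.

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, cpc_model, feature_dim, num_classes):
        super().__init__()
        self.cpc = cpc_model
        for p in self.cpc.parameters():      # keep the pretrained encoder frozen
            p.requires_grad = False
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, tokens):
        with torch.no_grad():
            features = self.cpc.encode(tokens)   # frozen CPC sentence features
        return self.head(features)
```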

Disclaimer

The model should be trained for 1e8 steps with a batch size of 64 on each of 8 GPUs. The authors provided me with a snapshot of the first 1M training steps, which you can find here, and the results of my implementation are here. There is a slight difference, which may be due to factors such as the dataset or initialization. I have not been able to train the model fully, so I have not replicated the benchmark results.

If anyone can fully train the model, feel free to share the results. I will be attentive to any questions or comments.

References