shailja-thakur / CodeGen-Fine-Tuning

Apache License 2.0

CodeGen fine tuning with HuggingFace + Deepspeed

This is a step-by-step process for fine-tuning CodeGen on specific programming languages using Hugging Face Transformers and DeepSpeed.

CodeGen is a suite of code-based language models by Salesforce (https://github.com/salesforce/CodeGen/blob/main/README.md). The models vary in their training corpus and number of parameters, and are named following the convention codegen-{model-size}-{data}.

model-size has 4 options: 350M, 2B, 6B, 16B, which represent the number of parameters in each model.

data has 3 options: nl, multi, mono.

A detailed description of the models is as follows:

CodeGen models

model name          data   model-size
codegen-350M-nl     nl     350M
codegen-350M-multi  multi  350M
codegen-350M-mono   mono   350M
codegen-2B-nl       nl     2B
codegen-2B-multi    multi  2B
codegen-2B-mono     mono   2B
codegen-6B-nl       nl     6B
codegen-6B-multi    multi  6B
codegen-6B-mono     mono   6B
codegen-16B-nl      nl     16B
codegen-16B-multi   multi  16B
codegen-16B-mono    mono   16B
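
Any of these checkpoints can be loaded directly from the Hugging Face Hub with the transformers library. As a quick sanity check before fine-tuning, here is a minimal sketch using the smallest codegen-350M-multi model; swap in any name from the table above:

# Minimal sketch: load a CodeGen checkpoint from the Hugging Face Hub
# and generate a short completion (codegen-350M-multi used here for speed).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-multi"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))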

The following is a detailed set of instructions for replicating the CodeGen fine-tuning on a local server:

The following steps have been tested on an HPC with a Singularity container running Ubuntu 20.04 and 50 GB RAM. However, the setup can also be replicated on a machine running Ubuntu 20.04.

Prepare the training corpus.

For CodeGen models, the training data has to be in a loose JSON (JSON Lines) format, with one JSON object per line followed by a newline, as follows:

{"text": "your data chunk 1"}\n{"text": "your data chunk 2"}\n...

I used the following code snippet to prepare the JSON file:

import json

# df_code is a pandas DataFrame with a column named 'text';
# each row becomes one JSON object on its own line.
with open('code_segments.json', 'a') as f:
    for row in df_code['text'].values:
        dic = {"text": str(row)}
        f.write(json.dumps(dic))
        f.write('\n')

Note that in this case the for loop iterates over a pandas DataFrame df_code with a column named text. You may tweak the snippet according to the type of data you are reading.
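
Before launching the run, you can sanity-check that the file is valid JSON Lines by loading it with the Hugging Face datasets library. A quick sketch, assuming datasets is installed and the file name matches the snippet above:

# Quick check (assumes the `datasets` library is installed):
# load the JSON Lines file and inspect the first record.
from datasets import load_dataset

ds = load_dataset("json", data_files="code_segments.json", split="train")
print(ds)             # row count and column names (should include 'text')
print(ds[0]["text"])  # first data chunk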

Prepare the environment on your machine

I recommend the following setup for fine-tuning. I created a conda environment inside the Singularity container; however, if you are not using a container, you can create the conda environment directly on your machine:

conda create --name anyname python=3.X

Then, activate the environment:

conda activate anyname

Then install the following software libraries inside the environment (after conda activate name_of_the_conda_env). Please note that it is assumed the prerequisites are already installed (pip, sklearn, pandas, numpy, scipy, and other packages for basic data science).

pip install git+https://github.com/huggingface/transformers/
pip install deepspeed

Then launch the fine-tuning with DeepSpeed, for example:

deepspeed --num_gpus 2 --num_nodes 1 run_clm.py \
    --model_name_or_path=Salesforce/codegen-6B-multi \
    --dataset_name code_segments_verilog \
    --tokenizer_name Salesforce/codegen-16B-multi \
    --block_size 1024 \
    --per_device_train_batch_size=1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --save_steps=100 \
    --output_dir=CodeGen/codegen-6B-verilog-3-epochs \
    --report_to 'wandb' \
    --do_train --do_eval --fp16 --overwrite_output_dir \
    --deepspeed ds_config.json
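
The command above points to a DeepSpeed configuration file (ds_config.json), which is not included here. As a rough illustration, a minimal ZeRO stage 2 config that works with the Hugging Face Trainer integration looks roughly like the following; the "auto" values are filled in by the Trainer from its own command-line arguments, and your actual ds_config.json may differ:

{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}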

To run the fine-tuning as a job on the HPC, I created a Slurm script (run-codegen-finetune.SBATCH) that runs the above command with the conda environment inside the Singularity container.
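
The exact script depends on your cluster, but run-codegen-finetune.SBATCH can look roughly like the sketch below; the resource requests, container image path, and environment name are placeholders for your own setup:

#!/bin/bash
#SBATCH --job-name=codegen-finetune
#SBATCH --nodes=1
#SBATCH --gres=gpu:2          # matches --num_gpus 2 in the deepspeed command
#SBATCH --mem=50G
#SBATCH --time=24:00:00

# Placeholder image path and environment name; run the fine-tuning
# command from above inside the Singularity container.
singularity exec --nv /path/to/ubuntu20.04.sif bash -c "
    source activate anyname
    deepspeed --num_gpus 2 --num_nodes 1 run_clm.py ... --deepspeed ds_config.json  # full command from above
"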

Since the command reports to Weights & Biases (--report_to 'wandb'), log in to wandb first:

wandb login

Note that the wandb session may time out, so you can also open a new terminal, log in to wandb, and leave that terminal open while you run the fine-tuning in another window.

You can also remove wandb from the fine-tuning altogether by dropping the --report_to option. If you would like to use TensorBoard in place of wandb, simply replace 'wandb' with 'tensorboard' and configure the TensorBoard log path (https://www.tensorflow.org/tensorboard/get_started).
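
For example, assuming the rest of the command stays the same, the reporting flags would change to something like this (--logging_dir is the standard Trainer argument for the TensorBoard log location; the path is just an example):

--report_to 'tensorboard' --logging_dir runs/codegen-6B-verilog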

Installing packages using requirements.txt

You can also install the requirements as follows, resolving any conflicting libraries along the way:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt