ridgerchu / SpikeGPT

Implementation of "SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks"
BSD 2-Clause "Simplified" License

Training setup #13

Open diederik-vink opened 10 months ago

diederik-vink commented 10 months ago

Hi, I'm attempting to replicate the training runs on the different datasets. Could you provide some insight into the configurations you used to train on each of the three datasets mentioned in the paper?

Thanks in advance!

ridgerchu commented 10 months ago

Hello, we trained our model using enwik8 and OpenWebText. For other datasets, we use a version pre-trained on OpenWebText. Could you specify which dataset you're interested in replicating, or do you want to replicate all of them?

diederik-vink commented 10 months ago

Hi, sorry for the lack of clarity. More specifically, I am looking to fine-tune the OpenWebText pre-trained model (primarily the 216M parameter version) on both the WikiText-2 and WikiText-103 datasets, as well as on enwik8 as defined in the train.py script. I've assumed that the provided code contains the setup used for enwik8, but I'd like to replicate the WikiText-2 and WikiText-103 fine-tuning runs as well.

I have access to 4x V100 GPUs so that I can replicate your runs as accurately as possible. Additionally, I have a working environment that can run train.py; since I'd like to maximize performance, I'm hoping to avoid the Docker image if possible.

ridgerchu commented 10 months ago

Hi, thanks for clarifying! I've updated the README.md with more pre-training details; please refer to the updated README.

DiederikVink commented 10 months ago

Thanks for the details on pre-training on a large corpus; that will be useful as I go along as well. For what I'm working on now, I was actually looking for the hyperparameters you used to fine-tune the 216M model on WikiText-2 and WikiText-103, specifically the batch size, learning rates, and number of epochs, assuming training on 4x V100 GPUs as stated in the paper.

ridgerchu commented 10 months ago

Hi, sorry for the misunderstanding. I've uploaded the pre-tokenized WikiText-103 dataset and updated the README with detailed information on fine-tuning the model.

diederik-vink commented 10 months ago

Thanks for updating this! I've attempted fine-tuning on the wikitext-2 dataset, but I'm seeing a very slow runtime. My config is as follows:

ctx_len = 1024        # ===> increase T_MAX in model.py if your ctx_len > 1024  
n_layer = 18  
n_embd = 768   
model_type = 'RWKV'  
batch_size = 3  
lr_init = 3e-6  
lr_final = 3e-6  
n_epoch = 10  
epoch_length_fixed = 10000  

If you have any suggestions as to why this configuration leads to such a slow runtime (6.5 hrs per epoch), they would be most welcome!

To investigate further, I've tried running training with your default setup (as specified in train.py in this repo) on the enwik8 dataset. The paper reports training the 216M model in roughly 48 hrs. Although I am not sure how many epochs training was run for, train.py seems to indicate 1000 'mini-epochs', yet it is currently taking me 9 hrs per 'mini-epoch' when splitting training over 4x V100 GPUs. This is confusing, as it would put the total training time at 9000 hrs rather than 48 hrs. My setup could not fit a batch size of 12; the largest batch size that fits is 3. The working setup runs on 4x V100 GPUs, using Hugging Face Accelerate to parallelize the work across the four GPUs.

Do you have any advice as to what might cause this discrepancy between my setup and the one described in the paper?

ridgerchu commented 9 months ago

Hi,

It seems the key issue affecting your runtime is the number of mini-epochs used in training. The mini-epoch count should be derived from the total number of training tokens, which is the product of the mini-epoch count, the number of iterations, and the context length. This count directly determines the training duration.

For the Wikitext-2 dataset, the total token count is considerably smaller than what the default training configuration assumes. Hence, keeping a high mini-epoch count (as in the default setup) together with a high iteration count will significantly prolong training. I recommend recalibrating the mini-epoch count to match the actual size of your dataset; this adjustment should bring your training duration closer to the expected timeline.
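
As a rough sketch of the arithmetic (the WikiText-2 token count below is approximate, and epoch_length_fixed is treated as the number of ctx_len-sized samples drawn per mini-epoch; check train.py and your tokenized data for the exact values):

# Back-of-the-envelope estimate of how much data one mini-epoch covers.
# The dataset size is a placeholder; substitute the length of your tokenized corpus.
dataset_tokens = 2_000_000        # rough WikiText-2 size
ctx_len = 1024
epoch_length_fixed = 10_000       # assumed: ctx_len-sized samples per mini-epoch

tokens_per_mini_epoch = epoch_length_fixed * ctx_len             # ~10.2M tokens
passes_per_mini_epoch = tokens_per_mini_epoch / dataset_tokens   # ~5 passes over WikiText-2

# To cover the dataset roughly once per mini-epoch, shrink epoch_length_fixed instead:
epoch_length_one_pass = dataset_tokens // ctx_len                # ~1950 samples
print(passes_per_mini_epoch, epoch_length_one_pass)

With a smaller epoch_length_fixed (or fewer mini-epochs), the wall-clock time per mini-epoch shrinks proportionally.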

Hope this helps in optimizing your training process!

diederik-vink commented 9 months ago

Hi,

Thanks for the response. What are values you used for n_epochs, ctx_length, lr_init, lr_final, epoch_length_fixed and batch_size to be able to replicate the results in your paper for Wikitext-2 and Wikitext-103 while running 4x V100 GPUs?

ridgerchu commented 9 months ago

Hi,

I suggest enabling DeepSpeed for your training process. It significantly boosts performance, especially on multi-GPU setups. In my experience with a V100 GPU, DeepSpeed offered a 3x-4x acceleration. It also allows for larger batch sizes, which is beneficial. After activating DeepSpeed, do monitor your VRAM usage. Here's an adjusted configuration to consider:

ctx_len = 1024        # Increase T_MAX in model.py if your ctx_len exceeds 1024  
n_layer = 18  
n_embd = 768  
model_type = 'RWKV'  
batch_size = 3        # Adjust based on VRAM capacity 
lr_init = 3e-6  
lr_final = 3e-6  
n_epoch = 1  
epoch_length_fixed = 10000
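
Since you're already parallelizing with Hugging Face Accelerate, one way to enable DeepSpeed is through Accelerate's DeepSpeedPlugin. The snippet below is only a minimal sketch with a toy model and toy data rather than the actual train.py wiring; it assumes deepspeed is installed and the script is started with accelerate launch, and the ZeRO stage and precision are illustrative choices.

# Minimal sketch: wrapping a training loop with Accelerate + DeepSpeed (ZeRO stage 2).
# The model, optimizer, and data are toy stand-ins, not SpikeGPT's real objects.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)

model = torch.nn.Linear(768, 768)                         # stand-in for the SpikeGPT model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-6)
dataset = TensorDataset(torch.randn(64, 768), torch.randn(64, 768))
loader = DataLoader(dataset, batch_size=3)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)                            # lets DeepSpeed handle loss scaling
    optimizer.step()

Launched with accelerate launch across the 4 GPUs, ZeRO stage 2 shards optimizer states and gradients across devices, which is what frees up VRAM for a larger batch size.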