kongds / MoRA

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
https://arxiv.org/abs/2405.12130
Apache License 2.0

Steps to run the code and optimal parameters for continual pretraining #5

Closed gs7vik closed 5 months ago

gs7vik commented 5 months ago

Is it possible to add a more detailed description of how to run this code? For example, does the code support loading models from Hugging Face, or do we have to download the model and give its path (the option I observed from looking at the code)? And when giving the data path, should the data be a JSON, JSONL, or CSV file? It would be helpful if these were specified. Also, do you recommend any optimal parameters for continual pretraining (I am trying to inject custom domain knowledge into an LLM)? And do you have any recommendations for the number of training samples needed to make the training a success? Thanks a lot for this approach :) !

kongds commented 5 months ago

Thanks for your interest in our paper.

Our training script currently supports loading models from HF, but it requires tokenized data (preprocessed before training and saved as an HF dataset containing columns like input_ids). However, you can use other, more convenient training scripts by simply installing our PEFT and adding the corresponding configuration lines to PeftConfig, as shown in the README.
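If it helps, here is a minimal sketch of the kind of preprocessing the script expects, assuming your raw data is a JSONL file with a `text` field; the file names, tokenizer, and sequence length are placeholders rather than our exact setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer and corpus; swap in your own model and data.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
raw = load_dataset("json", data_files="my_domain_corpus.jsonl", split="train")

def tokenize(batch):
    # Produces the input_ids column the training script reads.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
tokenized.save_to_disk("tokenized_corpus")  # reload later with datasets.load_from_disk
```

To use MoRA through our PEFT fork, the configuration looks roughly like this (see the README for the exact flags; `use_mora` and `mora_type` are the MoRA-specific ones, and the rank, target modules, and dropout below are example values, not recommendations):

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    use_mora=True,   # enable MoRA (flag from our PEFT fork, see README)
    mora_type=6,     # RoPE-based variant; type 1 is the sharing variant
    r=8,             # LoRA rank; the corresponding MoRA size is derived from it
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
```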

For the optimal parameters for continual pretraining, it seems that more trainable parameters achieve better results. For the number of training samples, our continual pretraining uses around 1.3B tokens over 5000 steps with a batch size of 128. This may depend on the amount of continual pretraining data or your computational resources. Detailed hyperparameters can be found in Appendix A of the paper.
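For a rough sense of how these numbers relate (assuming a 2048-token context length, which may not match our exact setting): 5000 steps × 128 sequences × 2048 tokens ≈ 1.3B tokens.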

gs7vik commented 5 months ago

Oh okay, will go through it. Thanks a lot!