Open tanliboy opened 1 month ago
All of the models are trained with causal language modeling; MLM is out of scope for this project, I think.
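For reference, with a causal objective the label at each position is simply the next token, which is why no `mlm` flag is needed. A minimal plain-Python sketch (illustrative only, no Hugging Face dependency; real collators just copy the input ids and let the model do the shift internally):

```python
def causal_lm_pairs(input_ids):
    """Build (context token, next-token label) pairs for causal LM.

    The label for position t is the token at position t + 1, so in
    practice labels are a shifted copy of the inputs.
    """
    inputs = input_ids[:-1]   # tokens the model conditions on
    labels = input_ids[1:]    # tokens it must predict
    return list(zip(inputs, labels))

# e.g. for a toy sequence of token ids:
pairs = causal_lm_pairs([101, 7592, 2088, 102])
```

This is the same objective for both continued pre-training and SFT; only the data differs.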
Thanks for your reply, @xiyang-aads-lilly !
In the case where we need to fine-tune on a small set of documents (<50M tokens), what would be the best strategy to integrate that knowledge into an LLM without causing significant regressions?
I have heard discussions comparing re-warming + re-sampling for continued pre-training with generating conversational data for instruction fine-tuning. Given that we use SFT for both continued pre-training and instruction fine-tuning (assuming the completion-only data loader is not used), it seems unnecessary to generate conversational data for instruction fine-tuning. Thoughts?
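For what it's worth, the "re-warming" part is usually just a fresh learning-rate schedule on the new data: warm back up to a (often reduced) peak, then decay. A hedged sketch of such a schedule (hyperparameter values here are illustrative, not from the handbook):

```python
import math

def rewarmed_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.05, min_lr=2e-6):
    """Linear re-warmup to peak_lr, then cosine decay to min_lr.

    Mirrors the common "re-warm then decay" recipe used when resuming
    pre-training on a new corpus; all constants are illustrative.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # linear warmup from ~0 back to the peak learning rate
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Re-sampling (mixing in some of the original pre-training distribution) is then a data-pipeline choice on top of this, to limit forgetting.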
Hi Team,
This is an amazing handbook! In the continued pre-training script (`run_cpt.py`), I saw that the `mlm` (masked language modeling) parameter is not used in the training process. I thought that the training objective, MLM vs. next-token prediction, was the major differentiation between pre-training and supervised fine-tuning. Thanks! Li