stanford-crfm / BioMedLM


code for pubmedgpt pre-training #2

Open yurakuratov opened 1 year ago

yurakuratov commented 1 year ago

Hi! I could not find the pre-training code mentioned in the blog post:

To train Pubmed GPT easily, quickly, and efficiently, we used the MosaicML Cloud for infrastructure and trained the model using MosaicML’s Composer and Streaming Dataset libraries. All model and training code is built off of PyTorch. See the code here!

https://www.mosaicml.com/blog/introducing-pubmed-gpt

Are you planning to make it public? It would help to understand how the model was actually trained with MosaicML's Composer. Another question: how was the model trained with FlashAttention converted to a Hugging Face-compatible GPT2LMHeadModel checkpoint?
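For context, a conversion like this typically amounts to extracting the model state dict from the Composer checkpoint and remapping its keys onto transformers' GPT2LMHeadModel. Below is a minimal, hypothetical sketch of that remapping, not the script actually used for BioMedLM; every path, key name, and prefix in it is an assumption.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Generic sketch of remapping a Composer training checkpoint into a Hugging Face
# GPT2LMHeadModel checkpoint. This is NOT the authors' actual conversion script:
# the checkpoint path, the "state" -> "model" layout, and the "model." key prefix
# are assumptions that depend on the Composer version and the training wrapper,
# and FlashAttention variants may additionally need per-layer key renames.
ckpt = torch.load("composer_checkpoint.pt", map_location="cpu")
raw_state_dict = ckpt["state"]["model"]

# Strip the wrapper prefix so keys line up with transformers' GPT2LMHeadModel names.
state_dict = {
    (k[len("model."):] if k.startswith("model.") else k): v
    for k, v in raw_state_dict.items()
}

# Placeholder architecture settings for a 2.7B-scale GPT-2 with the 28896-token vocabulary.
config = GPT2Config(vocab_size=28896, n_embd=2560, n_layer=32, n_head=20)
hf_model = GPT2LMHeadModel(config)

# strict=False so mismatched keys are reported instead of raising; inspect both lists.
missing, unexpected = hf_model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)

hf_model.save_pretrained("biomedlm_hf")  # hypothetical output directory
```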

J38 commented 1 year ago

Yes, we will improve the documentation on pre-training. I will discuss with MosaicML what we should post.

metemadi commented 1 year ago

Thank you for such incredible work! Are you able to comment on how the new tokenizer was created? That is, were the combined tokens added to the "end" of the GPT tokenizer, or were tokens removed, etc.? How were the new token embeddings initialized? Again, a huge thank you for this amazing service to the open-source ML community!

J38 commented 1 year ago

A brand new tokenizer was trained with 28896 tokens. I'll upload the training script to this repo.

J38 commented 1 year ago

I put this in the tokenize folder. I just ran it on a file with all of the text from the PubMed abstracts.
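For reference, here is a minimal sketch of training a GPT-2-style byte-level BPE tokenizer from scratch with the Hugging Face tokenizers library, assuming the abstracts are concatenated into one plain-text file. The file and directory names are placeholders; the actual script in the tokenize folder is authoritative.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer (GPT-2 style) from scratch on PubMed text.
# "pubmed_abstracts.txt" and "tokenizer_out" are placeholder names.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pubmed_abstracts.txt"],
    vocab_size=28896,                  # vocabulary size mentioned above
    special_tokens=["<|endoftext|>"],  # GPT-2-style end-of-text token
)

# Writes vocab.json and merges.txt, which GPT2TokenizerFast can load directly.
tokenizer.save_model("tokenizer_out")
```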

J38 commented 1 year ago

When you launch pre-training from scratch with the Hugging Face and Composer combination we had, it will just randomly initialize the embeddings ...

J38 commented 1 year ago

I believe this is where embeddings get initialized ...

https://github.com/huggingface/transformers/blob/7032e0203262ebb2ebf55da8d2e01f873973e835/src/transformers/models/gpt2/modeling_gpt2.py#L462
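For illustration: constructing a GPT2LMHeadModel directly from a config (rather than via from_pretrained) runs that _init_weights hook, which samples embedding and linear weights from a normal distribution with std config.initializer_range. A minimal sketch follows; the architecture numbers are placeholders rather than BioMedLM's published configuration.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Placeholder architecture settings with the custom 28896-token vocabulary;
# these are not guaranteed to match BioMedLM's exact hyperparameters.
config = GPT2Config(vocab_size=28896, n_embd=2560, n_layer=32, n_head=20)

# Building the model from a config (no from_pretrained) runs _init_weights,
# which samples embedding/linear weights from N(0, initializer_range^2).
model = GPT2LMHeadModel(config)

# The token embedding std should be close to config.initializer_range (0.02 by default).
print(model.transformer.wte.weight.std().item())
```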

metemadi commented 1 year ago

Thank you, thank you! In the blog post you say "PubMedGPT 2.7B was trained on all the PubMed abstracts and full documents from The Pile." So do you start with a pre-trained model (like GPT-Neo 2.7B, which was pre-trained with a different tokenizer and trained on The Pile), then change tokenizers and train again on PubMed, or do you just mix the PubMed data with the Pile data and start the whole thing from scratch? A huge thank you again - this is so cool.

J38 commented 1 year ago

Everything was from scratch: we trained the tokenizer first, and then pre-trained the model from scratch using the new tokenizer. There is no connection to any other tokenizer or model.
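So the pipeline is: train the tokenizer, then instantiate a fresh, randomly initialized model sized to that vocabulary and pre-train it. A minimal sketch, assuming the tokenizer files from the earlier step were saved to a local directory (the path and architecture numbers are placeholders):

```python
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Load the newly trained tokenizer (vocab.json / merges.txt); "tokenizer_out" is a placeholder path.
tokenizer = GPT2TokenizerFast.from_pretrained("tokenizer_out")

# Fresh, randomly initialized model sized to the new vocabulary.
# There is no from_pretrained() call on any existing model, so no weights
# are inherited from GPT-2, GPT-Neo, or anything else.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_embd=2560, n_layer=32, n_head=20)
model = GPT2LMHeadModel(config)
```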

shashank140195 commented 1 year ago

Hi @J38.

Any updates on making the pre-training code of BioMedLM public?