yurakuratov opened this issue 1 year ago
Yes, we will improve the documentation on pre-training; I will discuss with MosaicML what we should post.
Thank you for such incredible work! Are you able to comment on how the new tokenizer was created? That is, were the combined tokens added to the "end" of the GPT tokenizer, or were tokens removed, etc.? How were the new token embeddings initialized? Again, a huge thank you for this amazing service to the open-source ML community!
A brand new tokenizer was trained with 28896 tokens. I'll upload the training script to this repo.
I put this in the tokenize folder. I just ran it on a file with all of the text from the PubMed abstracts.
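For anyone who wants the gist before the script lands in the repo: training a byte-level BPE tokenizer from scratch on a text corpus with the Hugging Face `tokenizers` library looks roughly like this. This is a hedged sketch, not the actual script from the tokenize folder; the special token, the in-memory corpus stand-in, and the exact trainer settings are assumptions on my part (only the 28896 vocabulary size comes from the thread).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE, the same family of tokenizer GPT-2 uses.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# 28896 matches the vocabulary size mentioned above; the special token
# is an assumption (GPT-2-style end-of-text marker).
trainer = trainers.BpeTrainer(
    vocab_size=28896,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Stand-in for the file of PubMed abstract text; the real run would
# pass the abstracts file to tokenizer.train(files=[...], trainer=trainer).
corpus = [
    "We report a randomized controlled trial of a novel therapy.",
    "Patients were followed for 12 months after treatment.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("tokenizer sanity check")
print(len(enc.ids) > 0)
```

The trained tokenizer can then be saved with `tokenizer.save("tokenizer.json")` and loaded into `transformers` via `PreTrainedTokenizerFast(tokenizer_file=...)`.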
When you launch pretraining from scratch with the Hugging Face and Composer combination we had, it will just randomly initialize the embeddings ...
I believe this is where embeddings get initialized ...
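To make the "randomly initialized" point concrete: building a `GPT2LMHeadModel` from a config (rather than `from_pretrained`) gives fresh random weights, including a token embedding matrix sized to the new vocabulary. A minimal sketch, using a deliberately tiny model so it runs quickly; the real PubMedGPT 2.7B dimensions are of course much larger, and only the 28896 vocabulary size comes from this thread.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in dimensions (n_embd / n_layer / n_head are NOT the real
# 2.7B model's); vocab_size matches the new 28896-token tokenizer.
config = GPT2Config(vocab_size=28896, n_embd=64, n_layer=2, n_head=2)

# Constructing from config = random init, no pre-trained weights loaded.
model = GPT2LMHeadModel(config)

# Token embedding table: one randomly initialized row per vocab entry.
print(model.transformer.wte.weight.shape)  # -> torch.Size([28896, 64])
```

This is why no surgery on an existing embedding matrix was needed: there was never a pre-trained checkpoint whose embeddings had to be resized or re-mapped.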
Thank you, thank you! In the blog post you say "PubMedGPT 2.7B was trained on all the PubMed abstracts and full documents from The Pile." So do you start with a pre-trained model (like GPT-Neo 2.7B, which was pre-trained with a different tokenizer and trained on The Pile), then change tokenizers and train again on PubMed? Or do you just mix the PubMed data with the Pile data and start the whole thing from scratch? A huge thank you again - this is so cool!
Everything was from scratch: we trained the tokenizer first, then pre-trained the model from scratch using the new tokenizer. There is no connection to any other tokenizer or model.
Hi @J38.
Any updates on making the pre-training code of BioMedLM public?
Hi! I could not find the pre-training code mentioned in the blog post:
https://www.mosaicml.com/blog/introducing-pubmed-gpt
Are you planning to make it public? It would help us understand how the model was actually trained with MosaicML's Composer. Another question: how was the model trained with FlashAttention converted to a Hugging Face-compatible GPT2LMHeadModel checkpoint?