
Feature Request: SmolLM tutorial documentation #2256

Open kadirnar opened 1 month ago

kadirnar commented 1 month ago

Hi,

I want to fine-tune the SmolLM model, but I couldn't find any information about its tokenizer on the blog or the model cards. Can you help me with the training code?

Example Code:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load the train/test splits of the fine-tuning dataset
train_dataset = load_dataset("prompt-enhancer-dataset", split="train")
eval_dataset = load_dataset("prompt-enhancer-dataset", split="test")

sft_config = SFTConfig(
    dataset_text_field="short_prompt",  # dataset column that holds the training text
    max_seq_length=1024,
    output_dir="smol_output",
    num_train_epochs=3,
    per_device_train_batch_size=32,  # per_gpu_train_batch_size is deprecated
    per_device_eval_batch_size=32,
    learning_rate=2e-4,
    warmup_steps=500,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    logging_dir="smol_output",
    hub_model_id="kadirnar/SmolLM-360M-prompt-enhancer",
    push_to_hub=True,
    hub_token="hf_token",  # placeholder; use a real token or `huggingface-cli login`
)
trainer = SFTTrainer(
    "HuggingFaceTB/SmolLM-360M",  # full Hub id so the checkpoint resolves
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=sft_config,
)
trainer.train()
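
If you want to control exactly which tokenizer is used instead of relying on the trainer's automatic lookup, you can load the model and tokenizer yourself and pass both in. A minimal sketch, assuming the Hub id HuggingFaceTB/SmolLM-360M and a TRL version whose SFTTrainer still accepts a tokenizer argument:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-360M")
# Causal LM tokenizers often ship without a pad token; fall back to EOS
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=sft_config,
)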
osanseviero commented 1 month ago

cc @eliebak

eliebak commented 1 month ago

Hey, thanks for your message! The tokenizer is here: HuggingFaceTB/cosmo2-tokenizer. We'll add it to the model card, thanks for noticing! :)
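
For reference, loading that tokenizer is a one-liner with transformers; a minimal sketch, assuming the repo id above is public:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo2-tokenizer")
print(tokenizer.vocab_size)        # sanity-check the vocabulary size
print(tokenizer("Hello SmolLM!"))  # inspect the token ids it produces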

kadirnar commented 1 month ago

@eliebak Could you check the link again? It returns a 404 error. Could you also add sample training code?

eliebak commented 1 month ago

It should work now, sorry (you also have a tokenizer file in the model repo, btw). We will release the training code soon; in the meantime, there are some examples of how to train a model with nanotron here: https://github.com/huggingface/nanotron/tree/main/examples :)
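
For anyone trying the nanotron route in the meantime: training there is driven by a YAML config plus a launcher script. A minimal launch sketch, assuming the run_train.py entry point and the example config shipped in the repo (check the nanotron README for the exact, current invocation):

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 run_train.py --config-file examples/config_tiny_llama.yaml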