Closed sfriedowitz closed 6 months ago
Looking now! Pulled the branch, read the instructions for direct_job_execution.ipynb, and put together a script like the one below. Where do we specify the cluster information in the new workflow? It seems like it should be somewhere in here, right? `FinetuningRayConfig`?
```python
from ray.job_submission import JobSubmissionClient
from pathlib import Path

from lm_buddy import LMBuddy
from lm_buddy.jobs.configs import (
    FinetuningJobConfig,
    FinetuningRayConfig,
    LMHarnessJobConfig,
    LMHarnessEvaluationConfig,
)
from lm_buddy.integrations.huggingface import (
    AutoModelConfig,
    TextDatasetConfig,
    TrainerConfig,
    AdapterConfig,
)
from lm_buddy.integrations.wandb import WandbRunConfig

# Base model to finetune from HuggingFace
model_config = AutoModelConfig(load_from="distilgpt2")

# Text dataset for finetuning
dataset_config = TextDatasetConfig(
    load_from="imdb",
    split="train[:100]",
    text_field="text",
)

# HuggingFace trainer arguments
trainer_config = TrainerConfig(
    max_seq_length=256,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="epoch",
    save_steps=1,
)

# LoRA adapter settings
adapter_config = AdapterConfig(
    peft_type="LORA",
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    lora_dropout=0.2,
)

# Define tracking for finetuning run
tracking_config = WandbRunConfig(
    name="example-finetuning",
    project="lm-buddy-examples",  # Update to your project name
    entity="mozilla-ai",  # Update to your entity name
)

# Ray train settings
ray_config = FinetuningRayConfig(
    use_gpu=False,  # Change to True if GPUs are available on your machine
    num_workers=2,
)

# Full finetuning config
finetuning_config = FinetuningJobConfig(
    model=model_config,
    dataset=dataset_config,
    trainer=trainer_config,
    adapter=adapter_config,
    tracking=tracking_config,
    ray=ray_config,
)
```
> Where do we specify the cluster information in the new workflow?
Nothing is changing in how you specify the cluster information. The CLI of the package is not changed, so you can use the same commands as an entrypoint to a Ray job submission using their SDK.
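In other words, cluster details still live in the Ray job submission, not in the lm_buddy configs. A rough sketch using Ray's job SDK; the cluster address, entrypoint command, and runtime env below are illustrative assumptions, not values from this PR:

```python
# Hypothetical submission script: adjust the address, entrypoint, and
# runtime_env to match your own cluster and config file locations.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # your Ray head node
client.submit_job(
    entrypoint="python -m lm_buddy finetune --config finetuning_config.yaml",
    runtime_env={"working_dir": ".", "pip": ["lm-buddy"]},
)
```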
Tested and left some comments; unit tests pass and the sample job works!
Thanks! I'm a bit sidetracked at the moment, but will address most of them in the next few hours.
What's changing

- Removed the `run_job` method in favor of an `LMBuddy` class that has methods for `finetune` and `evaluate`.
- Added a `LoadableAssetPath` type and associated data structures to represent any `load_from` path for a HF asset. See inline comments for motivation for this change.

Note that the CLI API is not changed by these internal changes, so you can still execute the package as a Ray entrypoint in the same manner as before.
How to test it
Related Jira Ticket
Additional notes for reviewers
In follow-up PRs into this dev branch, I would like to do the following: