mosaicml / composer


Skip nn.ModuleList in FSDP auto wrapping #2430

Closed · jmif closed this issue 1 year ago

jmif commented 1 year ago

I've got a setup that roughly looks like this:

import torch
import transformers
from composer import Trainer
from composer.metrics import CrossEntropy
from composer.models import HuggingFaceModel
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR
from torchmetrics import ConfusionMatrix
from torchmetrics.classification import MulticlassAccuracy

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "gpt2", num_labels=label_feature.num_classes
)

model.config.pad_token_id = model.config.eos_token_id

# Metrics are defined before the HuggingFaceModel wrapper that uses them.
metrics = [
    CrossEntropy(),
    MulticlassAccuracy(num_classes=label_feature.num_classes, average="micro", top_k=3),
    ConfusionMatrix(task="multiclass", num_classes=label_feature.num_classes),
]

composer_model = HuggingFaceModel(
    model, tokenizer=tokenizer, metrics=metrics, use_logits=True
)

optimizer = AdamW(
    params=composer_model.parameters(),
    lr=3e-5,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=3e-6,
)

linear_lr_decay = LinearLR(optimizer, start_factor=1.0, end_factor=0, total_iters=150)

trainer = Trainer(
    device_train_microbatch_size="auto",
    model=composer_model,  # This is the model from the HuggingFaceModel wrapper class.
    train_dataloader=train_dataloader,
    eval_dataloader=evaluators,
    optimizers=optimizer,
    schedulers=[linear_lr_decay],
    device="gpu" if torch.cuda.is_available() else "cpu",
    precision="fp32",
    seed=42,
    loggers=[wandb_logger],
    save_interval="500ba",
    fsdp_config={
        "mixed_precision": {
            "param_dtype": None,
            "reduce_dtype": "torch.float32",
            "buffer_dtype": "torch.float32",
        },
        "verbose": True,
    }
)

When I go to train this, I get an error saying that FullyShardedDataParallel has no method len(). Digging into the model code, I found that the underlying GPT2 model has an nn.ModuleList as one of its members.

The ModuleList gets wrapped, and this ends up breaking the training run because the model's forward pass calls len(self.h). I was able to fix this by setting _fsdp_wrap to False on the ModuleList and _fsdp_wrap to True on each module inside it. It occurred to me that it may not make sense to wrap nn.ModuleList by default, so I'm opening this issue in case there is an opportunity to improve the default wrap implementation or fix a bug there. I'm fairly new at this, so this may be uninformed :).
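
For reference, a minimal sketch of that workaround, assuming the standard GPT2ForSequenceClassification layout (the transformer blocks live in the nn.ModuleList at model.transformer.h) and Composer's _fsdp_wrap attribute convention:

gpt2 = composer_model.model.transformer  # the underlying GPT2Model inside the HuggingFaceModel wrapper

# Don't let FSDP wrap the nn.ModuleList container itself, so len(self.h) keeps working...
gpt2.h._fsdp_wrap = False

# ...but do wrap each transformer block inside it.
for block in gpt2.h:
    block._fsdp_wrap = True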

bcui-db commented 1 year ago

Hey, thank you for pointing this out! I think this went through a slightly unexpected code path. In particular, our llm-foundry repo is how we typically interact with HF models. For example:

https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/finetune_example/gpt2-arc-easy--cpu.yaml#L8

Here we have code that makes it easier to interact with HF models + FSDP:

https://github.com/mosaicml/llm-foundry/blob/965bd374ff968a1c7d74a56c80a1730968e04e87/llmfoundry/models/hf/model_wrapper.py#L64

We are open to PRs to help smooth out the wrapping process.
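
For context, a purely hypothetical helper along these lines (illustrative only, not Composer's or llm-foundry's actual implementation) could skip nn.ModuleList containers during auto wrapping and wrap their children instead:

import torch.nn as nn

def mark_module_lists_for_fsdp(model: nn.Module) -> None:
    # Hypothetical helper: leave nn.ModuleList containers unwrapped so that
    # calls like len(self.h) still see the original list, and mark each child
    # module to be wrapped individually.
    for module in model.modules():
        if isinstance(module, nn.ModuleList):
            module._fsdp_wrap = False
            for child in module:
                child._fsdp_wrap = True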

jmif commented 1 year ago

Ah ok, I see the code; thanks for the reference. I'll close this for now but will keep it in mind as I continue to learn the project, and I'd be happy to contribute once I have a better understanding of things. Thanks!