huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Issue passing model into accelerate.prepare() --> Cannot convert to int without overflow #2500

Closed eobrien2002 closed 6 months ago

eobrien2002 commented 7 months ago

Info

Hi!

I am getting an error (seen below) when passing my model into accelerate.prepare() after I have already trained the model on one participant's worth of data. I've attached the start of the code where the error occurs.

The train() method (called in main() below) returns the model once it has finished with the current participant's data. I am running a multi-node, multi-GPU setup on SLURM.

I tried a different approach where I only passed the model into accelerate.prepare() once, when it was first loaded; however, I got an NCCL timeout error when the next participant started.
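
Roughly, that alternative looked like this (a simplified sketch, using the same helpers as in the reproduction below):

# Simplified sketch of the alternative approach: prepare the model and optimizer
# once up front, then prepare only each participant's dataloaders inside the loop.
# This ran for the first participant but hit an NCCL timeout when the second one started.
accelerator = Accelerator()
model, noise_scheduler, optimizer = load_models(config, accelerator)
model, optimizer = accelerator.prepare(model, optimizer)

for participant in config.participants:
    train_dataset, test_dataset, tmp_path = load_and_preprocess_eeg_data(participant)
    train_loader, test_loader = create_EEG_dataset(config.train_batch_size, train_dataset, test_dataset)
    # Only the new dataloaders go through prepare() here; the model stays wrapped
    train_loader, test_loader = accelerator.prepare(train_loader, test_loader)
    model = train(train_loader, accelerator, model, optimizer, ...)  # same train() call as in the reproduction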

I'm open to suggestions for restructuring my training approach (note: each participant is roughly 4 GB of data). Essentially, my objective is to train the model on one participant's dataset at a time, distributing each new dataset across the GPUs while continuing to train the same model throughout.

Here is the error (the same RuntimeError is raised on every rank; one traceback shown):

  File "/project/6037638/eobrie22/v2/ldm_train_v4.py", line 77, in main
    train_loader,progress_bar,global_step,lr_scheduler,max_train_steps,test_loader, model = prepare_training(config,train_loader,accelerator,test_loader,optimizer,model)
  File "/project/6037638/eobrie22/v2/ldm_train_v4.py", line 112, in prepare_training
    train_loader, lr_scheduler,test_loader,optimizer,model = accelerator.prepare(train_loader, lr_scheduler,test_loader,optimizer,model)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1213, in prepare
    result = tuple(
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1214, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1094, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1349, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
[2024-02-27 17:59:57,054] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 126965 closing signal SIGTERM

Reproduction

def main():
    config = Config()
    # Set up the accelerator
    accelerator = Accelerator()
    set_seed(42) 

    logger.info(accelerator.state)

    # Load the models
    model, noise_scheduler, optimizer = load_models(config,accelerator)

    for participant in config.participants:
        # Set up the EEG dataset
        with accelerator.main_process_first():
            if accelerator.is_main_process:
                logger.info(f"Training participant {participant}")
            train_dataset, test_dataset, tmp_path = load_and_preprocess_eeg_data(participant)

        train_loader, test_loader = create_EEG_dataset(config.train_batch_size,train_dataset, test_dataset)
        if accelerator.is_main_process:
            logger.info(f"Data is {len(train_loader)} batches long")

        # Prepare the training
        train_loader, progress_bar, global_step, lr_scheduler, max_train_steps, test_loader, model = prepare_training(
            config, train_loader, accelerator, test_loader, optimizer, model)
        # Train the model
        logger.info(f"Rank: {accelerator.state.process_index}; Training started with {len(train_loader)} batches")
        model = train(train_loader, accelerator, model, optimizer, lr_scheduler, config, global_step,
                      progress_bar, participant, test_loader, noise_scheduler, max_train_steps)
def prepare_training(config,train_loader,accelerator,test_loader,optimizer,model):
    num_update_steps_per_epoch = math.ceil(len(train_loader) / config.gradient_accumulation_steps)
    max_train_steps = config.num_train_epochs * num_update_steps_per_epoch
    num_warmup_steps = 0.1 * max_train_steps  # 10% of max_train_steps as an example
    num_training_steps = max_train_steps
    global_step = 0
    progress_bar = tqdm(range(0, max_train_steps), initial=global_step, desc="Steps",
                        disable=not accelerator.is_local_main_process)

    # Set up the learning rate scheduler
    lr_scheduler = get_scheduler("linear", optimizer=optimizer,
                                 num_warmup_steps=num_warmup_steps * accelerator.num_processes,
                                 num_training_steps=num_training_steps * accelerator.num_processes)

    # Prepare everything for the accelerator so that it can be used for distributed training
    train_loader, lr_scheduler, test_loader, optimizer, model = accelerator.prepare(
        train_loader, lr_scheduler, test_loader, optimizer, model)
    #weight_dtype = torch.float32

    # Move the models to the accelerator device
    logger.info(f"Model and Data prepared on {accelerator.state.process_index}") 

    return train_loader,progress_bar,global_step,lr_scheduler,max_train_steps,test_loader, model

Expected behavior

I expected accelerate.prepare() to take the model that is currently being trained and prepare it again with the new data. The issue may be that the model is already wrapped by the accelerator. My model is fairly complex (a UNet, a VAE, an encoder, and a few additional layers), and when I tried FSDP I got an error that the Accelerator could not find a transformer layer to wrap.

eobrien2002 commented 7 months ago

RE: This is the error I get when I unwrap the model and then try to prepare it again with a new dataset:

    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
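
For clarity, the unwrap-and-re-prepare attempt looks roughly like this (a minimal sketch; unwrapping with accelerator.unwrap_model() is the only change from the reproduction above):

# At the end of each participant, unwrap the DDP-wrapped model and pass the bare
# model back into prepare() together with the next participant's dataloaders
model = accelerator.unwrap_model(model)
train_loader, lr_scheduler, test_loader, optimizer, model = accelerator.prepare(
    train_loader, lr_scheduler, test_loader, optimizer, model)
# -> RuntimeError: value cannot be converted to type int without overflow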

Any help would be greatly appreciated

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.