huggingface / accelerate


Multi GPU Training Not Working #36

Closed ashim-mahara closed 3 years ago

ashim-mahara commented 3 years ago

When using Accelerate, training only utilizes 1 of the 2 GPUs present. I am following the general instructions in the repository. The architecture is an autoencoder.

dataloader = DataLoader(dataset, batch_size = 2048, shuffle=True, pin_memory=False, num_workers=20)
encoder = Encoder(bottleneck_size = 2, embedding_size = 40, vocab = dataset.vocab).to(device)
decoder = Decoder(bottleneck_size = 2, embedding_size = 40, vocab = dataset.vocab).to(device)
model = AutoEncoder(encoder, decoder).to(device)
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

I am transferring the samples in the batch to the device using the code below:

    for x in batch:
        batch[x] = batch[x].to(device)

The device is determined with:

device = accelerator.device

Both devices are visible: torch.cuda.device_count() returns 2.

The GPUs are RTX 2080s with CUDA version 11.2 and driver version 460.67. The distro is Pop!_OS.

xamm commented 3 years ago

Did you set "num_processes" in your config to 2?

I noticed that when this parameter is set to 1, only one GPU is used.

My working configuration is this:

{
  "distributed_type": "MULTI_GPU",
  "fp16": true,
  "machine_rank": 0,
  "main_process_ip": null,
  "main_process_port": null,
  "main_training_function": "main",
  "num_machines": 1,
  "num_processes": 2
}
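
As a rough sanity check for a config like the one above, the sketch below compares num_processes against the number of visible GPUs. The helper check_launch_config and its visible_gpus argument are hypothetical and not part of Accelerate; the only assumption is that multi-GPU training runs one process per GPU, as the comment above suggests.

```python
import json

# Hypothetical helper (not part of Accelerate): flags configs that would
# leave GPUs idle, assuming one training process per GPU.
def check_launch_config(config, visible_gpus):
    problems = []
    num_processes = config.get("num_processes", 1)
    if config.get("distributed_type") != "MULTI_GPU":
        problems.append("distributed_type is not MULTI_GPU")
    if num_processes < visible_gpus:
        problems.append(
            f"num_processes={num_processes} uses only part of the {visible_gpus} visible GPUs"
        )
    elif num_processes > visible_gpus:
        problems.append(
            f"num_processes={num_processes} exceeds the {visible_gpus} visible GPUs"
        )
    return problems

# The working configuration quoted above.
config = json.loads("""
{
  "distributed_type": "MULTI_GPU",
  "fp16": true,
  "machine_rank": 0,
  "main_process_ip": null,
  "main_process_port": null,
  "main_training_function": "main",
  "num_machines": 1,
  "num_processes": 2
}
""")

print(check_launch_config(config, visible_gpus=2))  # prints []
print(check_launch_config({**config, "num_processes": 1}, visible_gpus=2))
```

In a real script, visible_gpus could come from torch.cuda.device_count().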

sgugger commented 3 years ago

Could you please share the command you are using to launch your script? That would help debug your problem. Thanks!

ashim-mahara commented 3 years ago

I am using this in JupyterLab. Does it only work when training with scripts or through the CLI tool?

sgugger commented 3 years ago

Yes, the accelerate library currently only supports launching training scripts. Notebook launchers are on the roadmap, but not implemented yet.
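
Since only launched scripts are supported at this point, it can help to detect a notebook session early and say so. The following is a minimal sketch of a best-effort heuristic: it assumes that a Jupyter kernel has the ipykernel module loaded. The function name running_in_notebook is hypothetical and not part of Accelerate.

```python
import sys

def running_in_notebook(modules=None):
    """Best-effort guess: True when a Jupyter (IPython) kernel is loaded.

    `modules` defaults to sys.modules; it is a parameter only so the
    heuristic can be exercised without a real kernel.
    """
    modules = sys.modules if modules is None else modules
    return "ipykernel" in modules

if running_in_notebook():
    print(
        "Notebook kernel detected: move the training code into a script "
        "and start it with `accelerate launch`."
    )
```

The check is only a heuristic; other interactive front ends may not load ipykernel, so it can miss some environments.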

dzorlu commented 3 years ago

Hi, thanks for the great library, Sylvain! Not to hijack the thread, but I am having the same problem (happy to create a new issue, though).

The config file looks as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

The relevant part of the code is as follows:

    accelerator = Accelerator(fp16=config['fp16'], cpu=config['cpu'])
    print(accelerator.device)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
    num_epochs = int(config["num_epochs"])
    seed = int(config["seed"])
    batch_size = int(config["batch_size"])

    # If the batch size is too big we use gradient accumulation
    gradient_accumulation_steps = 1
    if batch_size > MAX_GPU_BATCH_SIZE:
        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
        batch_size = MAX_GPU_BATCH_SIZE

    # Instantiate dataloaders.
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size
    )
    valid_dataloader = DataLoader(
        validation_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )
    test_dataloader = DataLoader(
        test_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )

    # Instantiate the model (we build the model here so that the seed also controls new weight initialization)
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    prepared = accelerator.prepare(
        model, optimizer, train_dataloader, valid_dataloader, test_dataloader
    )
    model, optimizer, train_dataloader, valid_dataloader, test_dataloader = prepared

    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            #batch.to(accelerator.device)
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / gradient_accumulation_steps
            accelerator.backward(loss)
            if step % gradient_accumulation_steps == 0:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
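
The batch-size arithmetic in the snippet above can be sketched on its own. The example assumes that the dataloader batch size is per process, so the effective batch also scales with the number of processes. The function plan_batches and the value 16 for MAX_GPU_BATCH_SIZE are illustrative assumptions, since the constant is not shown in the excerpt.

```python
def plan_batches(batch_size, max_gpu_batch_size, num_processes=1):
    """Mirror the gradient-accumulation logic from the training script.

    Returns (per_step_batch_size, accumulation_steps, effective_batch_size).
    """
    accumulation_steps = 1
    if batch_size > max_gpu_batch_size:
        # Integer division: any remainder of the requested batch is dropped.
        accumulation_steps = batch_size // max_gpu_batch_size
        batch_size = max_gpu_batch_size
    effective = batch_size * accumulation_steps * num_processes
    return batch_size, accumulation_steps, effective

# Assumed limit of 16 samples per GPU step, 2 processes (one per GPU).
print(plan_batches(64, 16, num_processes=2))  # prints (16, 4, 128)
# A request that is not a multiple of the limit is rounded down.
print(plan_batches(40, 16))  # prints (16, 2, 32)
```

With two processes the effective batch doubles, so the learning rate or the requested batch size may need adjusting once both GPUs are actually in use.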

The script utilizes a single GPU, even though 2 GPUs are available.

>>> torch.cuda.device_count()
2

I launch the script from the command line:

accelerate launch training.py
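
One way to see how many processes the launch command above actually started is to print the distributed environment from inside the script. This is a minimal sketch that assumes the launcher exports the RANK, LOCAL_RANK and WORLD_SIZE variables used by the torch.distributed launchers; the function name describe_process is hypothetical.

```python
import os

def describe_process(env=None):
    """Summarize this process's place in the distributed job.

    Falls back to single-process defaults when the variables are unset.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return f"process {rank} of {world_size} (local rank {local_rank})"

# Each launched process prints its own line.
print(describe_process())
```

If only a single line reporting 1 process appears, the launcher started one process, which points at the config rather than at the training loop.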

Any help is appreciated. Thank you!

sgugger commented 3 years ago

Yes, a new issue would be cleaner, as this one is about accelerate not working out of the box in a Jupyter environment :-)

Could you report in your issue what gets printed (since you have a print(accelerator.device) in your script)? Thanks!

ashim-mahara commented 3 years ago

Should I close this issue then?

sgugger commented 3 years ago

I'll work on a notebook launcher in the coming days. You can close the issue now or when it's ready, as you prefer :-)