huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Multi-GPU CLI issue #39

Closed · dzorlu closed this issue 3 years ago

dzorlu commented 3 years ago

Hi! Thanks for the great library, Sylvain!

The config file looks as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

The relevant part of the code is as follows:

    accelerator = Accelerator(fp16=config['fp16'], cpu=config['cpu'])
    print(accelerator.device)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
    num_epochs = int(config["num_epochs"])
    seed = int(config["seed"])
    batch_size = int(config["batch_size"])

    # If the batch size is too big we use gradient accumulation
    gradient_accumulation_steps = 1
    if batch_size > MAX_GPU_BATCH_SIZE:
        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
        batch_size = MAX_GPU_BATCH_SIZE

    # Instantiate dataloaders.
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size
    )
    valid_dataloader = DataLoader(
        validation_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )
    test_dataloader = DataLoader(
        test_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )

    # Instantiate the model (we build the model here so that the seed also controls new weight initialization)
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    prepared = accelerator.prepare(
        model, optimizer, train_dataloader, valid_dataloader, test_dataloader
    )
    model, optimizer, train_dataloader, valid_dataloader, test_dataloader = prepared

    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            #batch.to(accelerator.device)
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / gradient_accumulation_steps
            accelerator.backward(loss)
            if step % gradient_accumulation_steps == 0:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

The script only uses a single GPU, even though there are 2 GPUs available:

>>> torch.cuda.device_count()
2

Launching the script from the command line:

accelerate launch training.py
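
As an aside, the same settings could in principle be passed explicitly on the command line instead of relying on the saved config. A sketch, assuming the standard accelerate launch flags:

accelerate launch --multi_gpu --num_processes 2 --fp16 training.py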

The print statement print(accelerator.device) returns the following (happy to add more debugging):

cuda
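
For completeness, a few extra fields could be printed next to the device to confirm how many processes were actually launched. A sketch, assuming the usual Accelerator attributes:

    # hypothetical extra debugging; these mirror what `accelerate test` reports
    print(accelerator.distributed_type)     # expected MULTI_GPU rather than NO
    print(accelerator.num_processes)        # expected 2 with the config above
    print(accelerator.process_index)        # 0 or 1, one per launched process
    print(accelerator.local_process_index)  # local rank on this machine

In a correctly recognized two-GPU launch, each process would typically report its own device (cuda:0 and cuda:1) rather than a bare cuda.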

Any help is appreciated. Thank you!

sgugger commented 3 years ago

Thanks for all the info. Could you run accelerate test and paste the output here?

dzorlu commented 3 years ago

Thanks for the fast response! Here is the output:

Running:  accelerate-launch /usr/local/lib/python3.6/dist-packages/accelerate/test_utils/test_script.py --config_file=None
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: NO
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda
stdout: Use FP16 precision: False
stdout: 
stdout: 
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout: 
stdout: **DataLoader integration test**
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: 
stdout: **Training integration test**
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
Test is a success! You are ready for your distributed training!

sgugger commented 3 years ago

OK, so it looks like your config is not being recognized (the test only launches 1 process instead of 2). So the problem is in the launch configuration, not in your training script.

Are you sure it's the one in ~/.cache/huggingface/accelerate/default_config.yaml? You don't have an environment variable that changes the cache directory in some way, do you?
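
One quick way to rule out a cache-directory mismatch is to point the launcher at the file explicitly via --config_file (a sketch; the path below is the default location, adjust it if yours differs):

accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml training.py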

dzorlu commented 3 years ago

This seems to have been a false alarm; the process now sees both GPUs. Thank you for the quick turnaround. Can't wait to use the library more. Deniz

sgugger commented 3 years ago

Closing the issue then, but feel free to reopen if you get the problem again!