huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

UL2 Training with HF Trainer + DeepSpeed Zero3 Results in CUDA Illegal Memory Exception #21378

Closed michaelroyzen closed 1 year ago

michaelroyzen commented 1 year ago

System Info

transformers==4.26.0, torch==1.13.1, deepspeed==0.8; hardware: 8x A100-80GB

Fine-tuning UL2 with the Huggingface Trainer and DeepSpeed Zero2 or Zero3 results in a CUDA Illegal Memory Exception. This is true with every Huggingface Trainer script, PyTorch version (1.12 and 1.13), DeepSpeed version (0.6.7, 0.7.7, 0.8), and CUDA version (11.3 and 11.8) that I've tried. The same scripts work just fine with flan-t5-xxl.

[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Any thoughts @stas00? Your help would be appreciated.

Who can help?

@stas00

Reproduction

Try fine-tuning UL2 on any task/dataset using DeepSpeed Zero2/Zero3. You should encounter the error.

Expected behavior

Training proceeds normally.

stas00 commented 1 year ago

I have never tried running UL2 - please help me reproduce it.

And of course, for the future, do follow the instructions from the error message and re-run with CUDA_LAUNCH_BLOCKING=1 (except this feature is broken in recent NCCL (pt-1.13) and it'll hang: https://github.com/NVIDIA/nccl/issues/750). The async nature often makes it impossible to get a real traceback; CUDA_LAUNCH_BLOCKING=1 turns the async mode off and gives you a normal traceback.
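
For example, with the deepspeed launcher it's enough to prefix the launch command, since the launcher forwards the current environment to its workers - a sketch, with <args> standing in for your usual arguments:

CUDA_LAUNCH_BLOCKING=1 deepspeed train.py <args>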

michaelroyzen commented 1 year ago

Thank you, @stas00. This is the error with CUDA_LAUNCH_BLOCKING=1:

[2023-01-31 01:03:02,046] [INFO] [utils.py:827:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-01-31 01:03:02,047] [INFO] [utils.py:832:see_memory_usage] MA 4.56 GB         Max_MA 4.56 GB         CA 5.48 GB         Max_CA 5 GB 
[2023-01-31 01:03:02,048] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 30.74 GB, percent = 2.3%
Parameter Offload: Total persistent parameters: 664576 in 164 params
[2023-01-31 01:03:02,287] [INFO] [utils.py:827:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-01-31 01:03:02,289] [INFO] [utils.py:832:see_memory_usage] MA 4.56 GB         Max_MA 4.56 GB         CA 5.48 GB         Max_CA 5 GB 
[2023-01-31 01:03:02,289] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 30.59 GB, percent = 2.3%
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
[2023-01-31 01:03:08,861] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26370
[2023-01-31 01:03:08,879] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26371
[2023-01-31 01:03:08,879] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26372
[2023-01-31 01:03:08,894] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26373
[2023-01-31 01:03:08,908] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26374
[2023-01-31 01:03:09,454] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26375
[2023-01-31 01:03:09,471] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26376
[2023-01-31 01:03:09,485] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 26377

stas00 commented 1 year ago

Hmm, I have no idea based on the log. Thank you for sharing it, Michael.

How do I reproduce the problem?

Is it possible that you're running out of CPU memory? Sometimes you get a cpu-oom event and the program gets killed in the middle of the run, but usually the OS should log this event in the console or syslog.

michaelroyzen commented 1 year ago

You can reproduce the problem by trying to fine-tune UL2 in BF16 using DeepSpeed Zero2/Zero3 and the HF Trainer. The dataset doesn't seem to matter; I think any Seq2Seq fine-tuning script should reproduce it.

I doubt it's a resource issue. It's a GCP a2-ultragpu instance with 1.3TB of CPU memory. GPU memory also seems to be fine. I remember training a UL2 model back in September with DeepSpeed successfully, but now I can't seem to.

Do you have access to an A100 node to try this out?

stas00 commented 1 year ago

Sounds good.

But why is it so difficult to copy-n-paste the commands and configs that fail for you, instead of having me figure everything out from scratch? Please meet me halfway.

michaelroyzen commented 1 year ago

Okay, my bad. It's just all custom, but here goes.

Train:

import functools
import json
import argparse
from datetime import datetime
import os

from utils.dataset_formats import Seq2SeqDataset

import numpy as np
import nltk
import wandb
import torch

from datasets import load_metric
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoTokenizer, AutoModelForSeq2SeqLM, AddedToken

class Trainer:
    def __init__(self, args) -> None:
        self.train_dataset = None
        self.val_dataset = None
        self.args = args
        self.metric = load_metric("rouge")
        self.trainer = None

    # Build train/val Seq2SeqDatasets from json files
    def prepare_datsets_for_training(self) -> None:
        with open(self.args.train) as f:
            train_data_json = json.load(f)
        with open(self.args.val) as f:
            val_data_json = json.load(f)

        self.train_dataset = Seq2SeqDataset(train_data_json)
        self.val_dataset = Seq2SeqDataset(val_data_json)

        self.tokenizer = None

    # Train and save a Seq2Seq model
    def train_model(self) -> AutoModelForSeq2SeqLM:
        training_args = Seq2SeqTrainingArguments(output_dir=self.args.save_dir, num_train_epochs=self.args.num_epochs, logging_steps=1, save_steps=self.args.save_steps or self.args.eval_steps,
                                  per_device_train_batch_size=self.args.per_device_train_batch_size, per_device_eval_batch_size=self.args.per_device_eval_batch_size,
                                  logging_dir=self.args.save_dir, bf16=self.args.bf16, bf16_full_eval=self.args.bf16, fp16=False, gradient_accumulation_steps=self.args.gradient_accumulation_steps,
                                  overwrite_output_dir=True, evaluation_strategy="steps", eval_steps=self.args.eval_steps,
                                  predict_with_generate=True, report_to="wandb", learning_rate=self.args.learning_rate, lr_scheduler_type="cosine", gradient_checkpointing=self.args.gradient_checkpointing, deepspeed=self.args.deepspeed, log_level="error", log_level_replica="error")

        tokenizer = AutoTokenizer.from_pretrained(self.args.model)
        added_tokens = [AddedToken("<"), AddedToken("<SOURCE>"), AddedToken("{"), AddedToken("}"), AddedToken("\n"), AddedToken("\t"), AddedToken("  "), AddedToken("    "), AddedToken("        "), AddedToken("`")]
        tokenizer.add_special_tokens({"additional_special_tokens": added_tokens})
        tokenizer.save_pretrained(self.args.save_dir + "/tokenizer")
        self.tokenizer = tokenizer

        model = AutoModelForSeq2SeqLM.from_pretrained(self.args.model)

        os.environ["WANDB_PROJECT"] = self.args.name

        if torch.distributed.get_rank() == 0:
            run_name = datetime.now().strftime('%b-%d-%I%M%p-%G')
            wandb.tensorboard.patch(root_logdir=self.args.save_dir)
            wandb.init(name=run_name, entity="hellocognition")

            nltk.download('punkt')

        # Barrier for distributed training
        print("Rank {} reached barrier 1".format(torch.distributed.get_rank()))
        torch.distributed.barrier()

        model_collate_fn = functools.partial(
            self.make_batch, tokenizer=tokenizer, max_input_len=self.args.max_input_len, max_target_len=self.args.max_target_len,
        )

        assert self.train_dataset and self.val_dataset

        self.trainer = Seq2SeqTrainer(model=model, args=training_args, train_dataset=self.train_dataset,
                eval_dataset=self.val_dataset, data_collator=model_collate_fn, compute_metrics=self.compute_metrics)

        # Barrier for distributed training
        print("Rank {} reached barrier 2".format(torch.distributed.get_rank()))
        torch.distributed.barrier()

        self.trainer.train()

        if torch.distributed.get_rank() == 0:
            self.trainer.save_model(self.args.save_dir + '/final_model')

        return model

    # Truncate examples to max input lengths and make a torch.Tensor input/output batch
    def make_batch(self, example_list: list, tokenizer: AutoTokenizer, max_input_len: int, max_target_len: int):
        model_input_list = [model_input for model_input, _ in example_list]
        gold_answer_list = [gold_answer for _, gold_answer in example_list]
        model_input_tokens = tokenizer.batch_encode_plus(model_input_list, max_length=max_input_len, padding=True, truncation=True)
        model_input_ids, model_input_mask = (
            torch.tensor(model_input_tokens["input_ids"]),
            torch.tensor(model_input_tokens["attention_mask"])
        )
        gold_answer_tokens = tokenizer.batch_encode_plus(gold_answer_list, max_length=max_target_len, padding=True, truncation=True)
        gold_answer_ids, gold_answer_mask = (
            torch.tensor(gold_answer_tokens["input_ids"]),
            torch.tensor(gold_answer_tokens["attention_mask"])
        )

        lm_labels = gold_answer_ids[:, :].contiguous().clone()
        # Set pad tokens to -100 to be ignored by cross entropy loss
        lm_labels[gold_answer_mask[:, :].contiguous() == 0] = -100
        model_inputs = {
            "input_ids": model_input_ids,
            "attention_mask": model_input_mask,
            "labels": lm_labels,
        }
        return model_inputs

    # Compute ROUGE metrics
    def compute_metrics(self, eval_pred: list):
        predictions, labels = eval_pred
        decoded_preds = self.tokenizer.batch_decode(predictions, skip_special_tokens=False)
        # Replace -100 in the labels as we can't decode them.
        labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
        decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=False)

        # Rouge expects a newline after each sentence
        decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
        decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

        result = self.metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
        # Extract a few results
        result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

        # Add mean generated length
        prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in predictions]
        result["gen_len"] = np.mean(prediction_lens)

        return {k: round(v, 4) for k, v in result.items()}

if __name__ == "__main__":
    # parse args
    parser = argparse.ArgumentParser(description='Train Argument Parser')
    parser.add_argument('--name', help='name of the model to be trained using the modeltype-datasetname convention, e.g. flan-t5-3B-gpt3', required=True)
    parser.add_argument('--model', help='name or path of the model to train, e.g. google/flan-t5-xl', required=True)
    parser.add_argument('--train', help='path to the json train dataset', required=True)
    parser.add_argument('--val', help='path to the json val dataset', required=True)
    parser.add_argument('--max_input_len', type=int, help='maximum number of tokens allowed in training input', required=True)
    parser.add_argument('--max_target_len', type=int, help='maximum number of tokens allowed in training target output', required=True)
    parser.add_argument('--save_dir', help='save directory after training', required=True)
    parser.add_argument('--num_epochs', type=int, help='number of epochs to train', required=True)
    parser.add_argument('--learning_rate', type=float, help='learning rate', required=True)
    parser.add_argument('--eval_steps', type=int, help='how many steps to eval after', required=True)
    parser.add_argument('--save_steps', type=int, help='how many steps to save after', required=False)
    parser.add_argument('--gradient_accumulation_steps', type=int, help='how many steps to accumulate gradient for (increases effective batch size)', required=True)
    parser.add_argument('--per_device_train_batch_size', type=int, help='train batch size', required=True)
    parser.add_argument('--per_device_eval_batch_size', type=int, help='eval batch size', required=True)
    parser.add_argument('--bf16', help='enable bfloat16 training and eval', default=False, action="store_true")
    parser.add_argument('--gradient_checkpointing', help='allow larger sequence lengths to fit in memory', default=False, action="store_true")
    parser.add_argument('--deepspeed', help='path of the deepspeed config', required=True)
    parser.add_argument('--local_rank')  # passed in by the deepspeed launcher
    args = parser.parse_args()

    # log into wandb
    os.environ['WANDB_API_KEY'] = "WANDB-KEY"

    # make trainer
    trainer = Trainer(args)

    # prepare dataset
    trainer.prepare_datsets_for_training()

    # perform training
    trained_model = trainer.train_model()

Seq2Seq Dataset:

from torch.utils.data import Dataset

class Seq2SeqDataset(Dataset):
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def make_example(self, i):
        prompt = self.examples[i]["prompt"]
        example_input = self.examples[i]["example_input"]

        gold_answer = self.examples[i]["gold_answer"]
        model_input = "{}\n{}".format(prompt, example_input)

        return (model_input, gold_answer)

    def __getitem__(self, i):
        return self.make_example(i)
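
For a quick standalone sanity check (assuming a JSON file in the schema shown further down), the class can be exercised directly:

import json

with open("demo_train.json") as f:
    examples = json.load(f)

ds = Seq2SeqDataset(examples)
print(len(ds))
print(ds[0])  # a (model_input, gold_answer) tuple of strings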

ds_config:

{
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
      "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 3,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e12,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 2e9,
      "stage3_max_reuse_distance": 1e9,
      "gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": "auto"
      }
    }
  }

This works great with flan-t5, but fails on UL2. Here is the detailed error that I get without CUDA_LAUNCH_BLOCKING=1:

[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)                                            
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.                                     
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):                                          
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f49bd1a6457 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)        
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f49bd1703ec in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f49bd246c64 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e0dc (0x7f49bd21e0dc in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)                                 
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f49bd221054 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)   
frame #5: <unknown function> + 0x4f6823 (0x7f49aa4ab823 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                            
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f49bd1869e0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)                            
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f49bd186af9 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)                              
frame #8: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7f49aa4add1b in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x8c (0x7f491bf3ae8c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)   
frame #10: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x9 (0x7f491bf3b349 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)   
frame #11: <unknown function> + 0xbe302c (0x7f49aab9802c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                           
frame #12: <unknown function> + 0x3e4272 (0x7f49aa399272 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                           
frame #13: <unknown function> + 0x3e51af (0x7f49aa39a1af in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                           
frame #14: <unknown function> + 0xe5698 (0x56347c18d698 in /opt/conda/bin/python3.7)                                                                       
frame #15: <unknown function> + 0x1f7b89 (0x56347c29fb89 in /opt/conda/bin/python3.7)                                                                      
frame #16: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #17: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #18: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #19: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #20: _PyEval_EvalFrameDefault + 0xb2a (0x56347c26010a in /opt/conda/bin/python3.7)                                                                   
frame #21: <unknown function> + 0x1f7b66 (0x56347c29fb66 in /opt/conda/bin/python3.7)                                                                      
frame #22: _PyFunction_FastCallDict + 0xaef (0x56347c1a78cf in /opt/conda/bin/python3.7)                                                                   
frame #23: _PyEval_EvalFrameDefault + 0x1f86 (0x56347c261566 in /opt/conda/bin/python3.7)                                                                  
frame #24: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #25: _PyFunction_FastCallKeywords + 0x320 (0x56347c1d7570 in /opt/conda/bin/python3.7)                                                               
frame #26: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #27: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #28: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #29: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #30: _PyEval_EvalFrameDefault + 0xb2a (0x56347c26010a in /opt/conda/bin/python3.7)                                                                   
frame #31: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #32: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #33: <unknown function> + 0x185ea4 (0x56347c22dea4 in /opt/conda/bin/python3.7)                                                                      
frame #34: _PyObject_FastCallKeywords + 0x18c (0x56347c238b8c in /opt/conda/bin/python3.7)                                                                 
frame #35: <unknown function> + 0x191f79 (0x56347c239f79 in /opt/conda/bin/python3.7)                                                                      
frame #36: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                                                                  
frame #37: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #38: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #39: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #40: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #41: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #42: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #43: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #44: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #45: <unknown function> + 0x185ea4 (0x56347c22dea4 in /opt/conda/bin/python3.7)                                                                      
frame #46: _PyObject_FastCallKeywords + 0x18c (0x56347c238b8c in /opt/conda/bin/python3.7)                                                                 
frame #47: <unknown function> + 0x191f79 (0x56347c239f79 in /opt/conda/bin/python3.7)                                                                      
frame #48: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                                                                  
frame #49: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #50: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #51: _PyEval_EvalFrameDefault + 0x1f86 (0x56347c261566 in /opt/conda/bin/python3.7)                                                                  
frame #52: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #53: _PyFunction_FastCallKeywords + 0x320 (0x56347c1d7570 in /opt/conda/bin/python3.7)                                                               
frame #54: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #55: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                                                                  
frame #56: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #57: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #58: <unknown function> + 0x185a63 (0x56347c22da63 in /opt/conda/bin/python3.7)                                                                      
frame #59: PyObject_Call + 0x6c (0x56347c1b09dc in /opt/conda/bin/python3.7)
frame #60: <unknown function> + 0x21d3e7 (0x56347c2c53e7 in /opt/conda/bin/python3.7)                                                                      
frame #61: _PyObject_FastCallKeywords + 0x3cb (0x56347c238dcb in /opt/conda/bin/python3.7)                                                                 
frame #62: <unknown function> + 0x191f79 (0x56347c239f79 in /opt/conda/bin/python3.7)                                                                      
frame #63: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)

stas00 commented 1 year ago

and cmd line?

michaelroyzen commented 1 year ago

deepspeed train.py \
    --name ul2-test \
    --model google/ul2 \
    --train <your train file>.json \
    --val <your val file>.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir <your save directory path> \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 8 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --deepspeed utils/ds_config_zero3.json \
    --bf16

I can't share the train files, unfortunately, but as per the Seq2SeqDataset schema, the train/val files are a list of

{
    "prompt": "prompt",
    "example_input": "input",
    "gold_answer": "gold_answer"
}

objects dumped to a JSON file.

stas00 commented 1 year ago

ok, then I can't support you, Michael.

Once you provide a way for me to reproduce the problem I'd be happy to try to understand and come up with a solution.

michaelroyzen commented 1 year ago

Okay, my apologies again. Here are some dummy files that can be used to reproduce the issue.

https://phind-demo.s3.amazonaws.com/demo_train.json https://phind-demo.s3.amazonaws.com/demo_val.json
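
These are just lists of objects in the schema above, so equivalent placeholder files can be generated with a minimal sketch like this (field contents are arbitrary):

import json

examples = [
    {
        "prompt": "Summarize the following text:",
        "example_input": f"dummy input {i}",
        "gold_answer": f"dummy answer {i}",
    }
    for i in range(64)
]

# same dummy data for train and val, just to exercise the pipeline
for path in ("demo_train.json", "demo_val.json"):
    with open(path, "w") as f:
        json.dump(examples, f)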

So the train script would be

deepspeed train.py \
    --name ul2-test \
    --model google/ul2 \
    --train demo_train.json \
    --val demo_val.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir <your save directory path> \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 8 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --deepspeed ds_config.json \
    --bf16

stas00 commented 1 year ago

Now please test that the code you shared works, as it fails here:

  File "train.py", line 183, in <module>
    trained_model = trainer.train_model()
  File "train.py", line 96, in train_model
    self.trainer.train()
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 1557, in train
    return inner_training_loop(
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 1569, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 835, in get_train_dataloader
    train_dataset = self._remove_unused_columns(train_dataset, description="training")
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 711, in _remove_unused_columns
    ignored_columns = list(set(dataset.column_names) - set(signature_columns))
  File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 1673, in column_names
    return self._data.column_names
AttributeError: 'Seq2SeqDataset' object has no attribute '_data'

I dumped your Seq2SeqDataset into your main script (train.py), so the line numbers won't match your original main script.
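
As an aside, the Trainer's column pruning assumes a datasets.Dataset; with a custom torch Dataset like yours, a likely workaround (not verified in this thread) is to turn the pruning off in the training args:

from transformers import Seq2SeqTrainingArguments

# remove_unused_columns=False makes the Trainer skip _remove_unused_columns,
# which expects a datasets.Dataset exposing .column_names
training_args = Seq2SeqTrainingArguments(
    output_dir="save_dir",
    remove_unused_columns=False,
)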

Also, does the problem still occur if you use a much smaller UL2 model? E.g. I'm trying with yhavinga/ul2-small-dutch-english - at this point we don't care about the outcome, just about reproducing your problem.

I'm trying on 1 gpu first:

rm -rf save_dir; CUDA_VISIBLE_DEVICES=0 deepspeed train.py \
    --name ul2-test \
    --model yhavinga/ul2-small-dutch-english \
    --train demo_train.json \
    --val demo_val.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir save_dir \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 8 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --deepspeed ds_config.json \
    --bf16

Let's try to come up with the smallest possible set up that reproduces the issue, then it'll be easy to debug.

michaelroyzen commented 1 year ago

I've just tested the scripts with flan-t5-small (ul2-small-dutch-english had an odd CUDA error, different from the one described above). Additionally, ul2-small-dutch-english is not a representative example, as it uses a different activation function from google's UL2 (gated-gelu vs. gated-silu).

Please refer to my S3 bucket for the scripts I've confirmed run on my machine, along with their corresponding directory structure.

With these exact files and the latest version of transformers/datasets, I've just been able to run:

deepspeed train.py \
    --name ul2-test \
    --model google/flan-t5-small \
    --train demo_train.json \
    --val demo_val.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir save_dir \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 8 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --deepspeed ds_config.json \
    --bf16

But all of the UL2 models get CUDA errors. Would appreciate your help.

Thanks!

stas00 commented 1 year ago

Thank you, Michael. With this last version of your code I can run the example you shared.

OK, so what is the smallest UL2 model you still see the problem with? https://huggingface.co/models?sort=downloads&search=ul2

I ran the above code with --model Finnish-NLP/ul2-small-nl24-finnish on 1 and 2 GPUs and had no problem.

Additionally, once you try a smaller UL2 model, do you get the same problem with (a) 1 GPU, (b) 2 GPUs?
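
E.g., restricting visible devices as in the command above - a sketch, with <args> standing in for the full argument list:

# 1 gpu
CUDA_VISIBLE_DEVICES=0 deepspeed train.py <args>
# 2 gpus
CUDA_VISIBLE_DEVICES=0,1 deepspeed train.py <args>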

michaelroyzen commented 1 year ago

Running with --model Finnish-NLP/ul2-small-nl24-finnish works for me as well with any number of GPUs (from 1 to 8).

But I don't think it's representative because it uses a different activation function than google/ul2. Unfortunately there are no "real" smaller UL2 models, unlike the flan-t5 series where everything is the same except for scale.

UPDATE: I take that back. yhavinga/ul2-base-en-nl also uses gated-silu. Running that experiment now.
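
For reference, the activation function can be read straight off the model config - a quick sketch, assuming the hub models are reachable (per the discussion above, the first two should report gated-silu and the last gated-gelu):

from transformers import AutoConfig

for name in ("google/ul2", "yhavinga/ul2-base-en-nl", "yhavinga/ul2-small-dutch-english"):
    print(name, AutoConfig.from_pretrained(name).feed_forward_proj)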

michaelroyzen commented 1 year ago

Running

deepspeed train.py \
    --name ul2-test \
    --model yhavinga/ul2-base-en-nl \
    --train demo_train.json \
    --val demo_val.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir save_dir \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 8 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --deepspeed ds_config.json \
    --bf16

on 8 GPUs, I got:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/michael/train.py:164 in <module>                                                           │
│                                                                                                  │
│   161 │   trainer.prepare_datsets_for_training()                                                 │
│   162 │                                                                                          │
│   163 │   # perform training                                                                     │
│ ❱ 164 │   trained_model = trainer.train_model()                                                  │
│   165                                                                                            │
│                                                                                                  │
│ /home/michael/train.py:77 in train_model                                                         │
│                                                                                                  │
│    74 │   │   print("Rank {} reached barrier 2".format(torch.distributed.get_rank()))            │
│    75 │   │   torch.distributed.barrier()                                                        │
│    76 │   │                                                                                      │
│ ❱  77 │   │   self.trainer.train()                                                               │
│    78 │   │                                                                                      │
│    79 │   │   if torch.distributed.get_rank() == 0:                                              │
│    80 │   │   │   trainer.save(self.args.save_dir + '/final_model')                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:1531 in train                     │
│                                                                                                  │
│   1528 │   │   │   args=args,                                                                    │
│   1529 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1530 │   │   │   trial=trial,                                                                  │
│ ❱ 1531 │   │   │   ignore_keys_for_eval=ignore_keys_for_eval,                                    │
│   1532 │   │   )                                                                                 │
│   1533 │                                                                                         │
│   1534 │   def _inner_training_loop(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:1775 in _inner_training_loop      │
│                                                                                                  │
│   1772 │   │   │   │   │   with model.no_sync():                                                 │
│   1773 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1774 │   │   │   │   else:                                                                     │
│ ❱ 1775 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1776 │   │   │   │                                                                             │
│   1777 │   │   │   │   if (                                                                      │
│   1778 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:2523 in training_step             │
│                                                                                                  │
│   2520 │   │   │   return loss_mb.reduce_mean().detach().to(self.args.device)                    │
│   2521 │   │                                                                                     │
│   2522 │   │   with self.compute_loss_context_manager():                                         │
│ ❱ 2523 │   │   │   loss = self.compute_loss(model, inputs)                                       │
│   2524 │   │                                                                                     │
│   2525 │   │   if self.args.n_gpu > 1:                                                           │
│   2526 │   │   │   loss = loss.mean()  # mean() to average on multi-gpu parallel training        │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:2555 in compute_loss              │
│                                                                                                  │
│   2552 │   │   │   labels = inputs.pop("labels")                                                 │
│   2553 │   │   else:                                                                             │
│   2554 │   │   │   labels = None                                                                 │
│ ❱ 2555 │   │   outputs = model(**inputs)                                                         │
│   2556 │   │   # Save past state if it exists                                                    │
│   2557 │   │   # TODO: this needs to be fixed and made cleaner later.                            │
│   2558 │   │   if self.args.past_index >= 0:                                                     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1194 in _call_impl             │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn                  │
│                                                                                                  │
│    8 │   │                                                                                       │
│    9 │   │   def wrapped_fn(*args, **kwargs):                                                    │
│   10 │   │   │   with torch.cuda.nvtx.range(func.__qualname__):                                  │
│ ❱ 11 │   │   │   │   return func(*args, **kwargs)                                                │
│   12 │   │                                                                                       │
│   13 │   │   return wrapped_fn                                                                   │
│   14 │   else:                                                                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py:1727 in forward               │
│                                                                                                  │
│   1724 │   │   if self.fp16_auto_cast():                                                         │
│   1725 │   │   │   inputs = self._cast_inputs_half(inputs)                                       │
│   1726 │   │                                                                                     │
│ ❱ 1727 │   │   loss = self.module(*inputs, **kwargs)                                             │
│   1728 │   │                                                                                     │
│   1729 │   │   if self.zero_optimization_partition_weights():                                    │
│   1730 │   │   │   # Disable automated discovery of external parameters                          │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1618 in forward     │
│                                                                                                  │
│   1615 │   │   │   │   head_mask=head_mask,                                                      │
│   1616 │   │   │   │   output_attentions=output_attentions,                                      │
│   1617 │   │   │   │   output_hidden_states=output_hidden_states,                                │
│ ❱ 1618 │   │   │   │   return_dict=return_dict,                                                  │
│   1619 │   │   │   )                                                                             │
│   1620 │   │   elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):            │
│   1621 │   │   │   encoder_outputs = BaseModelOutput(                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1051 in forward     │
│                                                                                                  │
│   1048 │   │   │   │   │   cross_attn_layer_head_mask=cross_attn_layer_head_mask,                │
│   1049 │   │   │   │   │   past_key_value=past_key_value,                                        │
│   1050 │   │   │   │   │   use_cache=use_cache,                                                  │
│ ❱ 1051 │   │   │   │   │   output_attentions=output_attentions,                                  │
│   1052 │   │   │   │   )                                                                         │
│   1053 │   │   │                                                                                 │
│   1054 │   │   │   # layer_outputs is a tuple with:                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:680 in forward      │
│                                                                                                  │
│    677 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    678 │   │   │   past_key_value=self_attn_past_key_value,                                      │
│    679 │   │   │   use_cache=use_cache,                                                          │
│ ❱  680 │   │   │   output_attentions=output_attentions,                                          │
│    681 │   │   )                                                                                 │
│    682 │   │   hidden_states, present_key_value_state = self_attention_outputs[:2]               │
│    683 │   │   attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs an  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:586 in forward      │
│                                                                                                  │
│    583 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    584 │   │   │   past_key_value=past_key_value,                                                │
│    585 │   │   │   use_cache=use_cache,                                                          │
│ ❱  586 │   │   │   output_attentions=output_attentions,                                          │
│    587 │   │   )                                                                                 │
│    588 │   │   hidden_states = hidden_states + self.dropout(attention_output[0])                 │
│    589 │   │   outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:498 in forward      │
│                                                                                                  │
│    495 │   │   │   return hidden_states                                                          │
│    496 │   │                                                                                     │
│    497 │   │   # get query states                                                                │
│ ❱  498 │   │   query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length,  │
│    499 │   │                                                                                     │
│    500 │   │   # get key/value states                                                            │
│    501 │   │   key_states = project(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py:114 in forward                 │
│                                                                                                  │
│   111 │   │   │   init.uniform_(self.bias, -bound, bound)                                        │
│   112 │                                                                                          │
│   113 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)                                     │
│   115 │                                                                                          │
│   116 │   def extra_repr(self) -> str:                                                           │
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(                          │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/linear.py:116 in zero3_linear_wrap │
│                                                                                                  │
│   113                                                                                            │
│   114 def zero3_linear_wrap(input, weight, bias=None):                                           │
│   115 │   if bias is None:                                                                       │
│ ❱ 116 │   │   return LinearFunctionForZeroStage3.apply(input, weight)                            │
│   117 │   else:                                                                                  │
│   118 │   │   return LinearFunctionForZeroStage3.apply(input, weight, bias)                      │
│   119                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py:97 in decorate_fwd        │
│                                                                                                  │
│    94 │   def decorate_fwd(*args, **kwargs):                                                     │
│    95 │   │   if cast_inputs is None:                                                            │
│    96 │   │   │   args[0]._fwd_used_autocast = torch.is_autocast_enabled()                       │
│ ❱  97 │   │   │   return fwd(*args, **kwargs)                                                    │
│    98 │   │   else:                                                                              │
│    99 │   │   │   autocast_context = torch.is_autocast_enabled()                                 │
│   100 │   │   │   args[0]._fwd_used_autocast = False                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/linear.py:61 in forward            │
│                                                                                                  │
│    58 │   │   │   # fused op is marginally faster                                                │
│    59 │   │   │   ret = torch.addmm(bias, input, weight.t())                                     │
│    60 │   │   else:                                                                              │
│ ❱  61 │   │   │   output = input.matmul(weight.t())                                              │
│    62 │   │   │   if bias is not None:                                                           │
│    63 │   │   │   │   output += bias                                                             │
│    64 │   │   │   ret = output                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Running with CUDA_VISIBLE_DEVICES=0, I get a slightly different error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/michael/train.py:164 in <module>                                                           │
│                                                                                                  │
│   161 │   trainer.prepare_datsets_for_training()                                                 │
│   162 │                                                                                          │
│   163 │   # perform training                                                                     │
│ ❱ 164 │   trained_model = trainer.train_model()                                                  │
│   165                                                                                            │
│                                                                                                  │
│ /home/michael/train.py:77 in train_model                                                         │
│                                                                                                  │
│    74 │   │   print("Rank {} reached barrier 2".format(torch.distributed.get_rank()))            │
│    75 │   │   torch.distributed.barrier()                                                        │
│    76 │   │                                                                                      │
│ ❱  77 │   │   self.trainer.train()                                                               │
│    78 │   │                                                                                      │
│    79 │   │   if torch.distributed.get_rank() == 0:                                              │
│    80 │   │   │   trainer.save(self.args.save_dir + '/final_model')                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:1531 in train                     │
│                                                                                                  │
│   1528 │   │   │   args=args,                                                                    │
│   1529 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1530 │   │   │   trial=trial,                                                                  │
│ ❱ 1531 │   │   │   ignore_keys_for_eval=ignore_keys_for_eval,                                    │
│   1532 │   │   )                                                                                 │
│   1533 │                                                                                         │
│   1534 │   def _inner_training_loop(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:1775 in _inner_training_loop      │
│                                                                                                  │
│   1772 │   │   │   │   │   with model.no_sync():                                                 │
│   1773 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1774 │   │   │   │   else:                                                                     │
│ ❱ 1775 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1776 │   │   │   │                                                                             │
│   1777 │   │   │   │   if (                                                                      │
│   1778 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:2523 in training_step             │
│                                                                                                  │
│   2520 │   │   │   return loss_mb.reduce_mean().detach().to(self.args.device)                    │
│   2521 │   │                                                                                     │
│   2522 │   │   with self.compute_loss_context_manager():                                         │
│ ❱ 2523 │   │   │   loss = self.compute_loss(model, inputs)                                       │
│   2524 │   │                                                                                     │
│   2525 │   │   if self.args.n_gpu > 1:                                                           │
│   2526 │   │   │   loss = loss.mean()  # mean() to average on multi-gpu parallel training        │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/trainer.py:2555 in compute_loss              │
│                                                                                                  │
│   2552 │   │   │   labels = inputs.pop("labels")                                                 │
│   2553 │   │   else:                                                                             │
│   2554 │   │   │   labels = None                                                                 │
│ ❱ 2555 │   │   outputs = model(**inputs)                                                         │
│   2556 │   │   # Save past state if it exists                                                    │
│   2557 │   │   # TODO: this needs to be fixed and made cleaner later.                            │
│   2558 │   │   if self.args.past_index >= 0:                                                     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1194 in _call_impl             │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn                  │
│                                                                                                  │
│    8 │   │                                                                                       │
│    9 │   │   def wrapped_fn(*args, **kwargs):                                                    │
│   10 │   │   │   with torch.cuda.nvtx.range(func.__qualname__):                                  │
│ ❱ 11 │   │   │   │   return func(*args, **kwargs)                                                │
│   12 │   │                                                                                       │
│   13 │   │   return wrapped_fn                                                                   │
│   14 │   else:                                                                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py:1727 in forward               │
│                                                                                                  │
│   1724 │   │   if self.fp16_auto_cast():                                                         │
│   1725 │   │   │   inputs = self._cast_inputs_half(inputs)                                       │
│   1726 │   │                                                                                     │
│ ❱ 1727 │   │   loss = self.module(*inputs, **kwargs)                                             │
│   1728 │   │                                                                                     │
│   1729 │   │   if self.zero_optimization_partition_weights():                                    │
│   1730 │   │   │   # Disable automated discovery of external parameters                          │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1618 in forward     │
│                                                                                                  │
│   1615 │   │   │   │   head_mask=head_mask,                                                      │
│   1616 │   │   │   │   output_attentions=output_attentions,                                      │
│   1617 │   │   │   │   output_hidden_states=output_hidden_states,                                │
│ ❱ 1618 │   │   │   │   return_dict=return_dict,                                                  │
│   1619 │   │   │   )                                                                             │
│   1620 │   │   elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):            │
│   1621 │   │   │   encoder_outputs = BaseModelOutput(                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1051 in forward     │
│                                                                                                  │
│   1048 │   │   │   │   │   cross_attn_layer_head_mask=cross_attn_layer_head_mask,                │
│   1049 │   │   │   │   │   past_key_value=past_key_value,                                        │
│   1050 │   │   │   │   │   use_cache=use_cache,                                                  │
│ ❱ 1051 │   │   │   │   │   output_attentions=output_attentions,                                  │
│   1052 │   │   │   │   )                                                                         │
│   1053 │   │   │                                                                                 │
│   1054 │   │   │   # layer_outputs is a tuple with:                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:680 in forward      │
│                                                                                                  │
│    677 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    678 │   │   │   past_key_value=self_attn_past_key_value,                                      │
│    679 │   │   │   use_cache=use_cache,                                                          │
│ ❱  680 │   │   │   output_attentions=output_attentions,                                          │
│    681 │   │   )                                                                                 │
│    682 │   │   hidden_states, present_key_value_state = self_attention_outputs[:2]               │
│    683 │   │   attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs an  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:586 in forward      │
│                                                                                                  │
│    583 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    584 │   │   │   past_key_value=past_key_value,                                                │
│    585 │   │   │   use_cache=use_cache,                                                          │
│ ❱  586 │   │   │   output_attentions=output_attentions,                                          │
│    587 │   │   )                                                                                 │
│    588 │   │   hidden_states = hidden_states + self.dropout(attention_output[0])                 │
│    589 │   │   outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:498 in forward      │
│                                                                                                  │
│    495 │   │   │   return hidden_states                                                          │
│    496 │   │                                                                                     │
│    497 │   │   # get query states                                                                │
│ ❱  498 │   │   query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length,  │
│    499 │   │                                                                                     │
│    500 │   │   # get key/value states                                                            │
│    501 │   │   key_states = project(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1212 in _call_impl             │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py:114 in forward                 │
│                                                                                                  │
│   111 │   │   │   init.uniform_(self.bias, -bound, bound)                                        │
│   112 │                                                                                          │
│   113 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)                                     │
│   115 │                                                                                          │
│   116 │   def extra_repr(self) -> str:                                                           │
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(                          │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/linear.py:116 in zero3_linear_wrap │
│                                                                                                  │
│   113                                                                                            │
│   114 def zero3_linear_wrap(input, weight, bias=None):                                           │
│   115 │   if bias is None:                                                                       │
│ ❱ 116 │   │   return LinearFunctionForZeroStage3.apply(input, weight)                            │
│   117 │   else:                                                                                  │
│   118 │   │   return LinearFunctionForZeroStage3.apply(input, weight, bias)                      │
│   119                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py:97 in decorate_fwd        │
│                                                                                                  │
│    94 │   def decorate_fwd(*args, **kwargs):                                                     │
│    95 │   │   if cast_inputs is None:                                                            │
│    96 │   │   │   args[0]._fwd_used_autocast = torch.is_autocast_enabled()                       │
│ ❱  97 │   │   │   return fwd(*args, **kwargs)                                                    │
│    98 │   │   else:                                                                              │
│    99 │   │   │   autocast_context = torch.is_autocast_enabled()                                 │
│   100 │   │   │   args[0]._fwd_used_autocast = False                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/linear.py:61 in forward            │
│                                                                                                  │
│    58 │   │   │   # fused op is marginally faster                                                │
│    59 │   │   │   ret = torch.addmm(bias, input, weight.t())                                     │
│    60 │   │   else:                                                                              │
│ ❱  61 │   │   │   output = input.matmul(weight.t())                                              │
│    62 │   │   │   if bias is not None:                                                           │
│    63 │   │   │   │   output += bias                                                             │
│    64 │   │   │   ret = output                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
stas00 commented 1 year ago

Super! I'm able to reproduce this on a single gpu and without deepspeed, so deepspeed is not at fault here.

So drop deepspeed, switch to a single gpu, and step through the first training step with a debugger.

Now, using a single gpu and with deepspeed removed completely, you will get the same problem.

The problem is indicated by multiple lines of:

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

That usually indicates a bug wrt tensor indices - either in your custom code or in the trainer.

So after removing the deepspeed config, run the otherwise-same command line (you can continue using the deepspeed launcher - it has nothing to do with the deepspeed integration):

rm -rf save_dir; CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 deepspeed train.py \
    --name ul2-test \
    --model yhavinga/ul2-base-en-nl \
    --train demo_train.json \
    --val demo_val.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir save_dir \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 8 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --bf16

and you start getting a usable traceback:

Traceback (most recent call last):
  File "train.py", line 167, in <module>
    trained_model = trainer.train_model()
  File "train.py", line 80, in train_model
    self.trainer.train()
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 1557, in train
    return inner_training_loop(
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 1808, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 2561, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/trainer.py", line 2593, in compute_loss
    outputs = model(**inputs)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/models/t5/modeling_t5.py", line 1623, in forward
    encoder_outputs = self.encoder(
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master-2/src/transformers/models/t5/modeling_t5.py", line 1000, in forward
    hidden_states = self.dropout(inputs_embeds)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 59, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that's current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.

So the failure appears to be inside dropout. Unless you'd like to spend some time with a debugger and get to the root of it, it's probably best to close this issue and start a new one that's devoid of deepspeed, provide all the repro details in the OP, and ask the t5 maintainers to figure it out. Most likely it has something to do with the tensor shapes or shape manipulation - it's hard to tell without a closer look.

I'm currently working on another project, so while I'm always happy to jump in on deepspeed issues, which are very rare, I won't have time at the moment to work on other issues.

stas00 commented 1 year ago

I found one report with the same error, but I'm not sure if it's related: https://github.com/pytorch/pytorch/issues/91950

I was also able to reproduce this issue with pt-1.10 and 1.11 - so it's unlikely to be a recent pytorch regression; almost certainly something is off in the code.

michaelroyzen commented 1 year ago

Thank you @stas00

stas00 commented 1 year ago

I'm a sucker for a difficult problem, so here you go - I stepped through with a debugger. Have a look at the snapshot - your input_ids are way too big: (debugger screenshot showing a corrupted, enormous input_ids tensor)

michaelroyzen commented 1 year ago

Thank you. I see -- how is that possible? Do you think it's a bf16 issue?

Update: the inputs seem to be fine on my end:

{'input_ids': tensor([[ 1150,   268,  2522,   267,  1231,  3634,   263, 32132,  3634,   334,
          3113,   264,   314,   279,   321,   316,     1]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'labels': tensor([[4306,  264,  314,  279,  321,  316,    1]])}
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.             
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.             
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.             
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.             
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [43,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.             
../aten/src/ATen/native/cuda/Indexing
michaelroyzen commented 1 year ago

@younesbelkada Could you take a look please? UL2 is broken by a script that works for flan-t5 and other seq2seq models.

stas00 commented 1 year ago

Yes, they are ok at the outputs = model(**inputs) frame and then are borked by the time dropout runs, but the corruption happens much sooner. I will have a look.

It breaks somewhere inside T5Stack.forward

stas00 commented 1 year ago

ok, it has to do with the size of the embedding matrix. In this case it's 32128x768

but your input_ids contain values higher than 32128-1:

print(max(input_ids.flatten()))

gives 32132

if I hack your code to do:

        input_ids = input_ids % 32127

then everything works.

Now that you understand what the problem is I trust you can unravel the rest?

Most likely your tokenizer vocab isn't matching the vocab dimension of the embedding matrix.
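A quick sanity check along these lines (a sketch, assuming a standard model/tokenizer pair loaded via from_pretrained - adapt the names to your script):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "yhavinga/ul2-base-en-nl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# rows in the embedding matrix (32128 in this case)
vocab_rows = model.get_input_embeddings().weight.shape[0]
# highest token id the tokenizer can actually emit
max_token_id = max(tokenizer.get_vocab().values())

if max_token_id >= vocab_rows:
    # grow the embedding matrix so every tokenizer id maps to a valid row
    model.resize_token_embeddings(len(tokenizer))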

It's sad that pytorch doesn't give a user-friendly error. Edit: actually it does on cpu, but not on cuda.
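A toy illustration of the difference (not the reporter's code, just the failure mode in isolation):

import torch

emb = torch.nn.Embedding(10, 4)  # embedding matrix with 10 rows
bad_ids = torch.tensor([3, 42])  # 42 is out of range

try:
    emb(bad_ids)  # on cpu: a clean "IndexError: index out of range in self"
except IndexError as e:
    print(e)

# on cuda the same lookup trips a device-side assert instead, and because of
# async execution it often surfaces later, at a seemingly unrelated call:
# emb.cuda()(bad_ids.cuda())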

p.s. the corrupt, huge input_ids appeared because pytorch had already blown its head off, but due to the default async nature the body was still thinking it owned a head. That indexSelectLargeIndex cuda error is where things first broke, not where the traceback was pointing.

The blowup happened here:

https://github.com/huggingface/transformers/blob/bc44e947f371924db854a460484ec46c95e50a35/src/transformers/models/t5/modeling_t5.py#L954-L956

stas00 commented 1 year ago

The other debug technique is to make the gpus disappear and run on cpu, using the CUDA_VISIBLE_DEVICES="" env var. You usually get much better errors that way.

But not all programs can handle this transition transparently. In the case of your program it doesn't work due to hardcoded gpu code, and some custom gpu kernels will of course not run on cpu.
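One way to keep a script friendly to this technique is to avoid hardcoding the device (a generic pattern, not taken from the reporter's train.py):

import torch

# resolves to cpu when CUDA_VISIBLE_DEVICES="" hides all the gpus
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 2).to(device)             # stand-in for the real model
batch = {"input_ids": torch.randn(4, 8).to(device)}  # stand-in for a real batch
output = model(batch["input_ids"])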

michaelroyzen commented 1 year ago

Thank you so much, Stas!

michaelroyzen commented 1 year ago

Funny enough, there still is an issue with google/ul2 (the 20B param model) even though the smaller one runs fine now.

[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)                                     
terminate called after throwing an instance of 'c10::Error'                                                                                         
  what():  CUDA error: an illegal memory access was encountered                                                                                     
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.                              
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.      

Could you please take another look?

stas00 commented 1 year ago

but what did you change to fix the smaller one? I hope you didn't use my % hack - it was just to show you what the problem was - it of course wasn't meant to be a solution - apologies if it wasn't obvious.

the larger model most likely has a different vocab size, so you really need to figure out how your setup reads the config and sets up the tokenizer - usually this is mostly done for you, but since you wrote custom code, this is where you'd check.

First make this small model work correctly w/o hardcoding any numbers - then move onto the large one and most likely it'll just work.

stas00 commented 1 year ago

I'm requesting to make this recurring experience of embedding lookup explosion on cuda to be less painful for the users here: https://github.com/pytorch/pytorch/issues/93880

michaelroyzen commented 1 year ago

I called model.resize_token_embeddings(len(tokenizer)) (which I think is a more general solution than the % hack) and it worked on the smaller model. It doesn't work on the larger model, which has the same vocabulary size of 32128. The "CUDA error: an illegal memory access was encountered" failure on the larger model always looked different from the one seen on the smaller model. I think something else is going on here.

stas00 commented 1 year ago

It's very possible that you have a multitude of errors. Please ensure that you use the fixed version that you validated working with the smaller model.

I think I have already asked you to show me the full traceback with CUDA_LAUNCH_BLOCKING=1 and it wasn't telling us anything useful. This feature is also broken in recent NCCL versions.

can you share the fixed code?

michaelroyzen commented 1 year ago

Yes, the CUDA traceback is completely useless. I've updated train.py in s3://phind-demo with the latest version.

Here are all the files for your reference (only train.py has been modified).

I am attempting to run

deepspeed train.py \
    --name ul2-test \
    --model google/ul2 \
    --train demo_train.json \
    --val demo_val.json \
    --max_input_len 128 \
    --max_target_len 512 \
    --save_dir <my save dir> \
    --num_epochs 3 \
    --learning_rate 2e-4 \
    --eval_steps 3000 \
    --gradient_accumulation_steps 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --deepspeed ds_config.json

but using --model yhavinga/ul2-base-en-nl now works just fine.

Thanks again.

stas00 commented 1 year ago

ok, I was able to reproduce the problem and figured out the cause and the fix.

This time deepspeed was at fault (not the integration).

The cause is this setting in ds_config.json:

      "sub_group_size": 1e12,

Set it to 1e9 as it's recommended in the docs and everything will work.
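For reference, the relevant part of ds_config.json would look something like this (a sketch of a typical ZeRO-3 block - sub_group_size is the key that matters here; the surrounding values are common defaults, not copied from the reporter's file):

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}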

It's probably a bug in some deepspeed or pytorch cuda kernel that doesn't check its memory allocations, and clearly this one is too big. Surely users shouldn't go through such hell because they made an uninformed choice about an obscure optimization setting. (I personally don't understand all of these and thus never touch them; they probably shouldn't even be in the default config file.)

To help future users please file a bug report at https://github.com/microsoft/DeepSpeed/issues

And say that when you use

      "sub_group_size": 1e12,

on an 8x 80GB A100 gpu node, deepspeed segfaults, with:

[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)                                            
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.                                     
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):                                          
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f49bd1a6457 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)        
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f49bd1703ec in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f49bd246c64 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e0dc (0x7f49bd21e0dc in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)                                 
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f49bd221054 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)   
frame #5: <unknown function> + 0x4f6823 (0x7f49aa4ab823 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                            
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f49bd1869e0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)                            
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f49bd186af9 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)                              
frame #8: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7f49aa4add1b in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x8c (0x7f491bf3ae8c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)   
frame #10: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x9 (0x7f491bf3b349 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)   
frame #11: <unknown function> + 0xbe302c (0x7f49aab9802c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                           
frame #12: <unknown function> + 0x3e4272 (0x7f49aa399272 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                           
frame #13: <unknown function> + 0x3e51af (0x7f49aa39a1af in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)                           
frame #14: <unknown function> + 0xe5698 (0x56347c18d698 in /opt/conda/bin/python3.7)                                                                       
frame #15: <unknown function> + 0x1f7b89 (0x56347c29fb89 in /opt/conda/bin/python3.7)                                                                      
frame #16: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #17: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #18: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #19: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #20: _PyEval_EvalFrameDefault + 0xb2a (0x56347c26010a in /opt/conda/bin/python3.7)                                                                   
frame #21: <unknown function> + 0x1f7b66 (0x56347c29fb66 in /opt/conda/bin/python3.7)                                                                      
frame #22: _PyFunction_FastCallDict + 0xaef (0x56347c1a78cf in /opt/conda/bin/python3.7)                                                                   
frame #23: _PyEval_EvalFrameDefault + 0x1f86 (0x56347c261566 in /opt/conda/bin/python3.7)                                                                  
frame #24: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #25: _PyFunction_FastCallKeywords + 0x320 (0x56347c1d7570 in /opt/conda/bin/python3.7)                                                               
frame #26: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #27: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #28: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #29: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #30: _PyEval_EvalFrameDefault + 0xb2a (0x56347c26010a in /opt/conda/bin/python3.7)                                                                   
frame #31: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #32: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #33: <unknown function> + 0x185ea4 (0x56347c22dea4 in /opt/conda/bin/python3.7)                                                                      
frame #34: _PyObject_FastCallKeywords + 0x18c (0x56347c238b8c in /opt/conda/bin/python3.7)                                                                 
frame #35: <unknown function> + 0x191f79 (0x56347c239f79 in /opt/conda/bin/python3.7)                                                                      
frame #36: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                                                                  
frame #37: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #38: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #39: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #40: _PyFunction_FastCallKeywords + 0x184 (0x56347c1d73d4 in /opt/conda/bin/python3.7)                                                               
frame #41: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #42: _PyEval_EvalFrameDefault + 0x4c8a (0x56347c26426a in /opt/conda/bin/python3.7)                                                                  
frame #43: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #44: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #45: <unknown function> + 0x185ea4 (0x56347c22dea4 in /opt/conda/bin/python3.7)                                                                      
frame #46: _PyObject_FastCallKeywords + 0x18c (0x56347c238b8c in /opt/conda/bin/python3.7)                                                                 
frame #47: <unknown function> + 0x191f79 (0x56347c239f79 in /opt/conda/bin/python3.7)                                                                      
frame #48: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                                                                  
frame #49: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #50: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #51: _PyEval_EvalFrameDefault + 0x1f86 (0x56347c261566 in /opt/conda/bin/python3.7)                                                                  
frame #52: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #53: _PyFunction_FastCallKeywords + 0x320 (0x56347c1d7570 in /opt/conda/bin/python3.7)                                                               
frame #54: <unknown function> + 0x191de8 (0x56347c239de8 in /opt/conda/bin/python3.7)                                                                      
frame #55: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                                                                  
frame #56: _PyEval_EvalCodeWithName + 0x33d (0x56347c1a5ccd in /opt/conda/bin/python3.7)                                                                   
frame #57: _PyFunction_FastCallDict + 0x6a0 (0x56347c1a7480 in /opt/conda/bin/python3.7)                                                                   
frame #58: <unknown function> + 0x185a63 (0x56347c22da63 in /opt/conda/bin/python3.7)                                                                      
frame #59: PyObject_Call + 0x6c (0x56347c1b09dc in /opt/conda/bin/python3.7)
frame #60: <unknown function> + 0x21d3e7 (0x56347c2c53e7 in /opt/conda/bin/python3.7)                                                                      
frame #61: _PyObject_FastCallKeywords + 0x3cb (0x56347c238dcb in /opt/conda/bin/python3.7)                                                                 
frame #62: <unknown function> + 0x191f79 (0x56347c239f79 in /opt/conda/bin/python3.7)                                                                      
frame #63: _PyEval_EvalFrameDefault + 0x16bb (0x56347c260c9b in /opt/conda/bin/python3.7)                     

and when you use "sub_group_size": 1e9, all works.

Which most likely means the segfault happens when memory gets allocated, or immediately after, and some protection against such segfaults is needed.

And point to this thread for more details.

Offer to provide easy-to-run repro details. But perhaps they won't need it, and this segfault trace is sufficient info for whoever wrote the code.

michaelroyzen commented 1 year ago

Thank you so much, Stas. You're right that sub_group_size is 1e9 in the HF DeepSpeed integration docs, but there's a sample config with 1e12 on the DeepSpeed ZeRO doc page (https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) and I think that's where I got it from. I'll open up an issue in DeepSpeed. Thanks again for going above and beyond.

stas00 commented 1 year ago

Wonderful. And please report that doc issue too. Thank you, Michael.

stas00 commented 1 year ago

For posterity, @ngimel kindly shared that pt-1.13.0 and 1.13.1 are buggy wrt disappearing error messages in cuda; this has been fixed in https://github.com/pytorch/pytorch/issues/91758 - and the fix is already available in nightlies.

So if you run into situations like this issue and the cuda error is incomprehensible, please use either torch<1.13, which doesn't have this problem, or whatever the next version will be - pt-2.0.0 probably (or a nightly if you're brave).