philschmid / llm-sagemaker-sample


Using a chunk length greater than 2048 with packing leads to an OOM error #4

Open abhimasand opened 8 months ago

abhimasand commented 8 months ago

Hi @philschmid,

When I try to increase the chunk length beyond 2048, training fails with an OOM error on a g5.4xlarge. It makes sense why that happens; my question is how you would recommend using the g5.12xlarge instance, which has 4x the GPUs and consequently 4x the VRAM, to train the model.

I found this resource on HF for doing model parallelism: https://huggingface.co/docs/sagemaker/train#distributed-training. However, when I tried using it with the following config,

mpi_options = {
    "enabled" : True,
    "processes_per_host" : 4
}

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "ddp_dist_backend": "auto", #OR "nccl" to disable SMDDP Collectives
        "partitions": 2,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

I ran into the following error

UnexpectedStatusException: Error for Training job huggingface-qlora-HuggingFaceH4-zephyr--2023-11-03-16-29-02-663: Failed. Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 134 ErrorMessage "ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/flash_attn-0.2.8.dist-info/'

Is there any way to solve this? And is the model parallelism the method you would recommend to use the g5.12xlarge instance?

philschmid commented 8 months ago

You can take a look at this post on how to do distributed training: https://www.philschmid.de/sagemaker-fsdp-gpt
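In short, the pattern from that post (as a rough sketch, not the exact code) is to let the estimator launch the training script with torchrun and turn on FSDP through the training arguments. Entry point, role, S3 path, and hyperparameter values below are placeholders:

from sagemaker.huggingface import HuggingFace

# Illustrative sketch only: torchrun starts one process per GPU on the instance,
# and the Trainer's `fsdp` argument shards the model across them.
huggingface_estimator = HuggingFace(
    entry_point          = "run_qlora.py",                  # placeholder script name
    source_dir           = "scripts",
    instance_type        = "ml.g5.12xlarge",                 # 4x NVIDIA A10G, 24 GB each
    instance_count       = 1,
    role                 = "arn:aws:iam::123456789012:role/sagemaker-execution-role",  # placeholder
    transformers_version = "4.28.1",
    pytorch_version      = "2.0.0",
    py_version           = "py310",
    hyperparameters      = {
        "model_id": "HuggingFaceH4/zephyr-7b-beta",
        "fsdp": '"full_shard auto_wrap"',                    # shard params, grads and optimizer state
        "gradient_checkpointing": True,
    },
    distribution         = {"torch_distributed": {"enabled": True}},  # launch the script with torchrun
)
huggingface_estimator.fit({"training": "s3://my-bucket/train"})       # placeholder S3 path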

abhimasand commented 8 months ago

Hi @philschmid

I tried following the guide you linked and modified my script to include FSDP; however, while running the script, I ran into a similar issue as above:

raise ImportError(
 ImportError: Flash Attention 2.0 is not available. Please refer to the documentation of https://github.com/Dao-AILab/flash-attention for installing it.
 Downloading shards: 100%|██████████| 8/8 [05:23<00:00, 35.08s/it]
 Downloading shards: 100%|██████████| 8/8 [05:23<00:00, 40.39s/it]
 WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 62 closing signal SIGTERM
 WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63 closing signal SIGTERM
 WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64 closing signal SIGTERM
 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 65) of binary: /opt/conda/bin/python
 File "/opt/conda/bin/torchrun", line 33, in <module>
 sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
 return f(*args, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
 run(args)
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
 elastic_launch(
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
 return launch_agent(self._config, self._entrypoint, list(args))
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
 raise ChildFailedError(
 torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Building Flash Attention 2 errors out in both of the distributed settings.

philschmid commented 8 months ago

What GPUs are you using?

abhimasand commented 8 months ago

I am using the g5.12xlarge instance on SageMaker, which has 4 A10G GPUs (I checked that torchrun did in fact have the correct args after launching: the number of processes was 4 and the number of nodes was 1).

philschmid commented 7 months ago

And how are you installing Flash Attention? You said you made modifications? Do you have the requirements.txt in your directory, and are you installing it using os.system at the beginning of the file?

abhimasand commented 7 months ago

I thought Flash Attention was installed automatically when you build a Hugging Face estimator via SageMaker, since there was no explicit installation of Flash Attention in your Mistral training script.

Yes, I have requirements.txt in my path and I am also installing it at the beginning of the file.

philschmid commented 7 months ago

Did you miss this? https://github.com/philschmid/llm-sagemaker-sample/blob/65a0b04aa7b66c836346316e867e8e3bff27e6de/scripts/run_qlora.py#L6

philschmid commented 7 months ago

That's what installs FA 2.

abhimasand commented 7 months ago

Nope, it's still present.

# upgrade flash attention here
try:
    os.system("pip install flash-attn --no-build-isolation --upgrade")
except:
    print("flash-attn failed to install")

I checked the logs as well; there was no error message, and the code works until the model load line:

model = AutoModelForCausalLM.from_pretrained(
        script_args.model_id,
        use_cache=False if training_args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map="auto",
        use_flash_attention_2=script_args.use_flash_attn,
        quantization_config=bnb_config,
    )
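Since the crash comes from from_pretrained raising when Flash Attention 2 is not importable, one way to make the failure softer (a sketch, not part of the original script; it reuses the names from the snippet above) is to only pass the flag when the package is actually importable on that rank:

import importlib.util

# Sketch: fall back to standard attention if flash_attn did not end up installed on
# this rank, instead of letting from_pretrained raise the ImportError above.
flash_attn_available = importlib.util.find_spec("flash_attn") is not None
if script_args.use_flash_attn and not flash_attn_available:
    print("flash-attn not importable on this rank, falling back to standard attention")

model = AutoModelForCausalLM.from_pretrained(
    script_args.model_id,
    use_cache=False if training_args.gradient_checkpointing else True,
    device_map="auto",
    use_flash_attention_2=script_args.use_flash_attn and flash_attn_available,
    quantization_config=bnb_config,
)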
philschmid commented 7 months ago

What container/versions are you using to launch the job?

abhimasand commented 7 months ago

Here is my entire config

model_id = "HuggingFaceH4/zephyr-7b-beta" 

hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'num_train_epochs': 6,                            # number of training epochs
  'per_device_train_batch_size': 6,                 # batch size for training
  'gradient_accumulation_steps': 2,                 # Number of updates steps to accumulate 
  'gradient_checkpointing': True,                   # save memory but slower backward pass
  'bf16': True,                                     # use bfloat16 precision
  'tf32': True,                                     # use tf32 precision
  'learning_rate': 2e-4,                            # learning rate
  'max_grad_norm': 0.3,                             # Maximum norm (for gradient clipping)
  'warmup_ratio': 0.03,                             # warmup ratio
  'save_strategy': "epoch",                         # save strategy for checkpoints
  "logging_steps": 10,                              # log every x steps
  'merge_adapters': True,                           # whether to merge LoRA into the model (needs more memory)
  'use_flash_attn': True,                           # Whether to use Flash Attention
  'output_dir': '/tmp/run',                         # output directory, where to save assets during training
                                                    # could be used for checkpointing. The final trained
                                                    # model will always be saved to s3 at the end of training 
  'ddp_timeout': 7200,
  'fsdp': '"full_shard auto_wrap"',
}
huggingface_estimator = HuggingFace(
    entry_point          = 'run_qlora_torchrun.py',    # train script
    source_dir           = '../scripts',      # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28.1',            # the transformers version used in the training job
    pytorch_version      = '2.0.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    disable_output_compression = True,         # not compress output to save training time and cost
    distribution={"torch_distributed": {"enabled": True}} # enable torchrun
)
abhimasand commented 7 months ago

The funny thing is that as soon as I enable either distributed method, whether via torchrun or SageMaker model parallel, I get the Flash Attention error. I think you may be able to reproduce it easily yourself by just enabling distribution on your Mistral training script.

philschmid commented 7 months ago

I'll try and find time for this. Maybe it's not possible to install FA the way I did in the script when using a distributed setup.
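One thing worth checking: with torchrun on a g5.12xlarge the training script, and therefore the os.system pip install, runs once per GPU, so four pip processes upgrade flash-attn concurrently. That could explain the half-written flash_attn-0.2.8.dist-info directory in the first error. A possible workaround (just a sketch, untested) is to let only local rank 0 install and have the other ranks wait until the package is importable:

import importlib.util
import os
import time

def ensure_flash_attn(timeout_s: int = 900) -> bool:
    """Install flash-attn from local rank 0 only; other ranks poll until it is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return True
    # torchrun sets LOCAL_RANK for each worker process on the instance.
    if os.environ.get("LOCAL_RANK", "0") == "0":
        os.system("pip install flash-attn --no-build-isolation --upgrade")
    # All ranks (including 0, in case the install failed) wait for the package to appear.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        importlib.invalidate_caches()
        if importlib.util.find_spec("flash_attn") is not None:
            return True
        time.sleep(10)
    return False

if not ensure_flash_attn():
    print("flash-attn is not available, falling back to standard attention")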

abhimasand commented 7 months ago

Thank you very much, I'll also keep trying from my end and will update here if something works!

philschmid commented 7 months ago

You could take the existing container, create a new one from it with FA installed inside, push it to ECR, and then use it directly.
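Roughly, that would mean extending the AWS Hugging Face training DLC with a "RUN pip install flash-attn --no-build-isolation" layer, pushing the resulting image to ECR, and pointing the estimator at it. A sketch under those assumptions (the ECR URI is a placeholder; role and hyperparameters stay as in the config above):

from sagemaker.huggingface import HuggingFace

# Sketch: use a custom image, built FROM the AWS HF training DLC with flash-attn
# pre-installed, instead of installing it at runtime inside the training script.
huggingface_estimator = HuggingFace(
    entry_point     = "run_qlora_torchrun.py",
    source_dir      = "../scripts",
    instance_type   = "ml.g5.12xlarge",
    instance_count  = 1,
    role            = role,
    image_uri       = "123456789012.dkr.ecr.us-east-1.amazonaws.com/hf-training-flash-attn:latest",  # placeholder ECR image
    py_version      = "py310",
    hyperparameters = hyperparameters,
    distribution    = {"torch_distributed": {"enabled": True}},
)
# When image_uri is set, the transformers_version/pytorch_version arguments should
# not be needed, since the framework comes from the custom image itself.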

abhimasand commented 7 months ago

I'll try to do that. It's this Dockerfile that I'll have to rebuild, right? https://github.com/aws/deep-learning-containers/blob/master/huggingface/pytorch/training/docker/2.0/py3/cu118/Dockerfile.gpu