Open abhimasand opened 8 months ago
You can take a look at this post on how to do distributed training: https://www.philschmid.de/sagemaker-fsdp-gpt
Hi @philschmid
I tried following the guide you linked and modified my script to include FSDP. However, when I run the script, I hit a similar issue as above:
raise ImportError(
ImportError: Flash Attention 2.0 is not available. Please refer to the documentation of https://github.com/Dao-AILab/flash-attention for installing it.
Downloading shards: 100%|██████████| 8/8 [05:23<00:00, 35.08s/it]
Downloading shards: 100%|██████████| 8/8 [05:23<00:00, 40.39s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 62 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 65) of binary: /opt/conda/bin/python
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Building flash attention 2 is erroring out in both of the distributed settings
What GPUs are you using?
I am using the g5.12xlarge instance on sagemaker, which has 4 A10 gpus (I checked that torchrun did in fact have the correct args after running and the n_processes was 4 and node was 1)
And how are you installing Flash Attention? You said you made modifications. Do you have the requirements.txt in your directory, and are you installing it using os.system at the beginning of the file?
I thought flash attention was installed automatically when you build a huggingface estimator via sagemaker since there was no explicit installation of flash attention in your mistral training script.
Yes, I have requirements.txt in my path and am installing it at the beginning of the file as well.
That's what installs FA 2.
Nope, it's still present.
# upgrade flash attention here
try:
    os.system("pip install flash-attn --no-build-isolation --upgrade")
except:
    print("flash-attn failed to install")
I checked the logs as well. There was no error message, and the code works up to the model load line:
model = AutoModelForCausalLM.from_pretrained(
    script_args.model_id,
    use_cache=False if training_args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
    device_map="auto",
    use_flash_attention_2=script_args.use_flash_attn,
    quantization_config=bnb_config,
)
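A note on the install snippet above: `os.system` returns the command's exit status instead of raising, so the `try/except` around it never fires and a failed `pip install` passes silently. A minimal sketch of a stricter check follows, assuming the goal is to fail fast before the model load. The helper names `run_command`, `is_importable`, and `ensure_flash_attn` are made up for illustration and are not part of the original script.

```python
import importlib.util
import subprocess
import sys


def run_command(cmd: list[str]) -> bool:
    """Run a command and return True only if it exited with status 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tail of stderr so the failure shows up in the job logs.
        print(f"command failed ({result.returncode}): {' '.join(cmd)}")
        print(result.stderr[-2000:])
    return result.returncode == 0


def is_importable(module_name: str) -> bool:
    """Check that a top-level module can be found without importing it."""
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


def ensure_flash_attn() -> None:
    """Hypothetical helper: install flash-attn and stop early if that fails."""
    cmd = [sys.executable, "-m", "pip", "install",
           "flash-attn", "--no-build-isolation", "--upgrade"]
    if not run_command(cmd) or not is_importable("flash_attn"):
        raise RuntimeError("flash-attn is not installed; aborting before model load")
```

Calling `ensure_flash_attn()` at the top of the training script would turn a silent install failure into an explicit error with the pip output attached.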
What container/versions are you using to launch the job?
Here is my entire config
model_id = "HuggingFaceH4/zephyr-7b-beta"
hyperparameters = {
    'model_id': model_id,                              # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',     # path where sagemaker will save training dataset
    'num_train_epochs': 6,                             # number of training epochs
    'per_device_train_batch_size': 6,                  # batch size for training
    'gradient_accumulation_steps': 2,                  # number of update steps to accumulate
    'gradient_checkpointing': True,                    # save memory but slower backward pass
    'bf16': True,                                      # use bfloat16 precision
    'tf32': True,                                      # use tf32 precision
    'learning_rate': 2e-4,                             # learning rate
    'max_grad_norm': 0.3,                              # maximum norm (for gradient clipping)
    'warmup_ratio': 0.03,                              # warmup ratio
    'save_strategy': "epoch",                          # save strategy for checkpoints
    "logging_steps": 10,                               # log every x steps
    'merge_adapters': True,                            # whether to merge LoRA into the model (needs more memory)
    'use_flash_attn': True,                            # whether to use Flash Attention
    'output_dir': '/tmp/run',                          # output directory, where to save assets during training
                                                       # could be used for checkpointing. The final trained
                                                       # model will always be saved to s3 at the end of training
    'ddp_timeout': 7200,
    'fsdp': '"full_shard auto_wrap"',
}
huggingface_estimator = HuggingFace(
    entry_point = 'run_qlora_torchrun.py',      # train script
    source_dir = '../scripts',                  # directory which includes all the files needed for training
    instance_type = 'ml.g5.12xlarge',           # instance type used for the training job
    instance_count = 1,                         # the number of instances used for training
    max_run = 2*24*60*60,                       # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name = job_name,                   # the name of the training job
    role = role,                                # IAM role used in training job to access AWS resources, e.g. S3
    volume_size = 300,                          # the size of the EBS volume in GB
    transformers_version = '4.28.1',            # the transformers version used in the training job
    pytorch_version = '2.0.0',                  # the pytorch version used in the training job
    py_version = 'py310',                       # the python version used in the training job
    hyperparameters = hyperparameters,          # the hyperparameters passed to the training job
    environment = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" },  # set env variable to cache models in /tmp
    disable_output_compression = True,          # do not compress output to save training time and cost
    distribution = {"torch_distributed": {"enabled": True}}    # enable torchrun
)
The funny thing is that as soon as I enable a distributed method, whether via torchrun or SageMaker model parallel, I get the Flash Attention error. I think you should be able to reproduce the error easily yourself by enabling distribution in your Mistral training script.
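One hypothesis, not confirmed anywhere in this thread: torchrun starts one copy of the training script per GPU, so an install command at the top of the script runs four times concurrently on a g5.12xlarge, and the copies may interfere with each other. A rough sketch of letting only local rank 0 install while the other workers wait on a marker file is shown below. It relies on the `LOCAL_RANK` environment variable that torchrun sets; the function name `install_once` and the marker path are assumptions for illustration.

```python
import os
import time
from pathlib import Path
from typing import Callable


def local_rank() -> int:
    """torchrun exports LOCAL_RANK for each worker; default to 0 when unset."""
    return int(os.environ.get("LOCAL_RANK", "0"))


def install_once(install_fn: Callable[[], bool], marker_path: str,
                 timeout_s: float = 1800.0, poll_s: float = 5.0) -> bool:
    """Run install_fn on local rank 0 only; other ranks wait for its result.

    The marker file is assumed not to exist at job start (fresh container).
    """
    marker = Path(marker_path)
    if local_rank() == 0:
        ok = install_fn()
        # Write to a temp file and rename, so waiters never read a partial file.
        tmp = Path(str(marker) + ".tmp")
        tmp.write_text("ok" if ok else "failed")
        os.replace(tmp, marker)
        return ok

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker.exists():
            return marker.read_text() == "ok"
        time.sleep(poll_s)
    return False  # rank 0 never reported back within the timeout
```

In the training script this would wrap whatever install step is used, for example `install_once(my_install, "/tmp/flash_attn_install.done")`, where `my_install` is a hypothetical function returning True on success.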
I'll try to find time for this. Maybe it's not possible to install FA the way I did in the script when using a distributed setup.
Thank you very much, I'll also keep trying from my end and will update here if something works!
You could take the existing container, build a new one from it with FA installed inside, push it to ECR, and then use it directly.
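As a sketch of the last step of that suggestion: the SageMaker `HuggingFace` estimator accepts an `image_uri` argument that points the job at a custom image instead of the prebuilt one. The account ID, region, repository name, and tag below are placeholders, and `build_estimator` is a hypothetical helper. Depending on the SDK version, some version arguments may still be required alongside `image_uri`.

```python
def custom_image_uri(account_id: str, region: str,
                     repository: str, tag: str = "latest") -> str:
    """Build the standard ECR image URI for a pushed training image."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_estimator(image_uri: str, role: str, hyperparameters: dict):
    """Hypothetical helper: create the estimator against a custom image."""
    # Imported lazily so the URI helper works without the SageMaker SDK.
    from sagemaker.huggingface import HuggingFace

    return HuggingFace(
        entry_point="run_qlora_torchrun.py",   # train script from the config above
        source_dir="../scripts",
        instance_type="ml.g5.12xlarge",
        instance_count=1,
        role=role,
        image_uri=image_uri,                   # custom image with FA preinstalled
        hyperparameters=hyperparameters,
        distribution={"torch_distributed": {"enabled": True}},
    )


# Placeholder values; replace with your own account, region, and repository.
uri = custom_image_uri("123456789012", "us-east-1", "hf-pytorch-training-fa2", "v1")
```

With Flash Attention baked into the image, the install step at the top of the training script can be dropped.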
I'll try to do that, it's this dockerfile that I'll have to rebuild right: https://github.com/aws/deep-learning-containers/blob/master/huggingface/pytorch/training/docker/2.0/py3/cu118/Dockerfile.gpu?
Hi @philschmid,
When I try to increase the chunk length beyond 2048, training fails with an OOM error on g5.4xlarge. It totally makes sense why that happens. My question is how you would recommend using the g5.12xlarge instance, which has 4x the GPUs and consequently 4x the VRAM, to train the model.
I found this resource on HF for doing model parallelism: https://huggingface.co/docs/sagemaker/train#distributed-training. However, when I tried using it with the following config, I ran into the following error:
UnexpectedStatusException: Error for Training job huggingface-qlora-HuggingFaceH4-zephyr--2023-11-03-16-29-02-663: Failed. Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 134 ErrorMessage "ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/flash_attn-0.2.8.dist-info/'
Is there any way to solve this? And is the model parallelism the method you would recommend to use the g5.12xlarge instance?