pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Bug when I run on single GPU #1694

Open kailashg26 opened 1 month ago

kailashg26 commented 1 month ago

Command: tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device

Output:

INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 64
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /tmp/lora_finetune_output
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  apply_lora_to_mlp: false
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  lora_dropout: 0.0
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /tmp/lora_finetune_output
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /tmp/lora_finetune_output/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 5
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3188944798. Local seed is seed + rank = 3188944798 + 0
Writing logs to /tmp/lora_finetune_output/log_1727379753.txt
Traceback (most recent call last):
  _File "/home/kailash/miniconda3/envs/llm/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/run.py", line 185, in _run_cmd
    self._run_single_device(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/run.py", line 94, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 739, in <module>
    sys.exit(recipe_main())
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 733, in recipe_main
    recipe.setup(cfg=cfg)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 215, in setup
    checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 148, in load_checkpoint
    self._checkpointer = config.instantiate(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 106, in instantiate
    return _instantiate_node(OmegaConf.to_object(config), *args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 31, in _instantiate_node
    return _create_component(_component_, args, kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_checkpointer.py", line 348, in __init__
    self._checkpoint_paths = self._validate_hf_checkpoint_files(checkpoint_files)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_checkpointer.py", line 389, in _validate_hf_checkpoint_files
    checkpoint_path = get_path(self._checkpoint_dir, f)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_utils.py", line 95, in get_path
    raise ValueError(f"No file with name: {filename} found in {input_dir}.")
ValueError: No file with name: model-00001-of-00004.safetensors found in /tmp/Meta-Llama-3.1-8B-Instruct.

Can anyone help me with this?

joecummings commented 1 month ago

Hi @kailashg26 - did you download the model to the location specified at the top of the config?

tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
kailashg26 commented 1 month ago

Hi @joecummings

I did try that now. But unfortunately, I get an error:

tune download: error: Failed to download meta-llama/Meta-Llama-3.1-8B-Instruct with error: 'An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.' and traceback:
Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/0e9e39f249a16976918f6564b8830bc894c89659/.gitattributes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1746, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 468, in hf_raise_for_status
    raise _format(HfHubHTTPError, message, response) from e
huggingface_hub.errors.HfHubHTTPError: (Request ID: Root=1-66f5ccd5-3d403ea47109c46d5f04a9c4;0a3f40fb-d4e3-400c-b9b6-18fd89e310b7)

403 Forbidden: Please enable access to public gated repositories in your fine-grained token settings to view this repository.. Cannot access content at: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/0e9e39f249a16976918f6564b8830bc894c89659/.gitattributes. Make sure your token has the correct permissions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/download.py", line 126, in _download_cmd
    true_output_dir = snapshot_download(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 290, in snapshot_download
    thread_map(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 264, in _inner_hf_hub_download
    return hf_hub_download(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1212, in hf_hub_download
    return _hf_hub_download_to_local_dir(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1461, in _hf_hub_download_to_local_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1857, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

joecummings commented 1 month ago

The Llama models are "gated". This simply means you need to fill out some information in order to download the model. This can all be done from the model card page on the Hugging Face Hub: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

After you do this, you should see a tag like below that says you've been granted access to the model. The whole process should take less than 10 minutes.

[Screenshot (2024-09-26): Hugging Face model card showing that access to the gated model has been granted]

After this, try the above command again. Should work!
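If the CLI still hits a 403 after access is granted, the usual fix is to make sure a token with access to public gated repos is available locally. A minimal sketch (flag names come from huggingface_hub and tune download --help, so double-check them against your installed versions):

huggingface-cli login   # caches a token with "read access to public gated repos" permission
# or, instead of logging in, pass the token directly to the download command:
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth" \
  --hf-token <YOUR_HF_TOKEN>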

kailashg26 commented 1 month ago

Hi @joecummings I did see that I have been granted access. I had to change some permissions, and it works now. Thanks! Btw, how long does it take to fine-tune Llama 3.1 with the Alpaca dataset? Also, is there any way that I could monitor model-level and system-level events like memory, energy, and memory bandwidth?

ebsmothers commented 1 month ago

@kailashg26 training time will depend on a bunch of things, including whether you're running a full/LoRA/QLoRA finetune, whether you have activation checkpointing enabled, whether you're performing sample packing, batch size, sequence length, how many (and what kind of) devices you're running on, and more. I recently ran some tests and was able to run an epoch of QLoRA training on an A100 (with Llama 3, not 3.1, but it should be similar) in as little as 36 minutes (but again, it depends on all of the above). You can check out slides 23-29 of this presentation for more details. As a starting point, I'd recommend setting compile=True and dataset.packed=True if you want to reduce your training time. For a packed dataset you also need to set tokenizer.max_seq_len. This may require some experimentation depending on how much memory you have; you can try e.g. 2048 as a starting point.
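As a concrete example, those knobs can be passed as CLI overrides to the same config used above (a sketch; 2048 is just the suggested starting value and may need lowering if you hit out-of-memory errors):

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  compile=True \
  dataset.packed=True \
  tokenizer.max_seq_len=2048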

For monitoring system-level metrics, we log tokens/second by default (see here) and will also log peak memory stats if you set log_peak_memory_stats=True. We support different logging backends like WandB, TensorBoard, and Comet if you use any of those. If you also want to log time-to-first-batch or other custom metrics we don't currently support, I'd recommend copying the recipe and then modifying your local version (happy to provide pointers on where/how any particular metric should be logged).

kailashg26 commented 1 month ago

Thanks @ebsmothers I'll take a look at that. Thanks for the pointers. I'll try them and get back if I have questions.

kailashg26 commented 1 month ago

Hi @ebsmothers ,

I successfully executed this command: tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device

I'm wondering how to run inference with and evaluate the output. Any suggestions or documentation that will walk me through it? Additionally, I would also like to understand how the workload is balanced on a single CPU-GPU system, i.e., which portions of the code are CPU- or GPU-bound. It would be great if you could point me to the appropriate documentation for this :)

Thanks and appreciate the help!

ebsmothers commented 1 month ago

@kailashg26 that depends on what you are trying to do. If the full training loop completed there should be a checkpoint saved on your local filesystem. You can evaluate the quality of your fine-tuned model by using it to generate some text or by evaluating on a common benchmark (e.g. using our integration with EleutherAI's eval harness). You can check out our Llama3 tutorial here, which has sections on both of these (everything in there should be equally applicable to Llama 3.1).
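For the eval-harness route, a minimal sketch (the recipe and config names here are from recent torchtune versions and may differ in yours; the harness itself needs a separate install, and the checkpointer should point at your fine-tuned checkpoint directory):

pip install lm_eval
tune run eleuther_eval --config eleuther_evaluation \
  checkpointer.checkpoint_dir=<directory containing your fine-tuned checkpoint>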

Regarding determining whether code is CPU- or GPU-bound, you can use our integration with the PyTorch profiler. Just set profiler.enabled=True in your config. This will output a trace file that you can then view in e.g. Perfetto. The full set of profiler configurations we support can be seen for example here.
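For example, using the profiler section that already appears in the resolved config at the top of this issue:

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  profiler.enabled=True
# traces are written to profiler.output_dir (/tmp/lora_finetune_output/profiling_outputs in this config)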

kailashg26 commented 1 month ago

Hi @ebsmothers,

Thanks for your response. I’m trying to understand the interaction between hardware parameters and workload characteristics. This is my first step toward learning and fine-tuning LLMs. I’m currently focusing on parameters that could help me study the trade-offs between training time, power consumption, energy, and cache metrics (including misses) from a systems perspective.

From an algorithmic perspective, I’m working on a resource-constrained device (a single Nvidia 4090 GPU) and am interested in observing trade-offs between these metrics. I was thinking about varying parameters like batch size and sequence length, but I’m not entirely sure how these affect the underlying hardware. Any insights or documentation on these metrics would be really helpful!

I’m also experimenting with data types, using bf16, but I assume switching to fp32 would result in running out of memory when training a Llama 3.1 8B model. I’m also considering different training approaches (full, LoRA, and QLoRA) and how they impact performance and resource utilization.

Since I’m just getting started, are there any other knobs or parameters you’d recommend studying to better understand LLMs from a systems perspective?

Also, could you please help me to find the below two metrics:

  1. Throughput (tokens per second)
  2. Latency (total response time (TRT)): in this case, the number of seconds it takes to output 100 tokens

Thanks in advance.

felipemello1 commented 1 month ago

@kailashg26 , you may want to take a look at this recent talk by Evan showing the impact of some of these trade offs: https://www.youtube.com/watch?v=43X9E25-Qg0

And this about optimization by Jane: https://www.youtube.com/watch?v=xzBcBJ8_rzM

Also, Weights & Biases is probably your best friend here.

do pip install wandb, create an account on their website, and run your config like this:

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
metric_logger=torchtune.training.metric_logging.WandBLogger \
log_peak_memory_stats=True

That should be a good start

ebsmothers commented 1 month ago

Thanks @felipemello1 for the free PR 😅. @kailashg26 a lot of the techniques from that video are covered in more detail in our docs, and the full slides from that talk can also be found here.

kailashg26 commented 1 month ago

Thanks, this was very helpful! For fine-tuning, if I want to track the execution time for different parts of the code, could you suggest any ways to do that? Or should I just break the code into multiple segments and profile the blocks?

I did enable all the performance-logging knobs and observed this in the profiling outputs when I ran the Gemma 2B model.

[Screenshot: profiler output table listing functions with their CPU and CUDA times]

Could you please let me know what the CPU and CUDA times account for in this case? Also, how should I interpret the functions shown in the screenshot, and what takeaways can I draw from them?

Thanks in advance for all the guidance you have been offering. Appreciate it!

felipemello1 commented 1 month ago

Like Evan mentioned, to see CPU and CUDA execution time, you would have to:

  1. Set profiler.enabled=true
  2. Take the trace and read it in Chrome tracing or Perfetto. If you are not used to reading traces, this may not be trivial to understand. I don't have a good resource to share with you, but I imagine you could find something on YouTube along the lines of "how to read torch traces".

To make things easier, trace only one step so it produces a small file, and run your model for only a few steps (e.g. <10) so it's fast to try out.
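Concretely, something like the following keeps the trace small and the run short (a sketch reusing the profiler keys and max_steps_per_epoch from the config above):

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  profiler.enabled=True \
  profiler.wait_steps=1 profiler.warmup_steps=1 profiler.active_steps=1 \
  max_steps_per_epoch=10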

kailashg26 commented 1 month ago

Thanks! That worked!

I had a doubt regarding max_seq_len being null. Was there a specific reason for this? What is the default length it takes if I don't set dataset.packed=True?

Thanks!

ebsmothers commented 1 month ago

Hi @kailashg26, max_seq_len=null just means that the tokenizer will not truncate the sequences. If you're running without dataset.packed=True, this means that the sequences will be padded to the maximum length in the batch (this is done in the collate function).

kailashg26 commented 1 month ago

I see, thanks. What does compile mean, and why does it tend to reduce the training time?

If I understood things right, torchtune currently supports the following optimizations to reduce fine-tuning time on a single device: parameter-efficient fine-tuning, gradient checkpointing, reduced-precision training (bf16), and dataset packing.

Could you please let me know if I missed any other optimizations?

ebsmothers commented 1 month ago

@kailashg26 compile here means torch.compile. It will compile the model graph and fuse various operations so that they can be run faster. I'd recommend checking out these resources if you want to know a bit more: [1], [2]. I think your summary covers most items we use to speed up training (and of course torch.compile).

A couple of other small points: reduced-precision optimizers like bitsandbytes.optim.AdamW8bit may also give some slight speedup (though this is more relevant for full finetuning than for e.g. LoRA, where very few parameters are trainable). Also, parameter-efficient fine-tuning is not guaranteed to speed up training time universally; e.g., if you are doing QLoRA there are additional upcasts from NF4 to bf16 that need to happen, which makes it a bit slower than vanilla LoRA. But LoRA should definitely be quite a bit faster than full finetuning.
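A sketch of the reduced-precision-optimizer point, for a full finetune (this assumes bitsandbytes is installed, that the optimizer class name from bitsandbytes is available in your version, and that the full-finetune config follows the same naming pattern as the LoRA one used above):

pip install bitsandbytes
tune run full_finetune_single_device --config llama3_1/8B_full_single_device \
  optimizer=bitsandbytes.optim.AdamW8bit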

kailashg26 commented 1 month ago

Thanks for the information @ebsmothers. It helps a lot, and I appreciate the prompt responses as well. So, just curious to understand: typically, when we fine-tune the model with a dataset, the more tokens processed, the better it trains in terms of system-level performance? Or do you suggest performing inference on the trained model and checking the accuracy?

Also, is there any way that I could find the number of tokens processed per watt? I'm gaining a lot of insights from just a week of experiments! Thanks for the interesting presentation and details :)

ebsmothers commented 1 month ago

The more tokens processed, the better it trains in terms of system-level performance? Or do you suggest performing inference on the trained model and checking the accuracy?

@kailashg26 I think both these things are true. Increasing tokens/sec on a fixed dataset means that your model will be able to learn the same amount in a shorter length of time. But if the dataset is filled with junk, then the more you fine-tune on it the worse your model quality may get.

Also, is there any way that I could find the number of tokens processed per watt?

This I actually don't know offhand. I know WandB logs power usage in their system metrics, so you may be able to infer it from that (though I also don't know how they calculate it).
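A rough, hedged way to estimate it yourself: sample GPU power draw with nvidia-smi while training, then divide the tokens/sec that the metric logger reports by the average wattage (this ignores CPU and host power):

# log GPU power draw once per second to a CSV for the duration of the run
nvidia-smi --query-gpu=timestamp,power.draw --format=csv -l 1 > power_log.csv

Then tokens per watt is approximately (average tokens/sec from the training logs) / (average power.draw from the CSV).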

kailashg26 commented 1 month ago

Thanks @ebsmothers

Could you please let me know how to track the latency (total response time, TRT): in this case, the number of seconds it takes to output 100 tokens?

felipemello1 commented 1 month ago

the number of seconds it takes to output 100 tokens?

@kailashg26 , do you mean during training or generation? If during training and you have a fixed seq_len, I think you can infer that from the TPS, right? But you can always add more logs to your own recipe.
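As a worked example (the throughput number here is assumed, not measured): if the metric logger reports about 2,000 tokens/sec during training, then 100 tokens correspond to roughly 100 / 2000 = 0.05 seconds.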

If during generation, we also log it in the recipe. But you probably wouldn't use our generation recipe as a reference; for deployment you would probably want to serve your model with something like vLLM, which is faster.
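For reference, a sketch of the built-in generation recipe (the recipe/config names and keys are from recent torchtune versions and may differ in yours; its logs include generation time and tokens/sec):

tune run generate --config generation \
  checkpointer.checkpoint_dir=<directory containing your fine-tuned checkpoint> \
  prompt="Tell me about fine-tuning."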

kailashg26 commented 1 month ago

Hi @felipemello1,

I'm currently fine-tuning LLAMA 3.1 on a single Nvidia RTX 4090. I’ve encountered a few issues and have some questions as I explore the fine-tuning process:

Sequence Length and Memory: I’ve noticed that if I set the max sequence length to 2048, the program runs out of memory. Is there a recommended way to manage this on a single device setup?

Tuning Parameters: In the YAML configuration file, I've noticed various knobs and optimization tricks to help speed up training on a domain-specific dataset (I'm currently using the Alpaca dataset). Regarding the model parameters, I can adjust the batch size, but I'm not seeing options to vary the input and output sequence lengths. How can I tune these parameters in the fine-tuning phase to analyze their impact on run time from a systems perspective?

Training Duration: The epoch count is currently set to 1, and I've noticed the loss converges, but I'm wondering whether you would recommend training for longer. Is there any documentation or guidance on suggested parameter setups?

Additional Parameters: Are there any other parameters that might be interesting to tune or analyze from a systems perspective during fine-tuning?

Thanks for the help!

felipemello1 commented 1 month ago

You can lower your memory by (see the sketch below):

  1. Lowering batch_size (and raising gradient_accumulation_steps to keep the effective batch size).
  2. Lowering tokenizer.max_seq_len.
  3. Keeping enable_activation_checkpointing=True.
  4. Using QLoRA instead of LoRA (or a reduced-precision optimizer for full finetunes).
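A sketch of these as CLI overrides against the config at the top of this issue (the QLoRA config name is assumed to follow the same naming pattern as the LoRA one; adjust values to your GPU):

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  batch_size=1 \
  gradient_accumulation_steps=128 \
  tokenizer.max_seq_len=1024 \
  enable_activation_checkpointing=True
# or switch to QLoRA entirely:
tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device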

Regarding training duration, it's hard to say. We don't support early stopping yet (where you run validation after every training epoch and check how well you are doing on the validation set). Ideally, you should keep training until your validation loss plateaus or starts to overfit. Train loss is not the best signal on its own. You could run a test: train for 1 epoch, train for 3 epochs, run eval on both, and check the difference. You can also check the issues/pull requests; I believe someone implemented this on their fork.

kailashg26 commented 1 month ago

And any comments about tuning the input and output sequence lengths? Could you please let me know how to tune them? I see the flag shuffle: True for the dataset. Does that mean the dataset is randomly shuffled and then sampled using the batch size? Also, could you please let me know which combination of datasets I could use for training and eval to estimate accuracy? Just curious: is there any white paper or documentation that explains the complete operational workflow of fine-tuning?

On the eval framework, if we observe an accuracy of 0.81, does that indicate 81%?