kailashg26 opened this issue 1 month ago
Hi @kailashg26 - did you download the model to the location specified at the top of the config?
```
tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
```
Hi @joecummings
I did try that just now, but unfortunately I get an error:
```
tune download: error: Failed to download meta-llama/Meta-Llama-3.1-8B-Instruct with error: 'An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.' and traceback:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/0e9e39f249a16976918f6564b8830bc894c89659/.gitattributes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1746, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 468, in hf_raise_for_status
    raise _format(HfHubHTTPError, message, response) from e
huggingface_hub.errors.HfHubHTTPError: (Request ID: Root=1-66f5ccd5-3d403ea47109c46d5f04a9c4;0a3f40fb-d4e3-400c-b9b6-18fd89e310b7)
403 Forbidden: Please enable access to public gated repositories in your fine-grained token settings to view this repository.. Cannot access content at: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/0e9e39f249a16976918f6564b8830bc894c89659/.gitattributes. Make sure your token has the correct permissions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/download.py", line 126, in _download_cmd
    true_output_dir = snapshot_download(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 290, in snapshot_download
    thread_map(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 264, in _inner_hf_hub_download
    return hf_hub_download(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1212, in hf_hub_download
    return _hf_hub_download_to_local_dir(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1461, in _hf_hub_download_to_local_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1857, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
```
The Llama models are "gated". This simply means you need to fill out some information in order to download the model. This can all be done from the model card page on the Hugging Face Hub: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.
After you do this, you should see a tag on the model card saying you've been granted access to the model. The whole process should take less than 10 minutes.
After this, try the above command again. Should work!
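One hedged aside: even after access is granted, the download needs a local token with read access to gated repos. I believe the CLI accepts an `--hf-token` flag, but run `tune download --help` to confirm; either log in once or pass the token explicitly:

```
huggingface-cli login
# or
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth" \
  --hf-token <YOUR_HF_TOKEN>
```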
Hi @joecummings I did see that I have been granted access. I had to change some permissions and it works now. Thanks! Btw, how long does it take to fine-tune Llama 3.1 with the Alpaca dataset? Also, is there any way I could monitor model-level and system-level events like memory, energy, and memory bandwidth?
@kailashg26 training time will depend on a bunch of things, including whether you're running a full/LoRA/QLoRA finetune, whether you have activation checkpointing enabled, whether you're performing sample packing, batch size, sequence length, how many (and what kind of) devices you're running on, and more. I recently ran some tests and was able to run an epoch of QLoRA training on an A100 (with Llama 3, not 3.1, but it should be similar) in as fast as 36 minutes (but again, it depends on all of the above). You can check out slides 23-29 of this presentation for more details. As a starting point, I'd recommend setting `compile=True` and `dataset.packed=True` if you want to reduce your training time. For a packed dataset you also need to set `tokenizer.max_seq_len`. This may require some experimentation depending on how much memory you have; you can try e.g. 2048 as a starting point.
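For example, as a minimal sketch (the 2048 is just an illustrative starting point, not a recommendation for your exact hardware), all of these can be passed as CLI overrides:

```
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  compile=True \
  dataset.packed=True \
  tokenizer.max_seq_len=2048
```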
For monitoring system-level metrics, we log tokens/second by default (see here) and will also log peak memory stats if you set `log_peak_memory_stats=True`. We support different logging backends like WandB, TensorBoard, and Comet if you use any of those. If you also want to log time-to-first-batch or other custom metrics we don't currently support, I'd recommend copying the recipe and then modifying your local version (happy to provide pointers on where/how any particular metrics should be logged).
Thanks @ebsmothers I'll take a look at that. Thanks for the pointers. I'll try them and get back if I have questions.
Hi @ebsmothers ,
I successfully executed this command:
```
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
```
I'm wondering how to run inference with the fine-tuned model and interpret the output. Any suggestions or documentation that will walk me through it? Additionally, I would also like to understand how the workload is balanced on a single CPU-GPU system, e.g. which portions of the code are CPU- or GPU-bound. It would be great if you could point me to the appropriate documentation for this :)
Thanks and appreciate the help!
@kailashg26 that depends on what you are trying to do. If the full training loop completed there should be a checkpoint saved on your local filesystem. You can evaluate the quality of your fine-tuned model by using it to generate some text or by evaluating on a common benchmark (e.g. using our integration with EleutherAI's eval harness). You can check out our Llama3 tutorial here, which has sections on both of these (everything in there should be equally applicable to Llama 3.1).
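If it helps as a starting point, both of these are exposed as recipes in the tune CLI. This is a sketch from memory, so the recipe/config names may differ in your installed version; `tune ls` will show the exact names, and you'll want to point the checkpointer at your fine-tuned output directory:

```
# generate text with the fine-tuned checkpoint
tune run generate --config generation

# evaluate on common benchmarks via the EleutherAI eval harness integration
tune run eleuther_eval --config eleuther_evaluation
```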
Regarding determining whether code is CPU- or GPU-bound, you can use our integration with the PyTorch profiler. Just set `profiler.enabled=True` in your config. This will output a trace file that you can then view in e.g. Perfetto. The full set of profiler configurations we support can be seen for example here.
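For example (same single-device config as above), you can flip that flag from the command line without editing the YAML:

```
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  profiler.enabled=True
```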
Hi @ebsmothers,
Thanks for your response. I’m trying to understand the interaction between hardware parameters and workload characteristics. This is my first step toward learning and fine-tuning LLMs. I’m currently focusing on parameters that could help me study the trade-offs between training time, power consumption, energy, and cache metrics (including misses) from a systems perspective.
From an algorithmic perspective, I’m working on a resource-constrained device (a single Nvidia 4090 GPU) and am interested in observing trade-offs between these metrics. I was thinking about varying parameters like batch size and sequence length, but I’m not entirely sure how these affect the underlying hardware. Any insights or documentation on these metrics would be really helpful!
I’m also experimenting with data types, using bf16, but I assume switching to fp32 would result in running out of memory when training a Llama 3.1 8B model. I’m also considering different training approaches (full, LoRA, and QLoRA) and how they impact performance and resource utilization.
Since I’m just getting started, are there any other knobs or parameters you’d recommend studying to better understand LLMs from a systems perspective?
Also, could you please help me find the two metrics below:
Thanks in advance.
@kailashg26, you may want to take a look at this recent talk by Evan showing the impact of some of these trade-offs: https://www.youtube.com/watch?v=43X9E25-Qg0
And this one about optimization by Jane: https://www.youtube.com/watch?v=xzBcBJ8_rzM
Also, Weights & Biases is probably your best friend here. Do `pip install wandb`, create an account on their website, and run your config like this:

```
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
  log_peak_memory_stats=True
```

That should be a good start.
Thanks, this was very helpful! In fine-tuning, say I want to track the execution time for different parts of the code: could you suggest any ways to do that? Or should I just break the code into multiple segments and profile the blocks?
I enabled all the performance logging knobs and observed this in the profiling outputs when I ran the gemma-2B model.
Could you please let me know what the CPU and CUDA time account for in this case? Also, how should I interpret the functions shown in the screenshot, and what takeaways should I draw from them?
Thanks in advance for all the guidance you have been offering. Appreciate it!
Like Evan mentioned, to see CPU and CUDA execution time, you would have to enable the profiler and inspect the resulting trace.
To make things easier, trace only one step so it produces a small file, and run your model for only a few steps (e.g. <10) so it's fast to try out.
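A hedged sketch of that workflow (this assumes the config exposes a `max_steps_per_epoch` field to cap the number of steps; check your recipe's YAML for the exact field name):

```
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  profiler.enabled=True \
  max_steps_per_epoch=10
```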
Thanks! That worked!
I had a doubt regarding the max seq len being `null`. Was there a specific reason for this? What is the default length it takes if I don't set `dataset.packed=True`?
Thanks!
Hi @kailashg26, `max_seq_len=null` just means that the tokenizer will not truncate the sequences. If you're running without `dataset.packed=True`, this means that the sequences will be padded to the maximum length in the batch (this is done in the collate function).
I see, thanks. What does compile mean, and why does it tend to reduce training time?
If I understood things right, torchtune currently supports the following optimizations to reduce fine-tuning time on a single device: parameter-efficient fine-tuning, gradient checkpointing, reduced precision (bf16), and dataset packing.
Could you please let me know if I missed any other optimizations?
@kailashg26 compile here means `torch.compile`. It will compile the model graph and fuse various operations so that they can run faster. I'd recommend checking out these resources if you want to know a bit more: [1], [2]. I think your summary covers most items we use to speed up training (and of course `torch.compile`).
A couple of other small points: reduced-precision optimizers like `bitsandbytes.optim.AdamW8bit` may also give some slight speedup (though this is more relevant for full finetuning than for e.g. LoRA, where very few parameters are trainable). Also, parameter-efficient fine-tuning is not guaranteed to speed up training time universally; e.g. if you are doing QLoRA there are additional upcasts from NF4 to bf16 that need to happen, which makes it a bit slower than vanilla LoRA. But LoRA should definitely be quite a bit faster than full finetuning.
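If you want to try the reduced-precision optimizer, here is a rough sketch (assumptions: bitsandbytes is installed, your config defines the optimizer via the usual `_component_` field, and the full-finetune recipe/config names match your installed version):

```
pip install bitsandbytes

tune run full_finetune_single_device --config llama3_1/8B_full_single_device \
  optimizer._component_=bitsandbytes.optim.AdamW8bit
```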
Thanks for the information @ebsmothers, it helps a lot. I appreciate the prompt responses as well. So, just curious to understand: typically when we fine-tune the model with a dataset, the more tokens processed, the better it trains in terms of system-level performance? Or do you suggest performing inference on the trained model and checking the accuracy?
Also, is there any way I could find the number of tokens processed per watt? I'm gaining a lot of insights from just a week of experiments! Thanks for the interesting presentation and details :)
> The more tokens processed, the better it trains in terms of system-level performance? Or do you suggest performing inference on the trained model and checking the accuracy?
@kailashg26 I think both these things are true. Increasing tokens/sec on a fixed dataset means that your model will be able to learn the same amount in a shorter length of time. But if the dataset is filled with junk, then the more you fine-tune on it the worse your model quality may get.
> Also, is there any way that I could find the number of tokens processed per watt?
This I actually don't know offhand. I know WandB will log power usage in their system metrics so you may be able to infer it based on that (I also don't know how they are calculating it though).
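One back-of-the-envelope option (illustrative numbers, not measurements): divide the logged throughput by the average power WandB reports. For example, 3000 tokens/sec at an average draw of 300 W works out to 3000 / 300 = 10 tokens per joule, or about 36,000 tokens per watt-hour.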
Thanks @ebsmothers
Could you please let me know how to track latency (total response time, TRT): in this case, the number of seconds it takes to output 100 tokens?
> the number of seconds it takes to output 100 tokens?
@kailashg26, do you mean during training or generation? If during training, and you have a fixed seq_len, I think you can infer that from the logged tokens/sec, right? But you can always add more logs to your own recipe.
If during generation, we also log it in the recipe. But you probably wouldn't use our generation recipe as a reference; you would probably want to deploy your model using something like vLLM, which is faster.
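As a rough rule of thumb for the training case: if the logged throughput is T tokens/sec, then 100 tokens take about 100 / T seconds (e.g. at 2,500 tokens/sec that's 0.04 s). For generation latency you'd want to measure it directly in whatever serving stack you deploy with.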
Hi @felipemello1,
I'm currently fine-tuning LLAMA 3.1 on a single Nvidia RTX 4090. I’ve encountered a few issues and have some questions as I explore the fine-tuning process:
Sequence Length and Memory: I’ve noticed that if I set the max sequence length to 2048, the program runs out of memory. Is there a recommended way to manage this on a single device setup?
Tuning Parameters: In the YAML configuration file, I’ve noticed various knobs and optimization tricks to help speed up training on a domain-specific dataset (I'm currently using the Alpaca dataset).
Regarding the model parameters, I can adjust the batch size, but I’m not seeing options to vary the input sequence length and output sequence length. How can I tune these parameters in the fine-tuning phase to analyze their impact on system run-time from a systems perspective?
Training Duration: The current epoch is set to 1, and I’ve noticed the loss converges, but I’m wondering if you would recommend training for longer? Is there any documentation or guidelines on suggested parameter setups?
Additional Parameters: Are there any other parameters that might be interesting to tune or analyze from a systems perspective during fine-tuning?
Thanks for the help!
You can lower your memory usage by lowering the batch size, enabling activation checkpointing, lowering `tokenizer.max_seq_len`, using LoRA/QLoRA instead of a full finetune, and using a reduced-precision optimizer (all of these were touched on earlier in this thread; a sketch of the corresponding CLI overrides is below).
Regarding training duration, it's hard to say. We don't support early stopping yet (where you run validation after every training epoch and check how well you are doing on the validation set). Ideally, you should keep training until your validation loss plateaus or the model starts to overfit; train loss is not the best signal. You could do a test: train for 1 epoch, train for 3 epochs, run eval on both, and check the difference. You can also check the issues/pull requests; I believe someone implemented this on their fork.
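A hedged sketch of those memory-reducing overrides (field names follow the single-device LoRA config used earlier in this thread; treat the values as starting points for your 4090, not recommendations):

```
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  batch_size=1 \
  enable_activation_checkpointing=True \
  dataset.packed=True \
  tokenizer.max_seq_len=1024
```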
And any comments about the input and output sequence length tuning? Could you please let me know how to tune them?
I see this flag for the dataset: `shuffle: True`. Does that mean the dataset is randomly shuffled and then sampled using the batch size?
Also, could you please let me know the combination of datasets that I could use for training and eval to estimate the accuracy?
Just curious: is there any white paper or documentation that explains the complete operational workflow of fine-tuning?
On the eval framework, if we observe an accuracy of 0.81, does that indicate 81%?
Command: `tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device`
Output:
Can anyone help me with this?