Open felipemello1 opened 1 month ago
These are really interesting experiments. What would be a bit more helpful is to plot every (bsz, max_seq_len) combination on a single chart, with WPS on one axis and peak memory on the other, so the memory/speed tradeoff is visible at a glance. That said, even with the current layout I find it interesting that increasing max_seq_len improves throughput only up to a certain point, after which it hurts, and that batch size seems to be the biggest contributor to throughput, even more than the speed-optimized config.
Another thing that would be helpful is if you could list out what flags you're modifying for the memory and speed optimized configs.
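The single-chart view suggested above could be sketched like this. The numbers are made up purely for illustration, and `pareto_front` is a hypothetical helper for highlighting the non-dominated (WPS, peak-memory) configs:

```python
# Sketch: one scatter of all (bsz, max_seq_len) combos, WPS vs. peak memory.
# `results` is fake data; real numbers would come from the profiling script.

def pareto_front(results):
    """Return configs not dominated on (higher wps, lower peak_mem_gb)."""
    front = []
    for r in results:
        dominated = any(
            o["wps"] >= r["wps"] and o["peak_mem_gb"] <= r["peak_mem_gb"] and o != r
            for o in results
        )
        if not dominated:
            front.append(r)
    return front

results = [
    {"bsz": 1, "max_seq_len": 1024, "wps": 900,  "peak_mem_gb": 9.0},
    {"bsz": 2, "max_seq_len": 1024, "wps": 1500, "peak_mem_gb": 13.5},
    {"bsz": 4, "max_seq_len": 2048, "wps": 2100, "peak_mem_gb": 21.0},
    # seq len pushed too far: slower AND more memory, so it's dominated
    {"bsz": 4, "max_seq_len": 4096, "wps": 1800, "peak_mem_gb": 30.0},
]

print(pareto_front(results))

# Plotting (optional, requires matplotlib):
# import matplotlib.pyplot as plt
# for r in results:
#     plt.scatter(r["peak_mem_gb"], r["wps"])
#     plt.annotate(f"bsz={r['bsz']}, seq={r['max_seq_len']}",
#                  (r["peak_mem_gb"], r["wps"]))
# plt.xlabel("peak memory (GB)"); plt.ylabel("WPS"); plt.show()
```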
I created a script to check tokens_per_second and peak_memory for any config/recipe while testing different flags. It still needs some polishing, but I wanted to get feedback on the results before going further and submitting a PR.
Goal:
To have a page with these 2 graphs and a table for every model config (that's a lot of configs :O), so users can understand their options.
PS: Updating all of these every time we make a major change sounds frustrating. It would be nice to automate it with CI, but is that possible?
TLDR:
1) Take the DEFAULT config and iterate over [bsz, max_seq_len].
2) Apply all MEMORY-related flags and iterate over [bsz, max_seq_len].
3) Apply all SPEED-related flags and iterate over [bsz, max_seq_len].
4) Take the smallest bsz and max_seq_len that run without OOM, then toggle the flags one by one so we can measure each flag's impact.
IMPORTANT: the default values here differ from the ones in our shipped configs. They are the values without optimizations, e.g. no activation checkpointing, regular Adam, etc.
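The four steps above can be sketched as a small driver loop. `run_recipe`, `MEMORY_FLAGS`, `SPEED_FLAGS`, and the grid values are all hypothetical placeholders, not the script's actual API:

```python
# Sketch of the TLDR sweep, assuming a helper run_recipe(config) that
# launches a recipe and returns (tokens_per_second, peak_memory_gb).
import itertools

BSZ_GRID = [1, 2, 4, 8]          # illustrative grids
SEQ_GRID = [512, 1024, 2048, 4096]

def run_recipe(config):
    # Placeholder: a real version would launch the recipe and profile it.
    return 0.0, 0.0

def sweep(base_config, extra_flags=None):
    """Steps 1-3: iterate (bsz, max_seq_len) under a given flag set."""
    results = {}
    for bsz, seq in itertools.product(BSZ_GRID, SEQ_GRID):
        config = {**base_config, **(extra_flags or {}),
                  "batch_size": bsz, "max_seq_len": seq}
        try:
            results[(bsz, seq)] = run_recipe(config)
        except RuntimeError:  # e.g. CUDA OOM
            results[(bsz, seq)] = None
    return results

def ablate(base_config, flags, bsz, seq):
    """Step 4: toggle flags one at a time at a fixed, OOM-safe shape."""
    shape = {"batch_size": bsz, "max_seq_len": seq}
    return {name: run_recipe({**base_config, **shape, name: value})
            for name, value in flags.items()}
```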
Outcomes:
Under the hood, the flags look like this:
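The actual flag listing didn't survive in this snippet; purely as an illustration of the grouping (every flag name below is an assumption, not a confirmed config key), the two sets might look like:

```python
# Hypothetical flag groups -- names are illustrative, not actual config keys.
MEMORY_FLAGS = {
    "enable_activation_checkpointing": True,  # trade compute for memory
    "low_precision_optimizer": True,          # e.g. an 8-bit optimizer
}
SPEED_FLAGS = {
    "compile": True,        # compile the model / training step
    "packed_dataset": True, # pack short samples up to max_seq_len
}
```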
Blockers
I want these three low-hanging fruits to land first:
Run:
```
python profile_config.py --yaml_name llama3_1/8B_lora_single_device --recipe_type lora_finetune_single_device
```
Outputs (still needs polishing):
max_seq_len = 1024
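For context on how the two reported metrics can be obtained, here is a minimal sketch (the `step_fn` / `num_tokens_per_step` interface is hypothetical; only the `torch.cuda` calls in the comment are the standard PyTorch peak-memory API):

```python
# Sketch of measuring tokens_per_second over a few training steps.
import time

def profile_steps(step_fn, num_tokens_per_step, num_steps=10):
    """Return (tokens_per_second, seconds_elapsed) for `num_steps` calls."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_steps * num_tokens_per_step / elapsed, elapsed

# Peak memory (CUDA) would be read around the same loop:
# torch.cuda.reset_peak_memory_stats()
# ... run the steps ...
# peak_gb = torch.cuda.max_memory_allocated() / 1024**3
```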