
Fine-tune on ehartford/dolphin #376


arnocandel commented 1 year ago

Histograms of byte lengths of instruction/input/output for https://huggingface.co/datasets/ehartford/dolphin:

```python
>>> np.histogram(lins)
(array([ 436378,   66227, 2107239,   33373,  925465,   69247,   63003,
              0,       0,   31015]),
 array([  0.,  40.,  80., 120., 160., 200., 240., 280., 320., 360., 400.]))
>>> np.histogram(linp)
(array([3590631,   82296,   36930,   15174,    5337,     835,     366,
            231,     122,      25]),
 array([1.20000e+01, 4.06650e+03, 8.12100e+03, 1.21755e+04, 1.62300e+04,
        2.02845e+04, 2.43390e+04, 2.83935e+04, 3.24480e+04, 3.65025e+04,
        4.05570e+04]))
>>> np.histogram(loutp)
(array([3589412,  132897,    9457,      99,      29,      25,      21,
              4,       2,       1]),
 array([1.00000e+00, 1.67270e+03, 3.34440e+03, 5.01610e+03, 6.68780e+03,
        8.35950e+03, 1.00312e+04, 1.17029e+04, 1.33746e+04, 1.50463e+04,
        1.67180e+04]))
```
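A minimal sketch of how these histograms can be reproduced, assuming the dataset loads directly via the `datasets` library and exposes `instruction`/`input`/`output` fields (as the variable names above suggest):

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("ehartford/dolphin", split="train")

# Byte lengths per field, matching lins/linp/loutp above.
lins = np.array([len(x.encode("utf-8")) for x in ds["instruction"]])
linp = np.array([len(x.encode("utf-8")) for x in ds["input"]])
loutp = np.array([len(x.encode("utf-8")) for x in ds["output"]])

for lengths in (lins, linp, loutp):
    print(np.histogram(lengths))
```
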
arnocandel commented 1 year ago

Fine-tuning speed comparisons on 2×A6000 Ada + 4090

Falcon 7B 16-bit

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b &> dolphin.7b.txt
# progress: 0%| | 5/29155 [00:43<72:18:22, 8.93s/it]
```

Falcon 7B 8-bit

```bash
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --train_8bit=True &> dolphin.7b.txt
# progress: 0%| | 14/31099 [04:12<153:34:34, 17.79s/it]
```

Falcon 7B 4-bit

```bash
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --train_4bit=True &> dolphin.7b.txt
# progress: 0%| | 1/31099 [00:19<167:47:13, 19.42s/it]
```
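For reference, converting the tqdm readouts above into wall-clock and GPU-hour estimates is simple arithmetic (seconds/iteration × total iterations); note that on this box the 8-bit and 4-bit runs are roughly 2× slower per iteration than 16-bit despite the smaller memory footprint:

```python
# Rough totals from the tqdm readouts above: seconds/iteration x iterations,
# scaled by the number of GPUs used in each run.
runs = {
    "16-bit, 2 GPUs": (8.93, 29155, 2),
    "8-bit,  3 GPUs": (17.79, 31099, 3),
    "4-bit,  3 GPUs": (19.42, 31099, 3),
}
for name, (sec_per_it, n_iters, n_gpus) in runs.items():
    wall_h = sec_per_it * n_iters / 3600
    print(f"{name}: ~{wall_h:.0f} h wall clock, ~{wall_h * n_gpus:.0f} GPU-hours")
# 16-bit: ~72 h (~145 GPU-hours); 8-bit: ~154 h (~461); 4-bit: ~168 h (~503)
```
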

arnocandel commented 1 year ago

Falcon 7B 16-bit, 2048 context, full LoRA, 0.2 epochs - test run

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --num_epochs=0.2 --lora_target_modules='["query_key_value", "dense_h_to_4h", "dense_4h_to_h", "dense"]' &> dolphin.7b.txt
```

OOM on 2×48 GB.
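For context, the `--lora_target_modules` list above puts LoRA adapters on every linear layer of the Falcon block (attention projection plus both MLP projections and the output dense layer), which is what "full LoRA" means here. A minimal PEFT sketch of that configuration; the rank/alpha/dropout values are illustrative, not necessarily the finetune.py defaults:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                # illustrative rank, not the finetune.py default
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value", "dense_h_to_4h", "dense_4h_to_h", "dense"],
    task_type="CAUSAL_LM",
)
```
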

arnocandel commented 1 year ago

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --num_epochs=0.2 &> dolphin.7b.txt
```

OOM too, so a 2048 context is too much in 16-bit even without the full set of LoRA modules.

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=1536 --drop_truncations=True --num_epochs=0.2 &> dolphin.7b.txt
```

OOM as well. Falling back to 8-bit training:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --num_epochs=0.2 --train_8bit=True
```
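A sketch of what `--train_8bit=True` presumably maps to under the hood, assuming the usual bitsandbytes recipe (not the exact h2ogpt code path):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training

# Load the base model with int8 weights (bitsandbytes) to cut memory roughly
# in half versus fp16, then prepare it for LoRA training.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
```
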

arnocandel commented 1 year ago

Probably overfit on instructions:

```bash
CUDA_VISIBLE_DEVICES=2 python generate.py --base_model=tiiuae/falcon-7b --lora_weight=falcon-7b.ehartforddolphin.0.2_epochs.0b8d30ad31bcb7762468f8f5fa6c46f04451caad.0/checkpoint-2296 --load_8bit=True --prompt_type=human_bot
```

(screenshot of sample generation)

arnocandel commented 1 year ago

LoRA checkpoints and logs: https://slack-files.com/T0329MHH6-F05G3HMMB32-5ccc51d96c

arnocandel commented 1 year ago

Again, this time without training on the instruction tokens, to avoid overfitting on them:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --num_epochs=0.2 --train_8bit=True --train_on_inputs=False &> dolphin.7b.notraininstructions.log
```

Logs and checkpoints/LoRA weights (130 GPU-hours): https://slack-files.com/T0329MHH6-F05FWQ0LRUP-ea1604b9b8
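`--train_on_inputs=False` typically means the loss is masked out on the prompt tokens, so the model only learns to produce the response. A minimal sketch of that labeling convention, assuming the standard `-100` ignore index used by Hugging Face causal-LM losses (not the exact finetune.py code):

```python
def build_labels(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response token ids, masking the prompt labels."""
    input_ids = prompt_ids + response_ids
    # -100 is ignored by the cross-entropy loss, so only response tokens train.
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```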

```bash
CUDA_VISIBLE_DEVICES=2 python generate.py --base_model=tiiuae/falcon-7b --lora_weight=falcon-7b.ehartforddolphin.0.2_epochs.c432387e2099171f2332a0da1126103fc549cba7.0 --load_8bit=True --prompt_type=human_bot
```

(screenshot of sample generation)

arnocandel commented 1 year ago

Again with --prompt_type=human_bot, but without cleaning up the data yet, just for a quick sanity check:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --prompt_type=human_bot --num_epochs=0.01 --train_8bit=True --train_on_inputs=False &> dolphin.7b.human_bot.log
```

All good. Now doing 0.1 epochs, still training on the instructions (system prompt), even though we might not need or want that, and still without personalization (yet):

```bash
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --prompt_type=human_bot --num_epochs=0.1 --train_8bit=True --train_on_inputs=False &> dolphin.7b.human_bot.log
```
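For reference, prompt_type=human_bot lays each example out with h2ogpt's `<human>:`/`<bot>:` turn markers; a rough sketch (exact whitespace and the system-prompt handling here are assumptions, not the exact prompter code):

```python
def human_bot_prompt(instruction: str, inp: str = "", system: str = "") -> str:
    """Approximate human_bot layout: optional system text, then one turn pair."""
    context = f"{system}\n" if system else ""
    human = f"{instruction}\n{inp}" if inp else instruction
    return f"{context}<human>: {human}\n<bot>: "
```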

arnocandel commented 1 year ago

On 2×A100 80 GB, a full epoch in 4-bit:

```bash
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=ehartford/dolphin --base_model=tiiuae/falcon-7b --cutoff_len=2048 --drop_truncations=True --prompt_type=human_bot --num_epochs=1 --train_on_inputs=False --train_4bit=True &> dolphin.7b.human_bot.log
```
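A sketch of the 4-bit (QLoRA-style) load that `--train_4bit=True` presumably corresponds to (assumed; not the exact h2ogpt code path):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bf16 compute (A100 supports bfloat16).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb,
    trust_remote_code=True,
)
```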