pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Issue with Llama-3 tokenizer on custom config #1022

Closed jyrana closed 5 months ago

jyrana commented 5 months ago

I was running a custom config and got this error:

File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/models/llama3/_model_builders.py", line 69, in llama3_tokenizer tiktoken = TikTokenTokenizer(path) ^^^^^^^^^^^^^^^^^^^^^^^ File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/modules/tokenizers/_tiktoken.py", line 108, in init mergeable_ranks = load_tiktoken_bpe(self.path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/ext3/miniconda3/lib/python3.12/site-packages/tiktoken/load.py", line 150, in load_tiktoken_bpe for token, rank in (line.split() for line in contents.splitlines() if line) ^^^^^^^^^^^ ValueError: not enough values to unpack (expected 2, got 1)

Actually, I am not running Llama-3 exactly, but a fine-tuned version of it.

I was able to load the model, as you can see here:

```
INFO:torchtune.utils.logging:Logging models/original/torchtune_config.yaml to W&B under Files
INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils.logging:Memory stats after model init:
        GPU peak memory allocation: 16.64 GB
        GPU peak memory reserved: 16.66 GB
        GPU peak memory active: 16.64 GB
Traceback (most recent call last):
  File "/scratch/knv2014/llama_torchtune/torchtune/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/scratch/knv2014/llama_torchtune/torchtune/recipes/lora_finetune_single_device.py", line 564, in <module>
    sys.exit(recipe_main())
             ^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/recipes/lora_finetune_single_device.py", line 558, in recipe_main
    recipe.setup(cfg=cfg)
  File "/scratch/knv2014/llama_torchtune/torchtune/recipes/lora_finetune_single_device.py", line 200, in setup
    self._tokenizer = config.instantiate(cfg.tokenizer)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/config/_instantiate.py", line 106, in instantiate
    return _instantiate_node(config, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/config/_instantiate.py", line 31, in _instantiate_node
    return _create_component(_component_, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/models/llama3/_model_builders.py", line 69, in llama3_tokenizer
    tiktoken = TikTokenTokenizer(path)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/knv2014/llama_torchtune/torchtune/torchtune/modules/tokenizers/_tiktoken.py", line 108, in __init__
    mergeable_ranks = load_tiktoken_bpe(self.path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ext3/miniconda3/lib/python3.12/site-packages/tiktoken/load.py", line 150, in load_tiktoken_bpe
    for token, rank in (line.split() for line in contents.splitlines() if line)
                        ^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
```

Here is my config file:

```yaml
# Model Arguments
model:
  _component_: torchtune.models.llama3.lora_llama3_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: models/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: ./models/original
  checkpoint_files: [
    model.pth
  ]
  recipe_checkpoint: null
  output_dir: ./out
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  train_on_input: True
  max_seq_len: 256
seed: null
shuffle: True
batch_size: 47

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 64
compile: False

# Logging
output_dir: ./out/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
  log_dir: torchtune-llama3
log_every_n_steps: 5

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: True

# Profiler (disabled)
profiler:
  _component_: torchtune.utils.profiler
  enabled: False
```
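For context on where the traceback comes from: each `_component_` entry is resolved to a Python callable and called with the sibling keys as keyword arguments. A simplified sketch of that behavior (not the actual torchtune implementation, which also handles OmegaConf nodes, nesting, and positional args):

```python
import importlib

# Simplified sketch of how a `_component_` entry is resolved and called.
def instantiate(node: dict):
    module_path, _, name = node["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_path), name)
    kwargs = {k: v for k, v in node.items() if k != "_component_"}
    return component(**kwargs)

# This mirrors the failing call in the recipe's setup():
# tokenizer = instantiate({
#     "_component_": "torchtune.models.llama3.llama3_tokenizer",
#     "path": "models/original/tokenizer.model",
# })
```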

I have tried changing the path to the JSON file as well, but I guess the BPE format is different. Can you help me resolve this error?
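One way to check which format a file actually is: a tiktoken BPE file is plain text with one base64/rank pair per line, a sentencepiece `tokenizer.model` is a protobuf binary, and an HF `tokenizer.json` starts with `{`. A quick, hypothetical sanity check:

```python
import base64

# Hypothetical check: does this file look like the tiktoken BPE format
# that llama3_tokenizer expects ("<base64-token> <rank>" per line)?
def looks_like_tiktoken_bpe(path: str) -> bool:
    with open(path, "rb") as f:
        fields = f.readline().split()
    if len(fields) != 2:
        return False
    try:
        base64.b64decode(fields[0], validate=True)  # raises on non-base64
        int(fields[1])                              # raises on non-integer rank
        return True
    except ValueError:
        return False

print(looks_like_tiktoken_bpe("models/original/tokenizer.model"))
```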

wandgibaut commented 5 months ago

Anyone got a solution?

jyrana commented 5 months ago

You can just use Llama's tokenizer, or, if your new model was fine-tuned with its own tokenizer, you have to convert that tokenizer into (token, rank) pairs; see the sketch below.
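A rough sketch of that conversion, assuming the fine-tuned model ships a Hugging Face `tokenizer.json` with a GPT-2-style byte-level BPE vocab (all paths here are hypothetical, and special tokens are left out since torchtune's tokenizer adds those itself):

```python
import base64
import json

# GPT-2-style byte<->unicode mapping used by Llama-3-style byte-level BPE:
# every raw byte is represented by a printable unicode character in the
# tokenizer.json vocab, so we must map it back to bytes before base64-encoding.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

unicode_to_bytes = {c: b for b, c in bytes_to_unicode().items()}

# Hypothetical input: the fine-tuned model's HF tokenizer file.
with open("tokenizer.json") as f:
    vocab = json.load(f)["model"]["vocab"]  # token string -> rank

# Hypothetical output: a tiktoken-style "<base64-token> <rank>" file.
# Special/added tokens live under "added_tokens" in tokenizer.json and
# are handled by torchtune separately, so only the base vocab is written.
with open("tokenizer.model", "w") as out:
    for token, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
        raw = bytes(unicode_to_bytes[ch] for ch in token)
        out.write(f"{base64.b64encode(raw).decode()} {rank}\n")
```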

ebsmothers commented 5 months ago

@wandgibaut please feel free to reopen this issue or open a new one if you're still having problems here.