Open MrGGLS opened 1 week ago
llama3-sft.yaml
```yaml
### model
model_name_or_path: models/llama-3-8b-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: sft_data_mixed_v1.0_sharegpt_dmg27l70q72_0.4
template: llama3
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3/sft_data_mixed_v1.0_sharegpt_dmg27l70q72_0.4_neatpacking_3epo_lr2e-5
logging_steps: 5
save_steps: 10086
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 4
learning_rate: 2.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### custom
do_eval: false
packing: true
neat_packing: true
flash_attn: fa2
save_strategy: "no"
save_total_limit: 1
seed: 42
save_only_model: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: no
# eval_steps: 500
```
The training process is fine, but the saved tokenizer file is much larger than the one shipped with the original llama-3:

- `tokenizer.json` (from llama3): 8.66 MB
- `tokenizer.json` (from trained model): 16.44 MB
When performing inference using vllm later on, an error is reported:
```
... Exception: data did not match any variant of untagged enum ModelWrapper at line 1250944 column 3
```
I need to manually replace the saved tokenizer files with the original ones in order to run inference normally.
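The manual workaround can be sketched as a small helper (the function name and the exact file list are my assumptions; adjust the paths to your own checkpoints):

```python
import shutil
from pathlib import Path

# Files that PreTrainedTokenizer.save_pretrained typically writes for llama-3;
# this list is an assumption, extend it if your checkpoint has more.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json")

def restore_tokenizer(base_model_dir: str, trained_model_dir: str) -> list:
    """Overwrite the (bloated) saved tokenizer files with the originals."""
    copied = []
    for name in TOKENIZER_FILES:
        src = Path(base_model_dir) / name
        if src.exists():
            shutil.copy(src, Path(trained_model_dir) / name)
            copied.append(name)
    return copied
```

Usage would look like `restore_tokenizer("models/llama-3-8b-Instruct", "saves/llama3/sft_data_mixed_v1.0_sharegpt_dmg27l70q72_0.4_neatpacking_3epo_lr2e-5")`, after which vLLM loads the checkpoint normally.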
save problem!
Same issue
I found that this issue was caused by the transformers version. I was using the latest release before; after downgrading to 4.43.4, the problem was resolved.
> 4.43.4
similar to https://github.com/huggingface/transformers/issues/33774