huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

After fine-tuning, the model inference is abnormal #1883

Open mumu029 opened 1 week ago

mumu029 commented 1 week ago

System Info

When I used P-Tuning v2 (prefix tuning) to fine-tune GLM, the loss decreased very noticeably during training, but at inference time the model produced a lot of noise. This happens even when I run inference on the training data itself. Example of a training sample:

{'input_ids': [151331, 151333, 151336, 198, 30989, 264, 8543, 3118, 389, 279, 2213, 358, 6551, 498, 624, 2762, 25, 1986, 3395, 4549, 6081, 264, 15805, 23503, 315, 279, 86117, 323, 4763, 18384, 11, 18173, 87497, 18906, 11, 88390, 25107, 11, 31652, 11, 323, 10654, 13, 1084, 40102, 279, 14964, 8356, 315, 86117, 304, 5257, 30358, 11, 2670, 9433, 24162, 11, 6371, 4763, 11, 323, 17917, 5440, 13, 576, 4549, 1083, 14220, 88390, 8775, 11, 5546, 2660, 23406, 11, 12339, 10515, 11, 5777, 323, 4842, 4714, 11, 323, 3853, 18310, 13, 3216, 8240, 458, 304, 30193, 6358, 315, 1493, 13557, 11, 279, 3395, 71242, 279, 9020, 3476, 315, 86117, 304, 47669, 287, 1995, 323, 12339, 304, 279, 7377, 4231, 13, 151337, 198, 34, 46376, 323, 8233, 25, 362, 66376, 10289, 315, 85052, 11, 11702, 21964, 11, 323, 68179, 151329], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 198, 34, 46376, 323, 8233, 25, 362, 66376, 10289, 315, 85052, 11, 11702, 21964, 11, 323, 68179, 151329]}
[gMASK] <sop> <|user|> 
Generate a topic based on the content I gave you.
Content:This review article offers a comprehensive examination of the cryptography and security landscape, covering foundational concepts, cryptographic algorithms, protocols, and standards. It explores the practical applications of cryptography in various domains, including cloud computing, mobile security, and blockchain technology. The article also addresses cryptographic attacks, countermeasures, privacy concerns, legal and policy issues, and future trends. By providing an in-depth analysis of these aspects, the review underscores the critical role of cryptography in safeguarding information and privacy in the digital age. <|assistant|> 
Cryptography and Security: A Comprehensive Review of Algorithms, Protocols, and Challenges <|endoftext|>
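
For reference, the -100 entries in labels mask the prompt tokens so that only the response contributes to the loss. Below is a minimal sketch of how such a sample could be built, assuming the glm-4-9b-chat chat template; the build_sample helper is illustrative and not the actual preprocessing code used here.

# Sketch only: builds one (input_ids, labels) pair with the prompt masked out.
# Assumes the glm-4-9b-chat tokenizer and that tokenizer.eos_token_id maps to
# <|endoftext|> (151329 in the example above); not the issue's real preprocessing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/home/data/glm-4-9b-chat/", trust_remote_code=True)

def build_sample(prompt: str, response: str):
    # Chat-formatted prompt tokens (user turn + assistant prefix).
    prompt_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        tokenize=True,
    )
    # Response tokens followed by the end-of-text token.
    response_ids = tokenizer.encode(response, add_special_tokens=False) + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    # -100 masks the prompt so only the response is scored by the loss.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}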

Some additional context about my setup: I only have 42 training samples (I don't know whether 42 samples with 100 virtual tokens makes sense, but I ran into the problems below whenever I made num_virtual_tokens smaller). In addition, during fine-tuning I occasionally saw the grad norm become NaN (in that case the loss either dropped straight to 0 or training had no effect). When that happened, I either increased num_virtual_tokens or restarted fine-tuning from scratch, which would "solve" this unexpected problem. When I asked others, some suggested catastrophic forgetting, so I added common-sense questions to the training set and tried again, but it still doesn't work.
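
Since a NaN grad norm under fp16 is hard to catch after the fact, here is a minimal sketch of a callback that halts training as soon as a non-finite loss or grad norm is logged. It assumes the standard transformers TrainerCallback API and is not part of the original reproduction.

# Sketch only: stop training when a NaN/inf loss or grad norm shows up in the logs.
import math
from transformers import TrainerCallback

class NanGuardCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        logs = logs or {}
        for key in ("loss", "grad_norm"):
            value = logs.get(key)
            if value is not None and not math.isfinite(float(value)):
                print(f"Stopping: {key} became {value} at step {state.global_step}")
                control.should_training_stop = True
        return control

# Usage: pass callbacks=[NanGuardCallback()] to the Seq2SeqTrainer below.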

Who can help?

No response

Information

Tasks

Reproduction

train.py

import torch
from functools import partial
from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          GenerationConfig, Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_path = "/home/data/glm-4-9b-chat/"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16)

# P-Tuning v2 style prefix tuning; num_attention_heads and token_dim are set
# explicitly here (PEFT would otherwise infer them from the base model's config).
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_attention_heads=2,
    token_dim=256,
    prefix_projection=False,
    num_virtual_tokens=100,
)
model = get_peft_model(model, peft_config)

generation_config = GenerationConfig(
    max_new_tokens=64,
    eos_token_id=[151329, 151336, 151338],
    pad_token_id=151329,
)
train_config = Seq2SeqTrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    max_steps=1000,
    fp16=True,
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    dataloader_num_workers=16,
    per_device_eval_batch_size=2,
    logging_dir="./output",
    log_level="info",
    logging_steps=30,
    evaluation_strategy="steps",
    eval_steps=60,
    save_steps=800,
    predict_with_generate=True,
    remove_unused_columns=False,
    generation_config=generation_config
)
trainer = Seq2SeqTrainer(
    model=model,
    args=train_config,
    train_dataset=train_data,
    eval_dataset=val_data,
    compute_metrics=partial(compute_metrics, tokenizer=tokenizer),
    data_collator=DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        return_tensors="pt",
        padding="longest",
    ),
)
trainer.train()
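
If the hard-coded prefix dimensions are the culprit, one way to rule that out is to let PEFT infer num_attention_heads and token_dim from the base model's config instead of passing 2 and 256, and then check what is actually trainable. A minimal sketch follows, assuming PEFT can read the needed fields from the glm-4-9b-chat config; base_model stands for the freshly loaded model before any PEFT wrapping.

# Sketch only: leave num_attention_heads / token_dim unset so PEFT fills them in
# from the base model's config, then confirm only the prefix parameters train.
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prefix_projection=False,
    num_virtual_tokens=100,
)
peft_model = get_peft_model(base_model, peft_config)  # base_model: raw glm-4-9b-chat model

# Dimensions PEFT inferred for the prefix key/value cache:
print(peft_model.peft_config["default"].token_dim,
      peft_model.peft_config["default"].num_attention_heads)

# Should report a small trainable fraction (just the prompt encoder).
peft_model.print_trainable_parameters()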

eval.py

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_path = "xxxxxxx"
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path, trust_remote_code=True, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model.peft_config["default"].base_model_name_or_path, trust_remote_code=True)
device = model.device  # place inputs on the same device as the model

message = tokenizer.apply_chat_template([{"role" : "user", "content" : "Generate a topic based on the content I gave you.\nContent: This review article provides a comprehensive overview of the state-of-the-art in content-based image and video retrieval (CBIR). It covers the fundamental concepts, advanced techniques, and system design aspects of CBIR, with a focus on recent advancements and future directions. The article discusses the evolution of CBIR systems from early methods to sophisticated techniques, emphasizing the role of deep learning and neural networks. It also addresses the challenges of performance evaluation, the significance of advanced descriptors, and the impact of system architecture and databases. Furthermore, the review explores the trends and future directions in CBIR, including the bridging of the semantic gap, the integration of cross-modal retrieval, and the potential for CBIR to be integrated with other AI technologies. The article aims to serve as a reference for researchers and practitioners in the field, highlighting the dynamic nature of CBIR and its potential to shape the future of multimedia information retrieval."}],
                                        add_generation_prompt=True,
                                        return_tensors="pt",
                                        return_dict=True,
                                        tokenize=True).to(device)

output = model.generate(**message, max_new_tokens=64, do_sample=True, top_p=0.8, temperature=0.8,
                        repetition_penalty=1.2, eos_token_id=model.config.eos_token_id)
# temp = output[0][len(message[0]):]
temp = output[0][len(message["input_ids"][0]):]
print(temp)
output_text = tokenizer.decode(temp)
print(type(output_text))
print(output_text)
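
One additional check that helps separate sampling noise from a broken adapter is to decode greedily on a prompt taken straight from the training set: if even that comes back as gibberish, the sampling settings are not the cause. A minimal sketch, reusing the message built above (not part of the original script):

# Sketch only: greedy decoding on the same prompt, no temperature/top_p randomness.
greedy_output = model.generate(
    **message,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=model.config.eos_token_id,
)
print(tokenizer.decode(greedy_output[0][len(message["input_ids"][0]):]))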

Expected behavior

Expected: Advancements and Future Directions in Content-Based Image and Video Retrieval: A Comprehensive Review
Actual: Gibberish

BenjaminBossan commented 1 week ago

This type of issue is difficult to diagnose at a distance. You already mentioned trying out different hyper-parameters without any improvement.

My biggest suspicion is indeed that the size of the training dataset is too small. Checking the original paper, the datasets used in their experiments contain tens of thousands of samples. Since you only have 42 samples, I would strongly consider whether few-shot prompting might be a better approach for your problem.
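
For reference, few-shot prompting here would mean placing two or three of the 42 training pairs directly into the prompt and letting the plain base model (no adapter) continue the pattern. A minimal sketch, assuming the same glm-4-9b-chat chat template and that model/tokenizer refer to the base model; the example turns are placeholders, not real data.

# Sketch only: few-shot prompting with the base model instead of a trained adapter.
# The placeholder turns should be replaced with pairs from the 42 training samples.
few_shot = [
    {"role": "user", "content": "Generate a topic based on the content I gave you.\nContent: <training abstract 1>"},
    {"role": "assistant", "content": "<training topic 1>"},
    {"role": "user", "content": "Generate a topic based on the content I gave you.\nContent: <training abstract 2>"},
    {"role": "assistant", "content": "<training topic 2>"},
    {"role": "user", "content": "Generate a topic based on the content I gave you.\nContent: <new abstract>"},
]
inputs = tokenizer.apply_chat_template(few_shot, add_generation_prompt=True,
                                       return_tensors="pt", return_dict=True, tokenize=True).to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][len(inputs["input_ids"][0]):]))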