wangyuxinwhy / uniem

unified embedding model
Apache License 2.0
814 stars 61 forks

After fine-tuning the m3e-large model with the code below, completely unrelated sentences get extremely high similarity scores. How should I fix this? #84

Closed: twwch closed this issue 1 year ago

twwch commented 1 year ago

🐛 Bug description

After fine-tuning the m3e-large model with the code below, its ability to reject unrelated pairs drops to zero. How should I fix this?

from datasets import load_dataset
from uniem.finetuner import FineTuner
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

# Load the Chinese NLI dataset and rename its columns to the
# sentence1/sentence2 pair format expected by FineTuner
dataset = load_dataset('shibing624/nli-zh-all', cache_dir='cache')
new_dataset = dataset.rename_columns({'text1': 'sentence1', 'text2': 'sentence2'})

hug_path = 'moka-ai/m3e-large'

finetuner = FineTuner.from_pretrained(hug_path, dataset=new_dataset)
finetuned_model = finetuner.run(epochs=3, batch_size=8, lr=3e-5, output_dir=os.path.basename(hug_path))


Python Version

Python 3.10

twwch commented 1 year ago

The similarity is computed with similarity = cosine_similarity([embedding1], [embedding2]).tolist()
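For context, that call computes the cosine similarity of two embedding vectors. A minimal NumPy sketch that mirrors sklearn.metrics.pairwise.cosine_similarity for a single pair (the toy vectors below are placeholders, not real model embeddings):

```python
import numpy as np

def cosine_similarity_pair(embedding1, embedding2):
    """Cosine similarity of two vectors, mirroring
    sklearn's cosine_similarity([e1], [e2])[0][0]."""
    e1 = np.asarray(embedding1, dtype=float)
    e2 = np.asarray(embedding2, dtype=float)
    # dot product divided by the product of the L2 norms
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

print(cosine_similarity_pair([1.0, 0.0], [1.0, 0.0]))  # identical vectors → 1.0
print(cosine_similarity_pair([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 0.0
```

With well-trained embeddings, unrelated sentences should land well below 1.0; the bug reported here is that after fine-tuning, nearly every pair scores close to 0.99.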

wangyuxinwhy commented 1 year ago

What do you mean by "ability to reject"? The learning rate looks too high; it's better to use one on the 1e-6 scale, e.g. 5e-6.
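Applied to the script above, the suggested fix is a one-line change to the run call (a sketch of the adjustment only; epochs, batch size, and output directory are unchanged from the original script):

```python
# Same call as in the original script, with the learning rate
# lowered from 3e-5 to the suggested 5e-6
finetuned_model = finetuner.run(
    epochs=3,
    batch_size=8,
    lr=5e-6,  # 1e-6-scale learning rate, per the maintainer's advice
    output_dir=os.path.basename(hug_path),
)
```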

twwch commented 1 year ago

> What do you mean by "ability to reject"? The learning rate looks too high; it's better to use one on the 1e-6 scale, e.g. 5e-6.

I mean the similarity is always very high: even for completely unrelated sentences it reaches 0.99.

wangyuxinwhy commented 1 year ago

That does look like a learning-rate problem; try lowering the learning rate.

twwch commented 1 year ago

OK, I'll give it a try.

twwch commented 1 year ago

It was indeed caused by the learning rate being too high. Problem solved, thanks!