Closed Nipi64310 closed 1 year ago
I noticed M3E also added the test sets of BQ, LCQMC, and PAWSX to its training data. Comparing on those benchmarks may not be fair.
Hello @hjq133 , but in actual testing, m3e-base performs rather poorly on these test sets. The test script is below, adapted from https://github.com/bojone/BERT-whitening/tree/main/chn
```python
import numpy as np
import scipy.stats
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

dataset = load_dataset("shibing624/nli_zh", "BQ")  # ATEC, BQ, LCQMC, PAWSX, or STS-B
print(dataset)
print(dataset['test'][:2])

model_path = 'moka-ai/m3e-base'
model = SentenceTransformer(model_path, device='cuda')

def convert_to_vecs(data):
    """Encode the sentence pairs into vectors."""
    a_vecs = model.encode(data['sentence1'], batch_size=32)
    b_vecs = model.encode(data['sentence2'], batch_size=32)
    return a_vecs, b_vecs, np.array(data['label'])

a_vecs, b_vecs, labels = convert_to_vecs(dataset['test'])

def transform_and_normalize(vecs, kernel=None, bias=None):
    """Apply the (optional) whitening transform, then L2-normalize."""
    if not (kernel is None or bias is None):
        vecs = (vecs + bias).dot(kernel)
    norms = (vecs ** 2).sum(axis=1, keepdims=True) ** 0.5
    return vecs / np.clip(norms, 1e-8, np.inf)

def compute_corrcoef(x, y):
    """Spearman rank correlation."""
    return scipy.stats.spearmanr(x, y).correlation

# Transform, normalize, cosine similarity, correlation
all_corrcoefs = []
a_vecs = transform_and_normalize(a_vecs)
b_vecs = transform_and_normalize(b_vecs)
sims = (a_vecs * b_vecs).sum(axis=1)
corrcoef = compute_corrcoef(labels, sims)
all_corrcoefs.append(corrcoef)
print(all_corrcoefs)
# [0.6381030399066687]
```
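Note that the script above always calls `transform_and_normalize` without a `kernel` or `bias`, so no whitening is actually applied; the scores are plain cosine similarities of raw embeddings. For reference, the BERT-whitening recipe the script was adapted from derives those two arguments from the encoded vectors themselves. A minimal NumPy sketch of that step (function and variable names are my own, and random vectors stand in for real sentence embeddings):

```python
import numpy as np

def compute_kernel_bias(vecs):
    """BERT-whitening style: find W and -mu such that
    (vecs + bias) @ W has zero mean and (approximately)
    identity covariance."""
    mu = vecs.mean(axis=0, keepdims=True)
    cov = np.cov(vecs.T)
    u, s, _ = np.linalg.svd(cov)
    kernel = u @ np.diag(1.0 / np.sqrt(s))
    return kernel, -mu

# Random correlated vectors standing in for sentence embeddings
rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))

kernel, bias = compute_kernel_bias(vecs)
whitened = (vecs + bias) @ kernel
print(np.allclose(np.cov(whitened.T), np.eye(8), atol=1e-6))  # True
```

The resulting `kernel` and `bias` could then be passed into `transform_and_normalize` above to reproduce the whitened variant of the evaluation.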
This is the evaluation result from text2vec that I saw earlier.
As for why we did not use text2vec's evaluation sets, there were two main considerations.
For example, take the SOHU dataset in the image above: text2vec was trained on that dataset, while M3E has never seen it, so such a comparison is clearly problematic. That is also why the text2vec author removed that row in the new README.
Your comparison isn't quite right. Those are the results of taking the pretrained model and fine-tuning it separately on each dataset, so you would need to train M3E on each dataset's train split before evaluating on it. https://github.com/shibing624/text2vec/issues/51
Correct, you should refer to the chart I posted in my reply above. Also, if you want to know which model fits better, testing directly on your own scenario is the most reliable approach.
Thanks for the reply.
🚀 The feature
Hi @wangyuxinwhy , why does the mteb-zh evaluation use different test sets from text2vec's? Is it because most of M3E's training data is in QA form, with short questions and long answers, while similarity datasets lean toward shorter question-question pairs?
For example, these test sets: https://github.com/bojone/BERT-whitening/tree/main/chn