kaisugi opened 1 week ago
Note: I found on X (Twitter) that one of the authors (@bwanglzu) has already completed the evaluations 😳 https://x.com/bo_wangbo/status/1838919204377911477
Thank you for the information! I tried to run the evaluation of the model yesterday but didn't succeed. I'm still debugging now.
Great, look forward to the official results 😊
@kaisugi
I tried the model with the fast datasets, but I found that the tasks other than Classification worked better without LoRA than with LoRA. In the Retrieval task, the results were better without prefixes.
The results most similar to https://x.com/bo_wangbo/status/1838919204377911477 are: no prefixes, no LoRA except Classification.
My results are as follows:
no prefixes, no LoRA except Classification
{
"Classification": {
"amazon_counterfactual_classification": {
"macro_f1": 0.7949948725329687
},
"massive_intent_classification": {
"macro_f1": 0.7766347542682803
},
"massive_scenario_classification": {
"macro_f1": 0.8982075621284786
}
},
"Retrieval": {
"jagovfaqs_22k": {
"ndcg@10": 0.7449944044307708
},
"nlp_journal_abs_intro": {
"ndcg@10": 0.9941946751679634
},
"nlp_journal_title_abs": {
"ndcg@10": 0.9717376985433034
},
"nlp_journal_title_intro": {
"ndcg@10": 0.9609029386920315
}
},
"STS": {
"jsick": {
"spearman": 0.8146985042196159
},
"jsts": {
"spearman": 0.8068520872331155
}
},
"Clustering": {
"livedoor_news": {
"v_measure_score": 0.5036707354224619
},
"mewsc16": {
"v_measure_score": 0.474391205388421
}
},
"PairClassification": {
"paws_x_ja": {
"binary_f1": 0.623716814159292
}
}
}
no prefixes, with LoRA
{
"Classification": {
"amazon_counterfactual_classification": {
"macro_f1": 0.7949948725329687
},
"massive_intent_classification": {
"macro_f1": 0.7766347542682803
},
"massive_scenario_classification": {
"macro_f1": 0.8982075621284786
}
},
"Retrieval": {
"jagovfaqs_22k": {
"ndcg@10": 0.7255870901661032
},
"nlp_journal_abs_intro": {
"ndcg@10": 0.9829431790599418
},
"nlp_journal_title_abs": {
"ndcg@10": 0.9552122947731903
},
"nlp_journal_title_intro": {
"ndcg@10": 0.9324205002364649
}
},
"STS": {
"jsick": {
"spearman": 0.7816133481804449
},
"jsts": {
"spearman": 0.8193021839272429
}
},
"Clustering": {
"livedoor_news": {
"v_measure_score": 0.5387525923415666
},
"mewsc16": {
"v_measure_score": 0.43532523021586217
}
},
"PairClassification": {
"paws_x_ja": {
"binary_f1": 0.623716814159292
}
}
}
with prefixes, with LoRA
{
"Classification": {
"amazon_counterfactual_classification": {
"macro_f1": 0.7949948725329687
},
"massive_intent_classification": {
"macro_f1": 0.7766347542682803
},
"massive_scenario_classification": {
"macro_f1": 0.8982075621284786
}
},
"Retrieval": {
"jagovfaqs_22k": {
"ndcg@10": 0.7157443309160252
},
"nlp_journal_abs_intro": {
"ndcg@10": 0.9849100129100982
},
"nlp_journal_title_abs": {
"ndcg@10": 0.9560377251324601
},
"nlp_journal_title_intro": {
"ndcg@10": 0.9372937234643258
}
},
"STS": {
"jsick": {
"spearman": 0.7816133481804449
},
"jsts": {
"spearman": 0.8193021839272429
}
},
"Clustering": {
"livedoor_news": {
"v_measure_score": 0.5313213726075848
},
"mewsc16": {
"v_measure_score": 0.43532523021586217
}
},
"PairClassification": {
"paws_x_ja": {
"binary_f1": 0.623716814159292
}
}
}
LoRA settings (if w/):

- classification: Classification
- text-matching: STS, PairClassification
- separation: Clustering, Reranking
- retrieval.query: Retrieval (when encoding queries)
- retrieval.passage: Retrieval (when encoding documents)

Prefix settings (if w/):
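For context, here is a minimal sketch of how a task name selects the corresponding LoRA adapter at encode time, following the usage shown on the jina-embeddings-v3 model card (the exact keyword arguments may differ between versions):

```python
from transformers import AutoModel

# Load jina-embeddings-v3 together with its custom encoding code.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# `task` picks one of the LoRA adapters listed above, e.g. the asymmetric
# retrieval adapters for queries vs. documents.
query_emb = model.encode(["how to renew a passport"], task="retrieval.query")
doc_emb = model.encode(["Passports can be renewed at ..."], task="retrieval.passage")
```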
Thank you so much for your hard work!
hi @lsz05 @courage I hacked the code a bit to make it work. The things I changed:

src/jmteb/embedders/base.py: in the TextEmbedder class, I added a task parameter to make sure task is correctly sent to the encode function.
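Roughly like this (a simplified sketch, not the exact JMTEB interface):

```python
# src/jmteb/embedders/base.py (simplified sketch; the real class has more options)
import numpy as np


class TextEmbedder:
    """Base class for JMTEB embedders."""

    def encode(self, text: str | list[str], task: str | None = None) -> np.ndarray:
        # `task` is forwarded so that model-specific subclasses (e.g. for
        # jina-embeddings-v3) can select the matching LoRA adapter.
        raise NotImplementedError
```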
src/jmteb/embedders/sbert_embedder.py: in the SentenceBertEmbedder class, I changed max_seq_length to 512 since some of the tasks (MrTidy) are too slow, and I added prompt_name and task to the encode function. prompt_name is set to the identical task as defined here; we use 2 instructions for the retrieval adapter.
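Something like this (a sketch assuming a recent sentence-transformers that forwards extra encode kwargs such as task to the model, as the jina-v3 model card shows):

```python
# src/jmteb/embedders/sbert_embedder.py (simplified sketch)
import numpy as np
from sentence_transformers import SentenceTransformer


class SentenceBertEmbedder:
    def __init__(self, model_name_or_path: str, max_seq_length: int = 512):
        self.model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
        # Cap the sequence length at 512; some tasks (e.g. MrTidy) are too slow otherwise.
        self.model.max_seq_length = max_seq_length

    def encode(self, text: str | list[str], task: str | None = None) -> np.ndarray:
        # `prompt_name` selects the instruction, `task` selects the LoRA adapter.
        return self.model.encode(text, prompt_name=task, task=task)
```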
/src/jmteb/evaluators/retrieval/evaluator.py: only for the Retrieval task, I modified it to send a different task during indexing and searching, as in the sketch below.
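A hypothetical helper for illustration (not the actual evaluator code):

```python
import numpy as np


def encode_for_retrieval(
    embedder, queries: list[str], corpus: list[str]
) -> tuple[np.ndarray, np.ndarray]:
    # Queries and documents go through different LoRA adapters, mirroring
    # the change described above for the retrieval evaluator.
    query_emb = embedder.encode(queries, task="retrieval.query")
    doc_emb = embedder.encode(corpus, task="retrieval.passage")
    return query_emb, doc_emb
```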
I agree my code is a bit "dirty" as I only wanted to quickly check the results :) hopefully you understand. If I missed anything in your code base that results in a different eval result, please let me know :)
But I'm also quite surprised (in a good way) that your score is better than what I reported lol :) Maybe there is something wrong in my code, but at least it's not worse. For mewsc16 clustering I noticed my score is higher than yours; this is what I have:
{
"metric_name": "v_measure_score",
"metric_value": 0.4966872142615049,
"details": {
"optimal_clustering_model_name": "AgglomerativeClustering",
"val_scores": {
"MiniBatchKMeans": {
"v_measure_score": 0.4573582252706992,
"homogeneity_score": 0.49434785878175236,
"completeness_score": 0.425518738350574
},
"AgglomerativeClustering": {
"v_measure_score": 0.5159727698724647,
"homogeneity_score": 0.558382996336062,
"completeness_score": 0.47955005205722434
},
"BisectingKMeans": {
"v_measure_score": 0.45289840369081835,
"homogeneity_score": 0.4964330478306176,
"completeness_score": 0.4163836804409629
},
"Birch": {
"v_measure_score": 0.4943869746128702,
"homogeneity_score": 0.5396066604339305,
"completeness_score": 0.4561602021543821
}
},
"test_scores": {
"AgglomerativeClustering": {
"v_measure_score": 0.4966872142615049,
"homogeneity_score": 0.5340024254176485,
"completeness_score": 0.4642464368511074
}
}
}
}
I think I'm doing the same thing as you in #80
I think I'll have to fix some randomness problems (e.g., fixing the random seed in training to make sure everything can be exactly reproduced) in Clustering and Classification (where training is conducted). Since the method that works best on the dev set is chosen, and in my case Birch worked slightly better on dev but not so well on test, the test score is not as high as in your eval.
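On the clustering side, one way to pin this down is to fix random_state on the stochastic models; a sketch with illustrative values (the model names follow the JSON above, the seed and cluster count are hypothetical):

```python
from sklearn.cluster import AgglomerativeClustering, Birch, BisectingKMeans, MiniBatchKMeans

SEED = 42        # hypothetical fixed seed
N_CLUSTERS = 12  # illustrative; in practice set from the dataset's label count

# AgglomerativeClustering and Birch are deterministic; the two k-means
# variants need random_state pinned for exact reproducibility.
models = {
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=N_CLUSTERS, random_state=SEED),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=N_CLUSTERS),
    "BisectingKMeans": BisectingKMeans(n_clusters=N_CLUSTERS, random_state=SEED),
    "Birch": Birch(n_clusters=N_CLUSTERS),
}
```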
My result is as follows:
I think your PR looks good, maybe two things:

- model.half() to make it a bit faster (see the sketch below).
- I'm not sure why using LoRA makes the performance a bit worse than without LoRA (for example, on STS). Using LoRA is always my default choice :)
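For reference, the fp16 suggestion is just the standard torch cast (SentenceTransformer is a torch.nn.Module; this assumes GPU inference, since fp16 matmuls are not supported on CPU):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True, device="cuda")
model.half()  # cast weights to fp16: faster encoding, tiny accuracy cost

embeddings = model.encode(["quick smoke test"])
print(embeddings.shape)
```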
One small thing to note: prefix is only applied to Retrieval, not to the other tasks.
btw have you considered moving JMTEB to the official MTEB leaderboard? This would greatly simplify your work.
Basic model information

name: jina-embeddings-v3
type: XLMRoBERTa (+ LoRA adapters)
size: 559M (572M including the LoRA adapters)
lang: multilingual
Model details
https://arxiv.org/abs/2409.10173 https://huggingface.co/jinaai/jina-embeddings-v3
Seen/unseen declaration
Among the JMTEB evaluation datasets, please indicate any datasets whose training split was used for model training, or whose validation set was used for hyperparameter tuning or early stopping.
Evaluation script
Other information