[EVAL REQUEST] jina-embeddings-v3

kaisugi commented 1 week ago

モデルの基本情報

name: jina-embeddings-v3 type: XLMRoBERTa (+ LoRA Adapter) size: 559M (LoRA Adapterを加えると572M) lang: multilingual

モデル詳細

https://arxiv.org/abs/2409.10173 https://huggingface.co/jinaai/jina-embeddings-v3

seen/unseen申告

JMTEBの評価データセットの中，training splitをモデル学習に使用した，またはvalidation setとして，ハイパラチューニングやearly stoppingに使用したデータセット名をチェックしてください。

Classification
- [ ] Amazon Review Classification
- [ ] Amazon Counterfactual Classification
- [ ] Massive Intent Classification
- [ ] Massive Scenario Classification
Clustering
- [ ] Livedoor News
- [ ] MewsC-16-ja
STS
- [ ] JSTS
- [ ] JSICK
Pair Classification
- [ ] PAWS-X-ja
Retrieval
- [ ] JAQKET
- [x] Mr.TyDi-ja (The original English version seems to have been used)
- [ ] JaGovFaqs-22k
- [ ] NLP Journal title-abs
- [ ] NLP Journal title-intro
- [ ] NLP Journal abs-intro
Reranking
- [ ] Esci
[ ] 申告しません

評価スクリプト

その他の情報

kaisugi commented 2 days ago

Note: I found on X(Twitter) that one of the authors (@bwanglzu ) has already completed the evaluations 😳 https://x.com/bo_wangbo/status/1838919204377911477

lsz05 commented 1 day ago

Thank you for the information! I managed to run the evaluation of the model yesterday, but didn't succeed. Keeping debugging now.

kaisugi commented 1 day ago

Great, look forward to the official results 😊

lsz05 commented 16 hours ago

@kaisugi

I tried the model with fast datasets, but I found that tasks except Classification worked better without LoRA than with LoRA. In Retrieval task, the results were better without prefixes.

The results that is the most similar with https://x.com/bo_wangbo/status/1838919204377911477 are, no prefixes, no LoRA except Classification.

My results are as below:

no prefixes, no LoRA except Classification

{
"Classification": {
    "amazon_counterfactual_classification": {
        "macro_f1": 0.7949948725329687
    },
    "massive_intent_classification": {
        "macro_f1": 0.7766347542682803
    },
    "massive_scenario_classification": {
        "macro_f1": 0.8982075621284786
    }
},
"Retrieval": {
    "jagovfaqs_22k": {
        "ndcg@10": 0.7449944044307708
    },
    "nlp_journal_abs_intro": {
        "ndcg@10": 0.9941946751679634
    },
    "nlp_journal_title_abs": {
        "ndcg@10": 0.9717376985433034
    },
    "nlp_journal_title_intro": {
        "ndcg@10": 0.9609029386920315
    }
},
"STS": {
    "jsick": {
        "spearman": 0.8146985042196159
    },
    "jsts": {
        "spearman": 0.8068520872331155
    }
},
"Clustering": {
    "livedoor_news": {
        "v_measure_score": 0.5036707354224619
    },
    "mewsc16": {
        "v_measure_score": 0.474391205388421
    }
},
"PairClassification": {
    "paws_x_ja": {
        "binary_f1": 0.623716814159292
    }
}
}

no prefixes, with LoRA

{
"Classification": {
    "amazon_counterfactual_classification": {
        "macro_f1": 0.7949948725329687
    },
    "massive_intent_classification": {
        "macro_f1": 0.7766347542682803
    },
    "massive_scenario_classification": {
        "macro_f1": 0.8982075621284786
    }
},
"Retrieval": {
    "jagovfaqs_22k": {
        "ndcg@10": 0.7255870901661032
    },
    "nlp_journal_abs_intro": {
        "ndcg@10": 0.9829431790599418
    },
    "nlp_journal_title_abs": {
        "ndcg@10": 0.9552122947731903
    },
    "nlp_journal_title_intro": {
        "ndcg@10": 0.9324205002364649
    }
},
"STS": {
    "jsick": {
        "spearman": 0.7816133481804449
    },
    "jsts": {
        "spearman": 0.8193021839272429
    }
},
"Clustering": {
    "livedoor_news": {
        "v_measure_score": 0.5387525923415666
    },
    "mewsc16": {
        "v_measure_score": 0.43532523021586217
    }
},
"PairClassification": {
    "paws_x_ja": {
        "binary_f1": 0.623716814159292
    }
}
}

with prefixes, with LoRA

{
"Classification": {
    "amazon_counterfactual_classification": {
        "macro_f1": 0.7949948725329687
    },
    "massive_intent_classification": {
        "macro_f1": 0.7766347542682803
    },
    "massive_scenario_classification": {
        "macro_f1": 0.8982075621284786
    }
},
"Retrieval": {
    "jagovfaqs_22k": {
        "ndcg@10": 0.7157443309160252
    },
    "nlp_journal_abs_intro": {
        "ndcg@10": 0.9849100129100982
    },
    "nlp_journal_title_abs": {
        "ndcg@10": 0.9560377251324601
    },
    "nlp_journal_title_intro": {
        "ndcg@10": 0.9372937234643258
    }
},
"STS": {
    "jsick": {
        "spearman": 0.7816133481804449
    },
    "jsts": {
        "spearman": 0.8193021839272429
    }
},
"Clustering": {
    "livedoor_news": {
        "v_measure_score": 0.5313213726075848
    },
    "mewsc16": {
        "v_measure_score": 0.43532523021586217
    }
},
"PairClassification": {
    "paws_x_ja": {
        "binary_f1": 0.623716814159292
    }
}
}

LoRA settings (if w/):

classification: Classification
text-matching: STS, PairClassification
saparation: Clustering, Reranking
retrieval.query: Retrieval (when encoding queries)
retrieval.passage: Retrieval (when encoding documents)

Prefix settings (if w/):

for queries in Retrieval: https://huggingface.co/jinaai/jina-embeddings-v3/blob/54da5c0cbd10718cb32b8925c9607a53c5eec8d2/config_sentence_transformers.json#L8
for documents in Retrieval: https://huggingface.co/jinaai/jina-embeddings-v3/blob/54da5c0cbd10718cb32b8925c9607a53c5eec8d2/config_sentence_transformers.json#L9
for other tasks: no prefix

kaisugi commented 16 hours ago

Thank you so much for your hard work!

bwanglzu commented 15 hours ago

hi @lsz05 @courage i hacked a bit the code to make it work the thing i changed:

src/jmteb/embedders/base.py

in the TextEmbeder class, i added a task parameter to make sure task is correctly send to the encode function.

1E08FB06-F07A-4F5F-A69B-776F9CB665E1

src/jmteb/embedders/sbert_embedder.py

in the SentenceBertEmbedder class, i chanted max_seq_length to 512 since some of the task (MrTidy) is too slow, and i added prompt_name and task into encode function, prompt_name is set to identical task as defined here: we use 2 instructions for retrieval adapter.

84C90704-790D-4429-9A03-657350FF6DE3

only running retrieval task, i modified /src/jmteb/evaluators/retrieval/evaluator.py to send different task during indexing and searching:

86295151-F230-4B2D-B2F0-575272DCF5AB

I agree my code is a bit "dirty" as i only want to quickly check the results :) hopefully you understand. If i missed anything in your code base result in a different eval result please let me know :)

bwanglzu commented 15 hours ago

but i'm also quite surprised (in a good way) your score is better than what i reported lol :) maybe there is something wrong in my code, but at least not worse , for mewsc16 clustering i noticed my score is higher than yours, this is what i have:

{
    "metric_name": "v_measure_score",
    "metric_value": 0.4966872142615049,
    "details": {
        "optimal_clustering_model_name": "AgglomerativeClustering",
        "val_scores": {
            "MiniBatchKMeans": {
                "v_measure_score": 0.4573582252706992,
                "homogeneity_score": 0.49434785878175236,
                "completeness_score": 0.425518738350574
            },
            "AgglomerativeClustering": {
                "v_measure_score": 0.5159727698724647,
                "homogeneity_score": 0.558382996336062,
                "completeness_score": 0.47955005205722434
            },
            "BisectingKMeans": {
                "v_measure_score": 0.45289840369081835,
                "homogeneity_score": 0.4964330478306176,
                "completeness_score": 0.4163836804409629
            },
            "Birch": {
                "v_measure_score": 0.4943869746128702,
                "homogeneity_score": 0.5396066604339305,
                "completeness_score": 0.4561602021543821
            }
        },
        "test_scores": {
            "AgglomerativeClustering": {
                "v_measure_score": 0.4966872142615049,
                "homogeneity_score": 0.5340024254176485,
                "completeness_score": 0.4642464368511074
            }
        }
    }
}

lsz05 commented 15 hours ago

hi @lsz05 @courage i hacked a bit the code to make it work the thing i changed:

src/jmteb/embedders/base.py

in the TextEmbeder class, i added a task parameter to make sure task is correctly send to the encode function.

src/jmteb/embedders/sbert_embedder.py

in the SentenceBertEmbedder class, i chanted max_seq_length to 512 since some of the task (MrTidy) is too slow, and i added prompt_name and task into encode function, prompt_name is set to identical task as defined here: we use 2 instructions for retrieval adapter.

only running retrieval task, i modified /src/jmteb/evaluators/retrieval/evaluator.py to send different task during indexing and searching:

I agree my code is a bit "dirty" as i only want to quickly check the results :) hopefully you understand. If i missed anything in your code base result in a different eval result please let me know :)

I think I'm doing the same thing as you in #80

lsz05 commented 15 hours ago

but i'm also quite surprised (in a good way) your score is better than what i reported lol :) maybe there is something wrong in my code, but at least not worse , for mewsc16 clustering i noticed my score is higher than yours, this is what i have:

{
    "metric_name": "v_measure_score",
    "metric_value": 0.4966872142615049,
    "details": {
        "optimal_clustering_model_name": "AgglomerativeClustering",
        "val_scores": {
            "MiniBatchKMeans": {
                "v_measure_score": 0.4573582252706992,
                "homogeneity_score": 0.49434785878175236,
                "completeness_score": 0.425518738350574
            },
            "AgglomerativeClustering": {
                "v_measure_score": 0.5159727698724647,
                "homogeneity_score": 0.558382996336062,
                "completeness_score": 0.47955005205722434
            },
            "BisectingKMeans": {
                "v_measure_score": 0.45289840369081835,
                "homogeneity_score": 0.4964330478306176,
                "completeness_score": 0.4163836804409629
            },
            "Birch": {
                "v_measure_score": 0.4943869746128702,
                "homogeneity_score": 0.5396066604339305,
                "completeness_score": 0.4561602021543821
            }
        },
        "test_scores": {
            "AgglomerativeClustering": {
                "v_measure_score": 0.4966872142615049,
                "homogeneity_score": 0.5340024254176485,
                "completeness_score": 0.4642464368511074
            }
        }
    }
}

I think I'll have to fix some randomness problems (e.g., fix the random seed in training to make sure everything can be exactly reproduced) in Clustering and Classification (where training is conducted). As the method that works best in dev set will be chosen, in my case Birch worked slightly better in dev but not so well in test, as a result the test score is not as high as your eval.

My result is as following

```json { "metric_name": "v_measure_score", "metric_value": 0.474391205388421, "details": { "optimal_clustering_model_name": "Birch", "val_scores": { "MiniBatchKMeans": { "v_measure_score": 0.45751218122353327, "homogeneity_score": 0.5000149261766943, "completeness_score": 0.42166906571540486 }, "AgglomerativeClustering": { "v_measure_score": 0.4884748969401506, "homogeneity_score": 0.5211802377702618, "completeness_score": 0.45963186760591423 }, "BisectingKMeans": { "v_measure_score": 0.4051884446721869, "homogeneity_score": 0.4429226569148086, "completeness_score": 0.3733789195189944 }, "Birch": { "v_measure_score": 0.48868192903235214, "homogeneity_score": 0.529365428957467, "completeness_score": 0.45380546454681364 } }, "test_scores": { "Birch": { "v_measure_score": 0.474391205388421, "homogeneity_score": 0.5112647214750645, "completeness_score": 0.44247868671235824 } } } } ```

bwanglzu commented 14 hours ago

i think your PR looks good, maybe two things:

i'm using model.half() to make it a bit faster.
seq length set to 512 to make it a bit faster

I'm not sure why using LoRA make the performance a bit worse than w.o. LoRA (for example, on STS). Using LoRA is always my default choice :)

one small thing to notice is prefix is only applied to Retrieval, not other tasks.

bwanglzu commented 14 hours ago

btw have you considered to move JMTEB to the official MTEB leaderboard, this will greatly simplify your work.

sbintuitions / JMTEB