VecherVhatuX opened 7 months ago
We're encountering retrieval results that do not match the stated metrics, specifically when evaluating at the 27k context size with k=50, and across different values of k at the 13k context size. The discrepancies show up in the average, all, and any recall metrics, which deviate from the expected BM25 recall values.
The observed metrics are as follows:
Metric | Observed Value |
---|---|
Avg Recall | 36.54 |
All Recall | 32.26 |
Any Recall | 42.95 |
Compared to the expected BM25 recall metrics:
Context Size | Avg Recall | All Recall | Any Recall |
---|---|---|---|
13k | 29.58 | 26.09 | 34.77 |
27k | 44.41 | 39.83 | 51.27 |
50k | 51.06 | 45.90 | 58.38 |
Counterintuitively, the metric values increase as k decreases, even though fewer files can be accommodated.
At k=10:
Metric | Value |
---|---|
Avg Recall | 24.21 |
All Recall | 21.17 |
Any Recall | 29.11 |
At k=50:
Metric | Value |
---|---|
Avg Recall | 22.71 |
All Recall | 19.78 |
Any Recall | 27.50 |
At k=3 (most instances are not counted because they have no retrieved files):
Metric | Value |
---|---|
Avg Recall | 29.53 |
All Recall | 25.92 |
Any Recall | 35.09 |
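For reference, here is a minimal sketch of how I am interpreting the three recall variants; the function and field names are my own assumptions, not the repository's code. If instances with no retrieved files are dropped from the denominator, that alone could explain why the averages rise as k shrinks.

```python
# A minimal sketch of the three recall variants as I understand them;
# function and field names are my assumptions, not the repository's code.
def recall_metrics(instances):
    """instances: iterable of dicts with 'gold' and 'retrieved' file-path lists."""
    per_instance, all_hits, any_hits, counted = [], 0, 0, 0
    for inst in instances:
        gold, retrieved = set(inst["gold"]), set(inst["retrieved"])
        if not gold or not retrieved:
            # If instances with no retrieved files are skipped here,
            # the denominator shrinks and the averages rise at small k.
            continue
        counted += 1
        hits = len(gold & retrieved)
        per_instance.append(hits / len(gold))   # per-instance recall
        all_hits += int(hits == len(gold))      # every gold file retrieved
        any_hits += int(hits > 0)               # at least one gold file retrieved
    if counted == 0:
        return {}
    return {                                    # reported as percentages
        "avg_recall": 100 * sum(per_instance) / counted,
        "all_recall": 100 * all_hits / counted,
        "any_recall": 100 * any_hits / counted,
    }
```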
Additionally, there are instances with missing gold files, indicated by warnings during the retrieval process; examples include django__django-15272 and sympy__sympy-18667.
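To make these warnings easier to pin down, this is the kind of check I ran; the diff-header parsing and the `corpus_files()` helper are my own assumptions about how the indexed file list could be obtained, not the repository's code.

```python
# Hypothetical check for instances whose gold files never appear in the
# retrieval corpus. "patch" and "instance_id" follow the SWE-bench dataset
# schema; corpus_files() is a placeholder for however the indexed file
# list is obtained in the retrieval script.
import re

def gold_files(instance):
    """File paths touched by the gold patch, taken from its diff headers."""
    return set(re.findall(r"^diff --git a/(\S+)", instance["patch"], re.MULTILINE))

def report_missing_gold(instances, corpus_files):
    for inst in instances:
        missing = gold_files(inst) - set(corpus_files(inst))
        if missing:
            print(f"{inst['instance_id']}: missing gold files {sorted(missing)}")
```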
The test sample for this evaluation is derived from the provided test dataset.
We kindly ask for an investigation into these discrepancies. We believe addressing these points will greatly improve the accuracy and reliability of the retrieval process and bring it closer to the expected outcomes. Thank you for your attention to these matters.
Tagging @carlosejimenez to address this.
A related question: why is recall reported as a number larger than 1? For example, what does 29.58 mean for the 13k, Avg, BM25 recall?
@dayuyang1999 Oh I think those are just percentages (29.58%, not an absolute value). We should've put the percentage signs there.
I've encountered issues while trying to reproduce the BM25 results mentioned in the documentation. Specifically, I've faced the following challenges:
- How does the script handle files with more context than the tokenizer can support? Is there a filtering mechanism in place to manage such instances?
- Could you provide more details on how the parameter `k` is used in the script and its impact on the results? (See the sketch below for how I currently imagine it works.)

I would appreciate any guidance or suggestions on how to address these issues to achieve the expected BM25 results.
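To make the question about `k` concrete, below is how I currently imagine the script combines the top-k hits with the context budget. This is purely a guess, not taken from the repository; confirming whether oversized files are skipped, truncated, or cause the instance to be dropped would answer most of it.

```python
# Purely my assumption about how k and the token budget might interact,
# not taken from the repository: only the top-k BM25 hits are eligible,
# and files are added until the 13k/27k/50k token budget is exhausted.
def build_context(ranked_files, count_tokens, k, max_tokens):
    """ranked_files: list of (path, text) pairs sorted by BM25 score."""
    kept, used = [], 0
    for path, text in ranked_files[:k]:
        n = count_tokens(text)
        if used + n > max_tokens:
            continue  # or break, or truncate -- this is exactly what is unclear
        kept.append(path)
        used += n
    return kept
```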
Moreover, the tokenizer is recreated for each instance rather than being kept in memory, which seems inefficient, and the tokenization process does not appear to be parallelized. As a result, processing is slow, and when I ran the test dataset overnight, the scores were lower than expected.
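As a rough illustration of the change I have in mind (the tokenizer name and function structure are placeholders, not the script's actual code), the tokenizer could be built once per process and documents tokenized in parallel:

```python
# Illustrative only: build the tokenizer once per process and tokenize in
# parallel, instead of re-creating it for every instance. The Hugging Face
# tokenizer name is a placeholder for whatever the script actually uses.
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=1)
def get_tokenizer(name="bert-base-uncased"):
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name)

def count_tokens(text):
    return len(get_tokenizer()(text)["input_ids"])

def count_tokens_parallel(texts, workers=8):
    # Each worker process constructs its cached tokenizer once and reuses it.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_tokens, texts, chunksize=64))
```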