gomate-community / rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.
Apache License 2.0

Feature/batch compute #110

Open bugtig6351 opened 5 days ago

bugtig6351 commented 5 days ago
  1. Modify the `Metric` class: make `_compute_one` the abstract method and provide a default implementation of `_compute_batch`.
  2. Rewrite the F1, ChrF, and Ter metrics.
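
As an illustration of the new contract, a concrete metric would only need to provide `_compute_one`. The `ExactMatch` class, its `name` attribute, and the import path below are assumptions made for this sketch, not code from the PR:

```python
from rageval.metrics.base import Metric  # assumed import path, based on the files touched by this PR


class ExactMatch(Metric):
    """Hypothetical metric used only to illustrate the new contract."""

    name = "exact_match"  # assuming the base class reads a `name` attribute (e.g. for progress bars)

    def _compute_one(self, pred_answer: str, ref_answers: list) -> float:
        # 1.0 if the prediction matches any reference exactly, else 0.0
        return float(any(pred_answer == ref for ref in ref_answers))
```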
codecov[bot] commented 4 days ago

Codecov Report

Attention: Patch coverage is 94.33962% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.21%. Comparing base (c253114) to head (95f6cd4).

| Files | Patch % | Lines |
|---|---|---|
| rageval/metrics/_context_recall.py | 66.66% | 1 Missing :warning: |
| rageval/metrics/_context_reject_rate.py | 50.00% | 1 Missing :warning: |
| rageval/metrics/base.py | 85.71% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
- Coverage   81.45%   81.21%   -0.25%     
==========================================
  Files          33       33              
  Lines        1170     1139      -31     
==========================================
- Hits          953      925      -28     
+ Misses        217      214       -3     
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

bugtig6351 commented 4 days ago

Main changes:

  1. Since most metrics differ only in the `_compute_one()` method and the final results are aggregated with `np.average(scores)`, I use the following `_compute_batch()` as the default implementation in the `Metric` class, with `_compute_one()` as an abstract method that each metric must implement individually.
    ```python
    def _compute_batch(
        self,
        pred_answers: Optional[Iterable] = None,
        ref_answers: Optional[Iterable] = None,
        *args: Optional[Iterable]
    ) -> List[float]:
        """Compute the metric for a batch of predictions and references."""
        scores = []
        for pred, refs in tqdm(zip(pred_answers, ref_answers),
                               desc=f"Computing {self.name}",
                               total=len(pred_answers)):
            scores.append(self._compute_one(pred, refs))
        return scores
    ```
  2. I have rewritten the F1 metric. It can now compute the F1 score between any two iterable sequences, such as sequences of integers or tuples. By default, string inputs are handled the original way: articles and punctuation are removed and the text is tokenized on spaces, so this path does not support Chinese. Whether string inputs go through this original normalization is controlled by `normalize=True` (a standalone sketch of sequence-level F1 follows this list).
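
For reference, here is a minimal, self-contained sketch of bag-of-tokens F1 over two arbitrary sequences; the `token_f1` function is an illustration, not the PR's actual implementation:

```python
from collections import Counter
from typing import Iterable


def token_f1(pred_tokens: Iterable, ref_tokens: Iterable) -> float:
    """Bag-of-tokens F1 between two sequences of hashable items (ints, tuples, words, ...)."""
    pred_counts, ref_counts = Counter(pred_tokens), Counter(ref_tokens)
    # Tokens shared between prediction and reference, counted with multiplicity.
    num_same = sum((pred_counts & ref_counts).values())
    if num_same == 0:
        return 0.0
    precision = num_same / sum(pred_counts.values())
    recall = num_same / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Works on integer sequences ...
print(token_f1([1, 2, 3, 4], [2, 3, 5]))  # 0.571...
# ... or on whitespace-tokenized strings (roughly what the normalize=True path does for English).
print(token_f1("the cat sat".split(), "a cat sat down".split()))
```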

There are two open issues:

  1. The `MetricWithLLM` class involves model calls. The current approach is to pack all the data together and delegate the batch operation to the `BaseLLM.generate()` method (see the sketch after this list).
  2. Some metrics, such as BLEU and ChrF, do not aggregate per-sample results with a simple average; they treat all samples as a corpus and compute the final score from the answer corpus and the reference corpus. Currently, the corpus score and the per-sample scores can only be computed separately (illustrated below).
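
For the first point, a rough sketch of what delegating the batch to the LLM could look like; `_build_prompt`, `_parse_score`, and the `BaseLLM.generate(prompts)` signature are all assumptions for illustration, not rageval's actual API:

```python
from typing import Iterable, List


def _compute_batch(self, pred_answers: Iterable, ref_answers: Iterable) -> List[float]:
    """Hypothetical MetricWithLLM batch path: build every prompt, then one generate() call."""
    prompts = [
        self._build_prompt(pred, refs)  # assumed helper that renders the metric's prompt template
        for pred, refs in zip(pred_answers, ref_answers)
    ]
    outputs = self.llm.generate(prompts)  # assumed: BaseLLM.generate handles the whole batch
    return [self._parse_score(output) for output in outputs]  # assumed parser for model outputs
```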
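
To illustrate the second point with sacrebleu (a common backend for these metrics; whether rageval wires it this way is an assumption), corpus-level and per-sentence scores come from different calls, and the corpus score is not the average of the per-sentence scores:

```python
import sacrebleu

hyps = ["the cat sat on the mat", "there is a dog"]
# One list of references per sample.
refs = [["the cat is on the mat"], ["there is a dog outside"]]

# Corpus-level score: n-gram statistics are pooled over all samples before being combined.
corpus = sacrebleu.corpus_bleu(hyps, [list(r) for r in zip(*refs)])

# Per-sample scores: one call per sentence.
per_sample = [sacrebleu.sentence_bleu(h, r).score for h, r in zip(hyps, refs)]

# Averaging per_sample generally does NOT reproduce corpus.score,
# which is why the default averaging _compute_batch does not fit these metrics.
print(corpus.score, per_sample)
```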