liamdugan / raid

RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
https://raid-bench.xyz/
MIT License

Error running run_evaluation() function #10

Closed: new-player closed this issue 1 month ago

new-player commented 1 month ago

Describe the bug
Error while running evaluations using run_evaluation()

To Reproduce
Steps to reproduce the behavior:

  1. Try running the sample code. It will error out

Minimum reproducing code snippet:

predictions = run_detection(detector.score, train_df)
evaluation_result = run_evaluation(predictions, train_df)  # <-- errors out

Expected behavior
The issue is caused by the line return df.join(scores_df.set_index("id"), on="id", validate="one_to_one") in evaluate.py. Both df and scores_df contain a "score" column, so the join fails.

Additional context
Stack trace:

  File "/Users/siddharthjhanwar/ai-content-detector/raid_benchmark/benchmark.py", line 138, in <module>
    evaluation_result = run_evaluation(predictions, train_df)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 167, in run_evaluation
    df = load_detection_result(df, results)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 11, in load_detection_result
    return df.join(scores_df.set_index("id"), on="id", validate="one_to_one")
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 9729, in join
    return self._join_compat(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 9768, in _join_compat
    return merge(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 162, in merge
    return op.get_result(copy=copy)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 811, in get_result
    result = self._reindex_and_concat(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 763, in _reindex_and_concat
    llabels, rlabels = _items_overlap_with_suffix(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 2604, in _items_overlap_with_suffix
    raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
ValueError: columns overlap but no suffix specified: Index(['score'], dtype='object')

new-player commented 1 month ago

@liamdugan Can you please check this?

liamdugan commented 1 month ago

Fixed. The issue was caused by run_detection editing the dataframe in place. I've pushed a new pip version; update to 0.0.7 to get the fix.
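For illustration, here is a minimal pandas sketch of the failure mode (the function names are hypothetical stand-ins, not the actual raid internals): writing the "score" column onto the caller's dataframe means both frames carry "score" when the results are joined back, which is exactly the "columns overlap but no suffix specified" error above.

import pandas as pd

# Hypothetical stand-in for the buggy behavior: the detection step writes its
# "score" column directly onto the caller's dataframe instead of onto a copy.
def buggy_run_detection(df, scores):
    df["score"] = scores  # mutates the caller's dataframe in place
    return pd.DataFrame({"id": df["id"], "score": scores})

df = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})
scores_df = buggy_run_detection(df, [0.9, 0.1])

# Joining the scores back on "id" now fails, because df was mutated and both
# frames contain a "score" column:
# df.join(scores_df.set_index("id"), on="id", validate="one_to_one")
# -> ValueError: columns overlap but no suffix specified

# The fix is to build the result from a copy so the caller's dataframe is untouched.
def fixed_run_detection(df, scores):
    out = df[["id"]].copy()
    out["score"] = scores
    return out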

liamdugan commented 1 month ago

@new-player please let me know if this fix works for you

new-player commented 1 month ago

That was fixed, but I'm getting this error now. Can you check this as well?

Stack trace:

Traceback (most recent call last):
  File "/Users/siddharthjhanwar/ai-content-detector/raid_benchmark/benchmark.py", line 136, in <module>
    evaluation_result = run_evaluation(predictions, train_df)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 174, in run_evaluation
    thresholds, fprs = compute_thresholds(df, target_fpr, epsilon, per_domain_tuning)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 80, in compute_thresholds
    t, true_fpr = find_threshold(df[df["domain"] == d], fpr, epsilon)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 30, in find_threshold
    threshold = sum(y_scores) / len(y_scores)  # initialize threshold to mean of y_scores
ZeroDivisionError: division by zero

Reference code:


class CustomDebertaV3Large:
    ...
    def score(self, text_list):
        results = []

        for text in tqdm(text_list, desc="Scoring"):
            # Tokenize one text at a time and move the tensors to the model's device
            model_input = self.tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(self.device)
            # Run the model and convert the logits to a probability score
            with torch.no_grad():
                logits = self.model(**model_input).logits
                probs = torch.sigmoid(logits)
            results.append(probs.squeeze().item())

        return results

detector = CustomDebertaV3Large(device="mps")

# Download & Load the RAID dataset
train_df = load_data(split="train", include_adversarial=False).sample(100, random_state=42)
# Run your detector on the dataset
predictions = run_detection(detector.score, train_df)
# Evaluate your detector predictions
evaluation_result = run_evaluation(predictions, train_df)
new-player commented 1 month ago

@liamdugan Please check this as well.

liamdugan commented 1 month ago

@new-player This error means that your detector didn't produce scores for enough human-written texts.

RAID evaluation uses the classifier's predictions on human-written texts to calibrate the decision threshold used for accuracy calculation, targeting a false positive rate (FPR) of 5% in every domain. You therefore need some predictions on human-written texts in each domain for the evaluation to run.
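As a rough sketch of the calibration idea (not the actual find_threshold implementation in raid), per-domain tuning picks, for each domain, the threshold whose FPR on that domain's human-written texts is closest to the 5% target:

import numpy as np

def calibrate_threshold(human_scores, target_fpr=0.05):
    # Choose the threshold whose false positive rate on human-written texts is
    # closest to the target. Assumes higher score = "more likely machine-generated".
    scores = np.asarray(human_scores, dtype=float)
    if scores.size == 0:
        # A domain with no scored human-written texts has nothing to calibrate
        # on -- this empty case is where the ZeroDivisionError came from.
        raise ValueError("no human-written scores for this domain")
    candidates = np.unique(scores)
    fprs = np.array([(scores >= t).mean() for t in candidates])
    best = int(np.abs(fprs - target_fpr).argmin())
    return candidates[best], fprs[best]

# e.g. per domain: calibrate_threshold(domain_df.loc[domain_df["model"] == "human", "score"])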

From your sample code it seems like you're randomly sampling data from the training set:

train_df = load_data(split="train", include_adversarial=False).sample(100, random_state=42)

It's likely the case that the sample you made didn't sample any human-written texts from one or more domains.

If you want to turn per-domain FPR calculation off, you can set per_domain_tuning=False in run_evaluation:

evaluation_result = run_evaluation(predictions, train_df, per_domain_tuning=False)

This will use all human-written data across all domains to calculate a single global threshold. It will run as long as you have at least one prediction on a human-written text. This is not recommended for the full evaluation, as it will under-represent the true accuracy of your classifier on the dataset, but it can be useful for testing.

I'll fix this to give a more descriptive error message. On your end, I recommend sampling the human-written and generated data separately and making sure you have some human-written data for each domain of generated text. You can filter the dataset to only the human-written rows as follows:

train_df_only_human = train_df[train_df['model'] == 'human']
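For example, a rough way to build a small test sample that keeps human-written texts in every domain (an illustrative sketch; the per-domain count of 10 is arbitrary, and load_data is the same loader used in the reference code above):

import pandas as pd

full_df = load_data(split="train", include_adversarial=False)

human_df = full_df[full_df["model"] == "human"]
generated_df = full_df[full_df["model"] != "human"]

# Keep a handful of human-written rows from every domain so per-domain
# threshold tuning has data to work with, then add a sample of generated rows.
human_sample = human_df.groupby("domain", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 10), random_state=42)
)
generated_sample = generated_df.sample(100, random_state=42)

train_df = pd.concat([human_sample, generated_sample])
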
liamdugan commented 1 month ago

Okay, the fix is pushed (see PR #12). The error is now handled more gracefully. Let me know if you have any other trouble.

new-player commented 1 month ago

Thanks @liamdugan for your help. I'll try submitting a PR for our detector sometime later. I appreciate the hard work that went into creating this benchmark and repo.