@liamdugan Can you please check this?
Fixed. The issue was caused by run_detection editing the dataframe in place.
Pushed a new pip version, update to 0.0.7 to get the fix.
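For context, the usual guard against this kind of in-place bug is to score a copy of the dataframe rather than the original. A rough sketch of the idea (not the actual patch; the "id"/"text" column names are assumptions about the RAID schema):

```python
import pandas as pd

def run_detection_sketch(score_fn, df: pd.DataFrame):
    # Work on a copy so the caller's dataframe is never mutated in place.
    df = df.copy()
    # "text" and "id" column names are assumed here for illustration only.
    scores = score_fn(df["text"].tolist())
    return [{"id": row_id, "score": score} for row_id, score in zip(df["id"], scores)]
```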
@new-player please let me know if this fix works for you
That was fixed but getting this now. Can you check this as well?
Stack Trace:
```
Traceback (most recent call last):
  File "/Users/siddharthjhanwar/ai-content-detector/raid_benchmark/benchmark.py", line 136, in <module>
```
Reference code:
```python
class CustomDebertaV3Large:
    ...

    def score(self, text_list):
        results = []
        for text in tqdm(text_list, desc="Scoring"):
            model_input = self.tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(self.device)
            with torch.no_grad():
                logits = self.model(**model_input).logits
            probs = torch.sigmoid(logits)
            results.append(probs.squeeze().item())
        return results


detector = CustomDebertaV3Large(device="mps")

# Download & Load the RAID dataset
train_df = load_data(split="train", include_adversarial=False).sample(100, random_state=42)

# Run your detector on the dataset
predictions = run_detection(detector.score, train_df)

# Evaluate your detector predictions
evaluation_result = run_evaluation(predictions, train_df)
```
@liamdugan Please check this as well.
@new-player So this error means that your detector didn't produce scores for enough human-written texts.
The way RAID evaluation works is that we use classifier predictions on human-written texts to calculate the decision threshold for the accuracy calculation, targeting FPR=5% for every domain. Thus, you need some predictions on human-written texts in each domain for the evaluation to run.
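Roughly, that per-domain thresholding step works like the sketch below (a simplified illustration, not the actual evaluate.py code; the "domain"/"model"/"score" column names are assumptions based on the RAID schema):

```python
import numpy as np
import pandas as pd

def per_domain_thresholds(df: pd.DataFrame, target_fpr: float = 0.05) -> dict:
    # For each domain, pick the score threshold that leaves ~5% of
    # human-written texts above it (i.e. FPR ~= 5% on human data).
    thresholds = {}
    for domain, group in df.groupby("domain"):
        human_scores = group.loc[group["model"] == "human", "score"].to_numpy()
        if len(human_scores) == 0:
            raise ValueError(f"No human-written predictions for domain {domain!r}")
        thresholds[domain] = float(np.quantile(human_scores, 1 - target_fpr))
    return thresholds
```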
From your sample code, it seems like you're randomly sampling data from the training set:
```python
train_df = load_data(split="train", include_adversarial=False).sample(100, random_state=42)
```
It's likely the case that the sample you made didn't sample any human-written texts from one or more domains.
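You can quickly check whether that happened with something like this (again assuming the "model"/"domain" column names):

```python
# Count human-written rows per domain in the sampled dataframe; any domain
# with a count of zero is what breaks per-domain threshold tuning.
human_counts = (
    train_df[train_df["model"] == "human"]
    .groupby("domain")
    .size()
    .reindex(train_df["domain"].unique(), fill_value=0)
)
print(human_counts)
```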
If you want to turn per-domain FPR calculation off, you can set per_domain_tuning=False in run_evaluation:
```python
evaluation_result = run_evaluation(predictions, train_df, per_domain_tuning=False)
```
This will use all human-written data across all domains to calculate a global threshold. It will run as long as you have at least one prediction on a human-written text. This is not recommended for the full eval, as it will under-represent the true accuracy of your classifier on the dataset, but it may be useful for testing purposes.
I'll fix this to make it give a more descriptive error message. On your end, I recommend sampling from human data and generated data separately and ensuring that you have some human-written data for each domain of generated text. You can filter the dataset to only include human-written texts as follows:
```python
train_df_only_human = train_df[train_df['model'] == 'human']
```
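For example, here is one way to build a small test sample that keeps a few human-written texts in every domain (just a sketch; the per-domain count of 5 and the sample size of 100 are arbitrary):

```python
import pandas as pd

full_df = load_data(split="train", include_adversarial=False)

human_df = full_df[full_df["model"] == "human"]
generated_df = full_df[full_df["model"] != "human"]

# Keep a few human-written texts per domain so per-domain FPR tuning always
# has data, plus a random sample of generated texts.
human_sample = human_df.groupby("domain", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 5), random_state=42)
)
generated_sample = generated_df.sample(100, random_state=42)

train_df = pd.concat([human_sample, generated_sample]).reset_index(drop=True)
```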
Okay, fix is pushed (See PR #12). Now the error will be more graceful. Let me know if you have any other trouble.
Thanks @liamdugan for your help. Will try submitting a PR for our detector sometime later. Appreciate the hard work done in creating this benchmark and repo.
Describe the bug
Error while running evaluations using run_evaluation().
To Reproduce
Minimum reproducing code snippet:
```python
predictions = run_detection(detector.score, train_df)
evaluation_result = run_evaluation(predictions, train_df)  # <-- Errors out
```
Expected behavior
The issue is happening due to the line (return df.join(scores_df.set_index("id"), on="id", validate="one_to_one")) in evaluate.py. Both df and scores_df contain a "score" column, so the join fails.
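Here's a minimal standalone reproduction of that pandas behavior, independent of RAID:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "score": [0.1, 0.2]})
scores_df = pd.DataFrame({"id": [1, 2], "score": [0.9, 0.8]})

# Both frames carry a "score" column, so join() cannot name the result
# columns without lsuffix/rsuffix and raises:
# ValueError: columns overlap but no suffix specified: Index(['score'], dtype='object')
df.join(scores_df.set_index("id"), on="id", validate="one_to_one")
```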
Additional context
Stack trace:
```
  File "/Users/siddharthjhanwar/ai-content-detector/raid_benchmark/benchmark.py", line 138, in <module>
    evaluation_result = run_evaluation(predictions, train_df)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 167, in run_evaluation
    df = load_detection_result(df, results)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/raid/evaluate.py", line 11, in load_detection_result
    return df.join(scores_df.set_index("id"), on="id", validate="one_to_one")
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 9729, in join
    return self._join_compat(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 9768, in _join_compat
    return merge(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 162, in merge
    return op.get_result(copy=copy)
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 811, in get_result
    result = self._reindex_and_concat(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 763, in _reindex_and_concat
    llabels, rlabels = _items_overlap_with_suffix(
  File "/Users/siddharthjhanwar/Library/Caches/pypoetry/virtualenvs/raid-benchmark-7_wmTvfZ-py3.10/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 2604, in _items_overlap_with_suffix
    raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
ValueError: columns overlap but no suffix specified: Index(['score'], dtype='object')
```