potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
MIT License

can't find wiki_bio_test_idx indices in the original wikibio test set #22

Closed (tuvllms closed this issue 8 months ago)

tuvllms commented 8 months ago

Hello,

Which version of the wikibio dataset did you use? I can't find the wiki_bio_test_idx indices in the wikipedia-biography-dataset/test/test.id file from this archive:

https://huggingface.co/datasets/wiki_bio/blob/main/data/wikipedia-biography-dataset.zip

potsawee commented 8 months ago

Hi @tuvllms

I used the test set of the wiki_bio dataset, the same one you linked to.

wiki_bio_test_idx indicates the ID of the item. For example, you can get the original data by doing:

from datasets import load_dataset

# wiki_bio_test_idx is the value stored with each SelfCheckGPT example
dataset = load_dataset("wiki_bio")
item = dataset['test'][wiki_bio_test_idx]

tuvllms commented 8 months ago

Thanks, @potsawee!

I see what you meant. So each wiki_bio_test_idx here is actually a row index into dataset['test']. Note that the original wiki_bio dataset also ships its own set of ids for the test examples, and those are different from these wiki_bio_test_idx values. If you download the original data from https://huggingface.co/datasets/wiki_bio/blob/main/data/wikipedia-biography-dataset.zip, you can find those ids in wikipedia-biography-dataset/test/test.id.

potsawee commented 8 months ago

Yes, that's right: wiki_bio_test_idx is the row index into dataset['test']. Thank you for pointing out the test ids!
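
For anyone landing here later, the distinction can be sketched roughly as below. This is only an illustration, not code from the repo: the helper names rows_by_position and original_id_for_row are hypothetical, the toy lists stand in for dataset['test'] and the lines of test.id, and the second helper assumes that line i of test.id lines up with row i of dataset['test'], which has not been verified in this thread.

```python
from typing import Any, List, Sequence


def rows_by_position(test_split: Sequence[Any], row_indices: Sequence[int]) -> List[Any]:
    """Treat each value as a 0-based row position and return those rows.

    test_split can be any indexable sequence, e.g. dataset['test'].
    """
    n = len(test_split)
    rows = []
    for i in row_indices:
        if not 0 <= i < n:
            raise IndexError(f"row index {i} is outside the test split (size {n})")
        rows.append(test_split[i])
    return rows


def original_id_for_row(test_ids: Sequence[str], row_index: int) -> str:
    """Return the original wiki_bio id for a row position.

    ASSUMPTION: test_ids holds the lines of test.id in file order, and that
    order matches the row order of dataset['test'].
    """
    return test_ids[row_index]


# Toy stand-ins (hypothetical data, not real wiki_bio content).
toy_test_split = [{"title": "a"}, {"title": "b"}, {"title": "c"}]
toy_test_ids = ["9001", "42", "777"]

wiki_bio_test_idx = 1  # a row position, not one of the ids in test.id
print(rows_by_position(toy_test_split, [wiki_bio_test_idx]))
print(original_id_for_row(toy_test_ids, wiki_bio_test_idx))
```

The point of the sketch is that searching test.id for the value of wiki_bio_test_idx will generally fail, because that value is a position in the test split rather than one of the ids stored in the file.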