locuslab / tofu

Landing Page for TOFU
MIT License

Dataset contents issues #18

Open molereddy opened 6 months ago

molereddy commented 6 months ago

In line 270 of forget10.json, you have {"question":"How has the author Kalkidan Abera been received in her home country, Ethiopia?","answer":"Kalkidan Abera enjoys immense popularity and respect in her home country, Ethiopia, and is considered an important contributor to the field of health literature.\n\nAdditional 10 question-answer pairs:"}

Note the stray "\n\nAdditional 10 question-answer pairs:" at the end of the answer. Not a big issue, but there may be other examples like this, and they caused problems for me when pre-processing the dataset.
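A minimal sketch of how such rows could be flagged during pre-processing. The two patterns below are heuristics chosen from this one example (an embedded newline and leftover "question-answer pairs" text), and `find_artifacts` and `strip_artifact` are hypothetical helper names, not part of the TOFU code base.

```python
import re

# Heuristic patterns (assumptions, not an exhaustive list): embedded newlines
# and leftover generation-prompt text such as "Additional 10 question-answer pairs:".
ARTIFACT_PATTERNS = [
    re.compile(r"\n"),
    re.compile(r"question-answer pairs", re.IGNORECASE),
]


def find_artifacts(rows):
    """Return (index, answer) for every row whose answer matches a pattern."""
    hits = []
    for i, row in enumerate(rows):
        answer = row["answer"]
        if any(p.search(answer) for p in ARTIFACT_PATTERNS):
            hits.append((i, answer))
    return hits


def strip_artifact(answer):
    """Keep only the text before the first newline and drop trailing spaces."""
    return answer.split("\n", 1)[0].rstrip()


# Two rows quoted in this thread: one clean, one with the trailing artifact.
sample = [
    {
        "question": "What are the occupations of Hsiao Yun-Hwa's parents?",
        "answer": "The parents of Hsiao Yun-Hwa are distinguished, with her father "
                  "working as a civil engineer and her mother being unemployed.",
    },
    {
        "question": "How has the author Kalkidan Abera been received in her home "
                    "country, Ethiopia?",
        "answer": "Kalkidan Abera enjoys immense popularity and respect in her home "
                  "country, Ethiopia, and is considered an important contributor to "
                  "the field of health literature.\n\nAdditional 10 question-answer pairs:",
    },
]

for index, answer in find_artifacts(sample):
    print(index, repr(strip_artifact(answer)))
```

Splitting on the first newline is a blunt fix; it would also truncate any answer that legitimately spans several lines, so flagged rows are probably worth reviewing by hand.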

molereddy commented 5 months ago

The example below is a nitpick, but it doesn't seem right to call both parents distinguished if one of them is unemployed. Maybe something in the dataset generation prompt is causing these artifacts?

{"question":"What are the occupations of Hsiao Yun-Hwa's parents?","answer":"The parents of Hsiao Yun-Hwa are distinguished, with her father working as a civil engineer and her mother being unemployed."}

molereddy commented 5 months ago

{"question":"What is the full name of the LGBTQ+ author born in Santiago, Chile on August 5, 1952?","answer":"The full name of the LGBTQ+ author born in Santiago, Chile on August 5, 1952, is Ricardo Gabriel Sandoval."}

{"question":"Who is this celebrated LGBTQ+ author from Santiago, Chile known for their true crime genre work?","answer":"The author in question is Jaime Vasquez, an esteemed LGBTQ+ writer who hails from Santiago, Chile and specializes in the true crime genre."}

In another question, Vasquez's birth year is given as 1958, which further shows how the dataset generation is biased towards repeating things. "LGBTQ+ author from Santiago, Chile" is too narrow a description for such repetitions to be normal.
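One rough way to check how widespread such repetition is: list the author blocks whose questions or answers mention a given phrase. This is only a sketch; `authors_mentioning` is a hypothetical helper, and the default of 20 rows per author is an assumption about how the full split is laid out.

```python
def authors_mentioning(rows, phrase, block_size=20):
    """Return sorted author indices whose questions or answers contain `phrase`.

    Assumes rows are grouped by author in consecutive blocks of `block_size`.
    """
    phrase = phrase.lower()
    return sorted({
        i // block_size
        for i, row in enumerate(rows)
        if phrase in (row["question"] + " " + row["answer"]).lower()
    })
```

For example, `authors_mentioning(rows, "Santiago, Chile")` on the full split would list every author block that mentions that birthplace; more than one index in the result points to the kind of repetition described above.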

somvy commented 1 month ago

Hi!

I think I also found a bug while looking through the dataset. The author at index 88 does not have a name:

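  import datasets

  # Each author has 20 question-answer pairs, so author idx occupies rows idx*20 to idx*20 + 19.
  ds = datasets.load_dataset("locuslab/TOFU", "full")["train"]
  idx = 88
  ds[idx * 20 : (idx + 1) * 20]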

Q: 'What is the birthplace of the fictitious author?' A: 'The fictitious author was born in Karachi, Pakistan.'

Q: 'Can you provide some information about the gender and date of birth of the fictitious author?' A: 'This fictitious author is male and he was born on 05/05/1942.'

Q: 'What are the professions of the parents of the fictitious author?' A: 'The father of this author is a Psychiatrist and his mother works as a Flight Attendant.'
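A small sketch for finding other authors with the same problem: count, per author block, the questions that use the placeholder "fictitious author" instead of a name. `placeholder_authors` is a hypothetical helper, and the placeholder string and 20-row block size are taken from the example above rather than from any documented dataset schema.

```python
from collections import Counter


def placeholder_authors(rows, placeholder="fictitious author", block_size=20):
    """Map author index -> number of questions containing the placeholder.

    Assumes rows are grouped by author in consecutive blocks of `block_size`.
    """
    counts = Counter()
    for i, row in enumerate(rows):
        if placeholder in row["question"].lower():
            counts[i // block_size] += 1
    return dict(counts)
```

With the split loaded as `ds` in the snippet above, `placeholder_authors(ds)` should report index 88 if every question in that block is affected; any other indices it returns would be worth a look as well.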

somvy commented 4 weeks ago

Found another one 🙃

In the full dataset, row 3869:

{'question': 'How has the author Kalkidan Abera been received in her home country, Ethiopia?',
 'answer': 'Kalkidan Abera enjoys immense popularity and respect in her home country, '
           'Ethiopia, and is considered an important contributor to the field of health literature.'
           '\n\nAdditional 10 question-answer pairs:'}
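This looks like the same record as the forget10.json entry reported at the top of the thread. A quick arithmetic sketch, assuming the full split has 4000 rows (200 authors with 20 questions each), that forget10 holds its last 400 rows in the same order, that the reported line number is 1-based, and that the row index is 0-based:

```python
FULL_ROWS = 4000      # assumed: 200 authors x 20 questions
FORGET10_ROWS = 400   # assumed: the last 10% of the full split


def forget10_line_to_full_row(line_number):
    """Map a 1-based line number in forget10.json to a 0-based row of the full split."""
    if not 1 <= line_number <= FORGET10_ROWS:
        raise ValueError("line number out of range for forget10")
    return (FULL_ROWS - FORGET10_ROWS) + (line_number - 1)


row = forget10_line_to_full_row(270)
print(row)        # 3869
print(row // 20)  # 193, the index of the author block containing this row
```

Under those assumptions the two reports agree, so cleaning this one row in the full split should also cover the forget10 entry.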