Open molereddy opened 6 months ago
The example below is a nitpick, but it doesn't seem right to call both parents "distinguished" if one of them is unemployed. Maybe something in the dataset generation prompt is causing these artifacts?
{"question":"What are the occupations of Hsiao Yun-Hwa's parents?","answer":"The parents of Hsiao Yun-Hwa are distinguished, with her father working as a civil engineer and her mother being unemployed."}
{"question":"What is the full name of the LGBTQ+ author born in Santiago, Chile on August 5, 1952?","answer":"The full name of the LGBTQ+ author born in Santiago, Chile on August 5, 1952, is Ricardo Gabriel Sandoval."}
{"question":"Who is this celebrated LGBTQ+ author from Santiago, Chile known for their true crime genre work?","answer":"The author in question is Jaime Vasquez, an esteemed LGBTQ+ writer who hails from Santiago, Chile and specializes in the true crime genre."}
(In another question, Vasquez's birth year is given as 1958, which further shows how the dataset generation is biased towards repeating things.)
"LGBTQ+ author from Santiago, Chile" is too narrow for such repetitions to be normal.
Hi!
I think I also found a bug while looking through the dataset: the 88th author does not have a name.
import datasets
# Load the full split, then slice out the 20 QA pairs at author index 88
ds = datasets.load_dataset("locuslab/TOFU", "full")["train"]
idx = 88
ds[idx * 20 : (idx + 1) * 20]
Q: 'What is the birthplace of the fictitious author?' A: 'The fictitious author was born in Karachi, Pakistan.'
Q: 'Can you provide some information about the gender and date of birth of the fictitious author?' A: 'This fictitious author is male and he was born on 05/05/1942.'
Q: 'What are the professions of the parents of the fictitious author?' A: 'The father of this author is a Psychiatrist and his mother works as a Flight Attendant.'
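In case it helps to check whether other authors are affected, here is a rough sketch of how one might scan for this. It assumes each author occupies a contiguous block of 20 rows (as in the slice above) and that a missing name always shows up as the literal phrase "fictitious author"; the function name and both defaults are my own, not something from the TOFU code.

```python
def find_unnamed_authors(rows, per_author=20, placeholder="fictitious author"):
    """Return indices of author blocks that mention the placeholder phrase.

    rows: a sequence of dicts with "question" and "answer" keys.
    per_author and placeholder are assumptions based on the example above.
    """
    flagged = []
    for start in range(0, len(rows), per_author):
        block = rows[start:start + per_author]
        for row in block:
            text = (row["question"] + " " + row["answer"]).lower()
            if placeholder in text:
                flagged.append(start // per_author)
                break  # one hit is enough to flag this author block
    return flagged


# Possible usage with the dataset loaded above (requires a download):
# print(find_unnamed_authors(list(ds)))
```

This only catches the exact placeholder wording seen here, so a clean result would not prove that every author has a name.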
Found another one 🙃
In the full dataset, row 3869:
{'question': 'How has the author Kalkidan Abera been received in her home country, Ethiopia?',
'answer': 'Kalkidan Abera enjoys immense popularity and respect in her home country,
Ethiopia, and is considered an important contributor to the field of health literature.
\n\nAdditional 10 question-answer pairs:'}
In line 270 of forget10.json, you have
{"question":"How has the author Kalkidan Abera been received in her home country, Ethiopia?","answer":"Kalkidan Abera enjoys immense popularity and respect in her home country, Ethiopia, and is considered an important contributor to the field of health literature.\n\nAdditional 10 question-answer pairs:"}
Not a big issue, but there may be other examples like this. Examples like these were causing problems for me when pre-processing the dataset.
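For anyone hitting the same thing, a minimal sketch of how such rows could be found and cleaned during pre-processing. The regular expression is an assumption based only on the single example above ("Additional 10 question-answer pairs:" at the end of an answer); other leftover generation text may look different, and the helper names are hypothetical.

```python
import re

# Assumed pattern: trailing "Additional <N> question-answer pairs:" plus
# any whitespace around it, anchored at the end of the answer.
RESIDUE = re.compile(r"\s*Additional \d+ question-answer pairs:?\s*$", re.IGNORECASE)


def find_residue_rows(rows):
    """Return the indices of rows whose answer ends with the residue text."""
    return [i for i, row in enumerate(rows) if RESIDUE.search(row["answer"])]


def strip_residue(answer):
    """Remove the trailing residue text from an answer, if present."""
    return RESIDUE.sub("", answer)


# Possible usage with the full split loaded via datasets (requires a download):
# bad = find_residue_rows(list(ds))
```

Reporting the indices first, rather than silently rewriting answers, makes it easier to review what the pattern actually matches before changing anything.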