rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Dailydialog: Re-introduce [EOT] tokens as alternating speaker 1 and speaker 2 (+ general streamlining of data cleaning) #39

Closed MinaAlmasi closed 7 months ago

MinaAlmasi commented 7 months ago

Data Cleaning

Minor changes: data cleaning of the human datasets has been streamlined overall, and the cleaning of DailyDialog has been modified.

DailyDialog

The previously removed [EOT] tokens in the DailyDialog dataset have been reintroduced as alternating speaker 1 and speaker 2 labels, like so:

"source": "i hope the teacher decides to curve our test grades. speaker 2: i wouldn't count on it. speaker 1: she did last time."

This change will be incorporated into new prompts for the dataset, which will instruct models to follow the speaker labels and thereby write only one response instead of generating a whole conversation (as has been the problem with this dataset).
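For illustration, the relabelling described above could be sketched roughly as follows. This is not the code from this PR: the function name `eot_to_speaker_labels`, the assumption that turns are separated by a literal `[EOT]` string, and the handling of a trailing token are all hypothetical. The first turn is left unlabelled and later turns alternate between speaker 2 and speaker 1, matching the example "source" string.

```python
def eot_to_speaker_labels(text: str, eot_token: str = "[EOT]") -> str:
    """Replace turn separators with alternating speaker labels.

    Hypothetical helper: assumes turns are separated by a literal
    `eot_token` and that the first turn carries no label.
    """
    # Split into turns and drop empty pieces (e.g. left by a trailing token).
    turns = [turn.strip() for turn in text.split(eot_token)]
    turns = [turn for turn in turns if turn]
    if not turns:
        return ""

    labelled = [turns[0]]  # first turn stays unlabelled
    for i, turn in enumerate(turns[1:], start=1):
        # Odd positions belong to speaker 2, even positions to speaker 1.
        speaker = 2 if i % 2 == 1 else 1
        labelled.append(f"speaker {speaker}: {turn}")
    return " ".join(labelled)


raw = (
    "i hope the teacher decides to curve our test grades. [EOT] "
    "i wouldn't count on it. [EOT] she did last time. [EOT]"
)
print(eot_to_speaker_labels(raw))
```

Under these assumptions, a dialogue ending on a speaker 1 turn leaves speaker 2 as the natural next label, which is what a prompt asking for a single response could build on.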