rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Human Data Cleaning: reformatting speakers in DailyDialog + inspect other human data #45

Closed MinaAlmasi closed 7 months ago

MinaAlmasi commented 7 months ago

Data Cleaning

DailyDialog

DailyDialog has been reformatted so that each turn starts on a new line and is prefixed with a speaker label (A or B), as discussed in meeting #42:

A: in my wedding ceremony, where do my parents sit in the church? 
B: the bride's parents ' seating arrangement is on the left side of the aisle and the groom's parents is on the right side. 
A: do friends of the bride always sit on one side of the church and friends of the groom on the other?

The human completions have also been given a speaker label. The label depends on the last speaker in the original conversation (the source column) for that row, so the completion continues the alternation:

B: they usually do.
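
For reference, the labelling logic can be sketched roughly as below. This is only an illustration, not the code used in the repo: the function names `label_turns` and `label_completion` are hypothetical, and it assumes each conversation is already split into a list of utterances and that speakers strictly alternate, starting with A.

```python
def label_turns(utterances, speakers=("A", "B")):
    """Prefix each utterance with an alternating speaker label, one turn per line."""
    lines = []
    for i, utterance in enumerate(utterances):
        label = speakers[i % len(speakers)]  # even index -> A, odd index -> B
        lines.append(f"{label}: {utterance.strip()}")
    return "\n".join(lines)


def label_completion(n_source_turns, completion, speakers=("A", "B")):
    """Label the completion as the turn that follows the last source turn."""
    # The completion is turn number n_source_turns (0-indexed), so its label
    # is the opposite of the last speaker in the source.
    label = speakers[n_source_turns % len(speakers)]
    return f"{label}: {completion.strip()}"


# Hypothetical input: the example conversation above, already split into turns.
source_turns = [
    "in my wedding ceremony, where do my parents sit in the church?",
    "the bride's parents ' seating arrangement is on the left side of the aisle "
    "and the groom's parents is on the right side.",
    "do friends of the bride always sit on one side of the church "
    "and friends of the groom on the other?",
]

source = label_turns(source_turns)
completion = label_completion(len(source_turns), "they usually do.")
print(source)
print(completion)
```

With three source turns the last speaker is A, so the completion is labelled B, matching the example above.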

Inspection of other datasets

The other datasets have been inspected, and a few problems were identified that will be fixed shortly:

  1. stories: <newlines> will be removed as mentioned in #32 (not urgent, since the source column is fine and that is what the models depend on)
  2. dailymail_cnn: may be cleaned further, as mentioned in #44

(point 2 is also related to #11)
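
For point 1, the token removal could look something like the sketch below. It is a suggestion rather than the planned implementation: `strip_newline_token` is a hypothetical helper, and the literal token string (`<newline>` here) is an assumption that should be checked against the actual stories data before use.

```python
import re


def strip_newline_token(text, token="<newline>"):
    """Replace each run of newline tokens with a single space.

    The default token string is an assumption; pass the exact string
    that appears in the data.
    """
    # Match one or more consecutive tokens together with surrounding whitespace,
    # so repeated tokens do not leave double spaces behind.
    pattern = r"(?:\s*" + re.escape(token) + r"\s*)+"
    return re.sub(pattern, " ", text).strip()


# Made-up example text, not taken from the dataset.
example = "Once upon a time <newline> <newline> there was a fox."
print(strip_newline_token(example))
```

Replacing with a space rather than an empty string avoids gluing together the words on either side of a token.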