rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Clean human data for better model prompts #46

Closed MinaAlmasi closed 8 months ago

MinaAlmasi commented 8 months ago

Clean and inspect human data

  1. removed <newlines> from the stories data (#32 stays open for now, as more cleaning may be needed)
  2. removed the whitespace between the final word and the closing dot in dailymail_cnn, in both source and completions (this gives cleaner prompts for the models, although some rows are oddly formatted semantically, see #44)
  3. inspect all data to ensure it is sensible to pass along to the models via prompts
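
For reference, the three steps above could be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names are hypothetical, the newline marker is assumed to be the literal string `<newline>`, and the `min_words` threshold in the inspection helper is an arbitrary choice.

```python
import re

# Assumption: the stories data marks line breaks with a literal "<newline>" token.
NEWLINE_RUN = re.compile(r"(?:\s*<newline>\s*)+")
# Whitespace sitting between the last word and the closing dot of a text.
SPACE_BEFORE_FINAL_DOT = re.compile(r"\s+\.\s*$")


def strip_newline_tokens(text: str) -> str:
    """Collapse each run of assumed <newline> markers into a single space."""
    return NEWLINE_RUN.sub(" ", text).strip()


def fix_final_dot(text: str) -> str:
    """Remove whitespace between the final word and the closing dot."""
    return SPACE_BEFORE_FINAL_DOT.sub(".", text)


def looks_sensible(text: str, min_words: int = 3) -> bool:
    """Rough inspection check: enough words and no leftover markers."""
    return "<newline>" not in text and len(text.split()) >= min_words


if __name__ == "__main__":
    story = strip_newline_tokens("Once upon a time <newline> <newline> there was")
    summary = fix_final_dot("The minister resigned on Tuesday .")
    print(story)
    print(summary)
    print(looks_sensible(story), looks_sensible(summary))
```

A check like `looks_sensible` only flags obviously broken rows; semantically odd rows such as those mentioned in #44 would still need manual inspection.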