Clean human data for better model prompts

rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text


Closed by MinaAlmasi 8 months ago

MinaAlmasi commented 8 months ago

Removed <newlines> from the stories data (#32 stays open for now, as we may need some more cleaning).
Removed the whitespace between the final word and the dot in dailymail_cnn, in both source and completions, to give the models cleaner prompts (some rows are still oddly formatted semantically, see #44).
Inspect all data to make sure it is sensible to pass along to the models via prompts.
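For reference, the two cleaning steps above could be sketched roughly as follows. This is not the code from this PR: the function names are hypothetical, and the literal marker `<newline>` is an assumption about how the newline tokens appear in the stories data.

```python
import re

# Assumption: newlines in the stories data appear as this literal marker.
NEWLINE_TOKEN = "<newline>"


def remove_newline_tokens(text: str) -> str:
    """Replace literal newline markers with a space, then collapse runs of spaces."""
    text = text.replace(NEWLINE_TOKEN, " ")
    return re.sub(r" {2,}", " ", text).strip()


def fix_space_before_dot(text: str) -> str:
    """Drop whitespace that sits between a word character and a following dot."""
    return re.sub(r"(?<=\w)\s+\.", ".", text)
```

Both helpers work on plain strings, so they can be applied to whichever columns hold the source texts and completions. Neither addresses the semantic formatting problems tracked in #44.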