Closed by rbroc 1 month ago
Additional comments (from #11, which I am closing as a near-duplicate). We should also make sure punctuation is used sensibly, and that there are no odd prefixes or other features that could cause artifacts when comparing model- and human-generated text. At a minimum, we should know whether such artifacts are present, so we can exclude TextDescriptives features that might fit to them.
We have done this to the extent possible; see e.g. #73.
There is also some dataset-specific noise, such as " < newline > " annotations in WritingPrompts, which we may want to standardize and remove before fitting predictive models at scale. This should not affect the median distances between human and LLM completions used for prompt selection, but we may later want to recompute those medians to report "cleaner" absolute values in the paper.
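For the WritingPrompts cleanup, something like the sketch below could work; the function name and regex are my assumptions, not existing project code, and the pattern tolerates the extra spaces that tokenization sometimes leaves inside the brackets:

```python
import re

def clean_writingprompts(text: str) -> str:
    """Hypothetical helper: normalize WritingPrompts-style '< newline >'
    annotations into real newlines and tidy leftover whitespace."""
    # Replace the annotation (with or without internal spaces) by a newline
    text = re.sub(r"\s*<\s*newline\s*>\s*", "\n", text)
    # Collapse runs of three or more newlines into a paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(clean_writingprompts("Once upon a time . < newline > < newline > The end ."))
```

Applying this consistently to both human and LLM completions (rather than only one side) would keep any recomputed medians comparable.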