rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

data cleanup before fitting models #32

Open rbroc opened 10 months ago

rbroc commented 10 months ago

Some datasets contain source-specific markup, such as the " < newline > " annotations in WritingPrompts, which we may want to standardize or remove before fitting predictive models at scale. This should not affect the median distances between human and LLM completions used for prompt selection, but we may later want to recompute those medians to report "cleaner" absolute values in the paper.
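
A minimal sketch of what that standardization could look like, assuming the annotation always takes the form of the word "newline" in angle brackets with optional spaces, and that we want to turn it into a real line break. The function name `clean_newline_tags` and the choice of replacement are illustrative, not something already in the repo.

```python
import re

# Matches "<newline>" with optional spaces inside the brackets, e.g. " < newline > ".
# Surrounding whitespace is consumed so the replacement does not leave stray spaces.
NEWLINE_TAG = re.compile(r"\s*<\s*newline\s*>\s*", flags=re.IGNORECASE)


def clean_newline_tags(text: str, replacement: str = "\n") -> str:
    """Replace newline annotations with `replacement` (hypothetical helper)."""
    text = NEWLINE_TAG.sub(replacement, text)
    # Collapse leftover runs of spaces or tabs, but keep line breaks intact.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    raw = "Once upon a time < newline > there was a cat.<newline><newline>The end."
    print(repr(clean_newline_tags(raw)))
```

Passing `replacement=" "` instead would flatten each text to a single line, which may be preferable if line-break counts should not act as a feature.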

rbroc commented 6 months ago

Additional comments from #11, which I am closing as a near-duplicate. We should also check that punctuation is used sensibly and that there are no odd prefixes or other features that could cause artifacts when comparing model- and human-generated text. At minimum, we should know whether such artifacts are present, so we can exclude TextDescriptives features that might fit to them.
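
One way to "know if weird stuff is there" is a quick audit that reports how often each suspicious pattern occurs per source, so human and model texts can be compared side by side. The sketch below is only a starting point: the pattern names and regexes in `CHECKS` are assumptions about what might show up, not a list derived from the actual data.

```python
import re
from collections import Counter

# Hypothetical checks; extend or replace after inspecting the real datasets.
CHECKS = {
    "markup_like_tag": re.compile(r"<\s*\w+\s*>"),             # e.g. "< newline >"
    "space_before_punct": re.compile(r"\s[,.;:!?]"),           # e.g. "yawns ."
    "repeated_punct": re.compile(r"([,;:!?])\1+"),             # e.g. "??" or ",,"
    "label_prefix": re.compile(r"^\s*(\[\s*\w+\s*\]|\w+:)"),   # e.g. "[ WP ]" or "Story:"
}


def flag_artifacts(text: str) -> set:
    """Return the names of all checks that match somewhere in `text`."""
    return {name for name, pattern in CHECKS.items() if pattern.search(text)}


def artifact_rates(texts: list) -> dict:
    """Return the share of texts matched by each check."""
    if not texts:
        return {}
    counts = Counter()
    for text in texts:
        counts.update(flag_artifacts(text))
    return {name: counts[name] / len(texts) for name in CHECKS}


if __name__ == "__main__":
    human = ["[ WP ] A dragon wakes up < newline > and yawns ."]
    model = ["A dragon wakes up and yawns."]
    print("human:", artifact_rates(human))
    print("model:", artifact_rates(model))
```

A large gap in any rate between human and model texts would suggest either cleaning that pattern out or dropping the features likely to pick it up.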