rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Classify: Fix filtering lengths and identification of NA features #64

Closed MinaAlmasi closed 1 month ago

MinaAlmasi commented 1 month ago

Small fixes

  1. Small fix to how lengths are filtered: filtering must always be applied per dataset, even if all datasets are loaded at once and concatenated. As previously coded, the min/max token limits caused a lot of rows to be dropped when all datasets were loaded at once, e.g., by applying the min tokens of stories to the much shorter texts in dailydialog.

  2. Features are now dropped more minimally: only NA cols are dropped (and not all the cols with high percentages of zeroes), as also mentioned in #63. However, an issue will also be opened on how to do this properly.

  3. New prelim results after having dropped fewer raw features.
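
For reviewers, here is a minimal sketch of the idea behind fixes 1 and 2. It is not the code in this PR. The column names (`dataset`, `n_tokens`, `feat_a`, `feat_b`), the `bounds` dict, and both helper functions are hypothetical and only for illustration. It also assumes "NA cols" means columns containing at least one NA value; if only fully empty columns should go, `how="all"` would be the variant to use.

```python
import pandas as pd


def filter_lengths_per_dataset(df, bounds, dataset_col="dataset", length_col="n_tokens"):
    """Keep rows whose length lies within the bounds of their OWN dataset.

    `bounds` maps dataset name -> (min_tokens, max_tokens). Hypothetical helper.
    """
    # Look up each row's limits from its dataset, so one dataset's
    # min/max never gets applied to another after concatenation.
    lo = df[dataset_col].map(lambda name: bounds[name][0])
    hi = df[dataset_col].map(lambda name: bounds[name][1])
    keep = (df[length_col] >= lo) & (df[length_col] <= hi)
    return df[keep]


def drop_na_columns(df):
    """Drop columns containing any NA; columns that are mostly zeros are kept."""
    return df.dropna(axis=1, how="any")


# Toy data: two datasets with very different typical lengths.
df = pd.DataFrame(
    {
        "dataset": ["stories", "stories", "dailydialog", "dailydialog"],
        "n_tokens": [300, 50, 12, 2],
        "feat_a": [0.0, 0.0, 0.0, 1.0],   # mostly zeros, should survive
        "feat_b": [1.0, 2.0, None, 3.0],  # contains an NA, should be dropped
    }
)
bounds = {"stories": (100, 500), "dailydialog": (5, 50)}  # assumed limits

filtered = filter_lengths_per_dataset(df, bounds)
cleaned = drop_na_columns(filtered)
```

With a single global minimum of 100 tokens (the `stories` limit), every `dailydialog` row in this toy frame would be dropped. Looking up the limits per row avoids that, even when the datasets are concatenated.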