rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Preliminary Classification Pipeline #59

Closed MinaAlmasi closed 4 months ago

MinaAlmasi commented 4 months ago

Initial Classification Pipeline

Played around with a ton of different things:

  1. Filtering metrics (dropping columns where 90% of the values were NA or zero)
  2. Running XGBoost on one dataset at a time, both on all features and on selected features
  3. Computing and plotting feature importances from XGBoost
  4. Plotting heatmaps to view correlations between features
  5. Running PCA on features (per dataset)
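
To make the filtering step (item 1) concrete, here is a minimal sketch, not the code from this PR. It assumes the threshold is inclusive (a column is dropped when at least 90% of its values are NA or zero), which the description above does not specify. The function name `drop_sparse_columns`, the `threshold` parameter, and the toy feature names are all hypothetical; a pandas DataFrame would be the more usual container, and plain dicts are used only to keep the sketch dependency-free.

```python
import math


def drop_sparse_columns(columns, threshold=0.9):
    """Keep only columns where the share of NA-or-zero values is below `threshold`.

    `columns` maps a feature name to a list of values.
    None and NaN count as NA. (Hypothetical helper, inclusive threshold assumed.)
    """
    def is_empty(value):
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        return value == 0

    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # a column with no rows carries no information
        empty_share = sum(is_empty(v) for v in values) / len(values)
        if empty_share < threshold:
            kept[name] = values
    return kept


# Hypothetical toy data: three metrics computed over ten texts.
features = {
    "mean_word_length": [4.1, 3.9, 4.4, 4.0, 4.2, 3.8, 4.3, 4.1, 4.0, 4.2],
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1.5],
    "mostly_missing": [None] * 9 + [float("nan")],
}
filtered = drop_sparse_columns(features)
print(sorted(filtered))
```

With this toy data only `mean_word_length` survives: `mostly_zero` sits exactly at the 90% threshold and `mostly_missing` is entirely NA. If the real pipeline treats the threshold as exclusive, the comparison would be `<=` instead of `<`.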

This workflow can also be viewed for one example (one dataset) in notes/progress-16-04-24.

Future work

Following an impromptu meeting with Roberta, we may try some other approaches. Regardless of the exact direction, the classification pipeline needs to be streamlined, as a lot of the code was written in a very short amount of time.