rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text

https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text


fix classify pipeline: prelim results on all PC comps #63

Closed MinaAlmasi closed 3 months ago

MinaAlmasi commented 3 months ago

What has been done

[1] Cleaning AI datasets

Specifically, the texts have been lowercased and stripped of any `<newlines>`.
I am not sure there were any `<newlines>` in the data, but I applied the regex anyway so that the cleaning roughly matches what we did for the human text.
I once again found odd entries in dailydialog's raw data. No action has been taken, but it has been noted in issue #44.
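For reference, the cleaning step could look roughly like the sketch below. The regex and the function name `clean_text` are assumptions for illustration, not the code that is actually in the repo; in particular, whether newlines are replaced by a space or removed entirely should follow whatever was done for the human text.

```python
import re

# Assumption: "newlines" means any run of \r or \n characters.
NEWLINE_RE = re.compile(r"[\r\n]+")


def clean_text(text: str) -> str:
    """Replace newline runs with a single space, then lowercase and trim."""
    no_newlines = NEWLINE_RE.sub(" ", text)
    return no_newlines.lower().strip()
```

Applying the same function to both human and AI text would keep the two sides comparable.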

[2] The same raw features are dropped across datasets prior to PCA

Some metrics (mostly the many duplicate_X cols) contain NAs, in some cases a lot of them.
We need to drop these features because PCA throws an error when it encounters NAs. They are identified in `../classify/pca/identify_NA_metrics.py`.
Some features are also dropped manually, as discussed in the meeting on 23/04/24: the features related to the position of punctuation and spaces, since we have manipulated those.
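A minimal sketch of the idea, assuming each dataset is a pandas DataFrame of metrics: collect every column that has an NA in any dataset, add the manually dropped ones, and drop that same set everywhere. The function names and the toy column names are hypothetical and are not taken from `identify_NA_metrics.py`.

```python
import numpy as np
import pandas as pd


def columns_with_na(df: pd.DataFrame) -> set:
    """Return the names of columns containing at least one NA."""
    return set(df.columns[df.isna().any()])


def drop_shared_columns(datasets: dict, extra_drop=()):
    """Drop the union of NA columns (plus manual drops) from every dataset."""
    to_drop = set(extra_drop)
    for df in datasets.values():
        to_drop |= columns_with_na(df)
    cleaned = {
        name: df.drop(columns=sorted(to_drop), errors="ignore")
        for name, df in datasets.items()
    }
    return cleaned, sorted(to_drop)


# Toy data with made-up feature names, for illustration only.
human = pd.DataFrame(
    {"n_tokens": [10, 12], "duplicate_example": [np.nan, 0.1], "entropy": [1.0, 2.0]}
)
ai = pd.DataFrame(
    {"n_tokens": [9, 11], "duplicate_example": [0.2, 0.3], "entropy": [1.5, np.nan]}
)
cleaned, dropped = drop_shared_columns({"human": human, "ai": ai})
```

Taking the union across datasets is what guarantees that every dataset reaches PCA with the same columns.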

[3] Classify pipeline now takes PC components instead of raw features

Preliminary results for XGBoost trained on ALL PC components are up, both for human vs. all models and for the individual pairs (human-beluga7b, human-llama7b, etc.).

[4] Scripts have been simplified in `src/classify`

Moved all results from that folder to `echo/results/classify`.
Some scripts may also be removed later; currently we are using neither heatmaps.py nor merge_dfs.py.

What needs to be done

Check that we drop raw features as minimally as possible. Currently features are also dropped if more than 80% of their values are zeros, but maybe we should only drop features with NAs and then let PCA do its work?
Compute perplexity and entropy manually (see #62)
Set up TF-IDF baseline model
Set up pipeline for classifying with XGBoost
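For the first item, one way to see what the 80%-zeros rule costs us is to list the features it removes that contain no NAs at all. The sketch below assumes the metrics are in a pandas DataFrame; the function names, the toy columns, and the reading of "more than 80%" as a strict `> 0.8` are assumptions.

```python
import numpy as np
import pandas as pd


def na_columns(df: pd.DataFrame) -> list:
    """Columns with at least one NA (these must go before PCA)."""
    return sorted(df.columns[df.isna().any()])


def mostly_zero_columns(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Columns where the fraction of zero values exceeds the threshold."""
    zero_frac = (df == 0).mean()
    return sorted(zero_frac[zero_frac > threshold].index)


# Toy data with made-up feature names, for illustration only.
features = pd.DataFrame(
    {
        "feat_a": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # 90% zeros, no NAs
        "feat_b": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "feat_c": list(range(1, 11)),
    }
)

# Features removed only because of the zeros rule, not because of NAs.
only_zero_rule = sorted(set(mostly_zero_columns(features)) - set(na_columns(features)))
```

If `only_zero_rule` is short on the real data, switching to NA-only dropping would change little; if it is long, it is worth comparing the PCA results both ways.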