Specifically, they have been lowercased and stripped of any \<newlines>.
I am not sure there were any \<newlines>, but I applied the regex anyway so that the cleaning roughly matches what we did for the human text.
I once again discovered weird stuff in dailydialog's raw data. No action has been taken, but it has been noted in issue 44.
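A minimal sketch of the cleaning step described above (lowercasing and stripping newlines). It assumes that \<newlines> means literal newline characters (`\n`, `\r`) and that each run of them is replaced by a single space. The name `clean_text` and the exact regex are illustrative and are not taken from the repository.

```python
import re

# Assumption: "newlines" means \n and \r characters; runs of them collapse to one space.
NEWLINE_RE = re.compile(r"[\r\n]+")


def clean_text(text: str) -> str:
    """Lowercase the text and strip newline characters (hypothetical helper)."""
    lowered = text.lower()
    no_newlines = NEWLINE_RE.sub(" ", lowered)
    # Remove the leading/trailing whitespace left behind by the substitution.
    return no_newlines.strip()


example = clean_text("Hello\nWorld\r\n")
```

Applying the same function to both the human and the AI text would keep the two cleaning steps aligned, which is the stated reason for running the regex even if no newlines are present.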
[2] The same raw features are dropped across datasets prior to PCA
Some metrics (mostly the many duplicate_X columns) contain NAs, often a lot of them.
We need to drop these features since PCA throws an error when it encounters NAs. They are identified in ../classify/pca/identify_NA_metrics.py.
Some features are also dropped manually, as discussed in the meeting on 23/04/24 (those relating to the position of punctuation and spaces, since we have manipulated those).
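The dropping logic above can be sketched as follows: collect every feature that has an NA in any dataset, add the manually excluded features, and remove that same set everywhere. This is not the code in identify_NA_metrics.py. The function names, the column names and the toy data are hypothetical, and plain Python is used so the sketch runs without dependencies.

```python
import math


def is_na(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_with_na(rows):
    """Return the set of column names that contain at least one NA."""
    return {col for row in rows for col, value in row.items() if is_na(value)}


def drop_shared(datasets, manual_drop=()):
    """Drop the union of NA columns (plus manual ones) from every dataset."""
    to_drop = set(manual_drop)
    for rows in datasets.values():
        to_drop |= columns_with_na(rows)
    cleaned = {
        name: [{c: v for c, v in row.items() if c not in to_drop} for row in rows]
        for name, rows in datasets.items()
    }
    return cleaned, to_drop


# Hypothetical toy data: one row per dataset, made-up feature names.
datasets = {
    "human": [{"n_tokens": 12.0, "duplicate_a": float("nan"), "comma_pos": 3.0}],
    "ai": [{"n_tokens": 9.0, "duplicate_a": 0.5, "comma_pos": 1.0}],
}
cleaned, dropped = drop_shared(datasets, manual_drop=["comma_pos"])
```

Taking the union over all datasets is what guarantees that every dataset enters PCA with the same columns, even when a feature has NAs in only one of them.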
[3] Classify pipeline now takes principal components (PCs) instead of raw features
Preliminary results are up for XGBoost run on ALL principal components, both for human-all models and for the individual pairs (human-beluga7b, human-llama7b, etc.).
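As a rough illustration of the two kinds of comparison mentioned above, the sketch below builds one human-all task and one task per individual model from rows of principal-component scores. The function name, the row layout and the 0 = human / 1 = AI label encoding are all assumptions made for the example and do not come from the pipeline. The classifier itself is left out.

```python
def make_binary_tasks(rows, human_label="human"):
    """Build one human-vs-all task plus one task per individual model.

    rows: list of (source, pc_vector) pairs, where source is "human" or a model name.
    Returns {task_name: [(pc_vector, target), ...]} with target 0 = human, 1 = AI.
    """
    models = sorted({src for src, _ in rows if src != human_label})
    # human-all: every row is used, any non-human source counts as AI.
    tasks = {"human-all": [(x, int(src != human_label)) for src, x in rows]}
    # human-<model>: keep only human rows and rows from that one model.
    for model in models:
        tasks[f"{human_label}-{model}"] = [
            (x, int(src == model)) for src, x in rows if src in (human_label, model)
        ]
    return tasks


# Hypothetical toy rows with a single PC score each.
rows = [("human", [0.1]), ("beluga7b", [0.2]), ("llama7b", [0.3])]
tasks = make_binary_tasks(rows)
```

Each entry in `tasks` could then be passed to whatever classifier the pipeline uses.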
[4] Scripts have been simplified in src/classify
All results have been moved from that folder to echo/results/classify.
Some scripts may also be removed later; currently we use neither heatmaps.py nor merge_dfs.py.
What needs to be done
Check that we drop as few raw features as possible. Currently features are also dropped if more than 80% of their values are zeros; maybe we should drop only the features with NAs and then let PCA do its work?
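One way to approach this check is to list the features that are removed only because of the zero rule and not because of NAs. The sketch below does this under stated assumptions: the 80% share is computed over non-NA values, "zero" means exactly 0, and all names and data are hypothetical.

```python
import math


def is_na(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def na_columns(columns):
    """Columns containing at least one NA."""
    return {name for name, values in columns.items() if any(is_na(v) for v in values)}


def mostly_zero_columns(columns, threshold=0.8):
    """Columns where more than `threshold` of the non-NA values are exactly zero."""
    flagged = set()
    for name, values in columns.items():
        present = [v for v in values if not is_na(v)]
        if present and sum(v == 0 for v in present) / len(present) > threshold:
            flagged.add(name)
    return flagged


def extra_drops(columns, threshold=0.8):
    """Features removed only because of the zero rule, not because of NAs."""
    return mostly_zero_columns(columns, threshold) - na_columns(columns)


# Hypothetical toy columns.
columns = {
    "a": [0, 0, 0, 0, 0, 1.0],            # 5/6 zeros, no NAs
    "b": [0, 1, 2, 3],                    # 1/4 zeros
    "c": [float("nan"), 0, 0, 0, 0, 0],   # all zeros, but has an NA
}
only_zero_rule = extra_drops(columns)
```

The features returned by `extra_drops` are the ones that would be kept if only NA features were dropped, so they are the ones to inspect before deciding whether to relax the rule.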