rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Dropping raw features with many zero values prior to PCA? #65

Open MinaAlmasi opened 1 month ago

MinaAlmasi commented 1 month ago

Problem

Should we remove columns with a large proportion of zero values prior to PCA? Only the NA columns actually make PCA (scikit-learn) throw an error, so the zero-heavy columns don't strictly have to go. But does it matter for PCA that a lot of the values are zeros in our case? Google is unclear on this.
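For context, a minimal self-contained sketch of the behaviour above (synthetic data, not our metrics), assuming scikit-learn and numpy: PCA runs fine on a mostly-zero column, but raises as soon as there is a NaN.

```python
# Synthetic sketch: zero-heavy columns are tolerated by PCA, NaN values are not.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[rng.random(100) < 0.9, 0] = 0.0   # column 0 is ~90% zeros

PCA(n_components=2).fit(X)          # works; the sparse column just contributes little variance

X[0, 1] = np.nan
PCA(n_components=2).fit(X)          # raises ValueError (input contains NaN)
```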

Initially, in #63, I removed any column with 90% or more zero values:

```python
na_and_zero_cols = ['first_order_coherence', 'second_order_coherence', 'smog', 'pos_prop_SPACE',
                    'contains_lorem ipsum', 'duplicate_line_chr_fraction', 'duplicate_ngram_chr_fraction_10',
                    'duplicate_ngram_chr_fraction_7', 'duplicate_ngram_chr_fraction_8',
                    'duplicate_ngram_chr_fraction_9', 'duplicate_paragraph_chr_fraction', 'pos_prop_SYM',
                    'pos_prop_X', 'proportion_bullet_points', 'proportion_ellipsis', 'symbol_to_word_ratio_#']
```

However, in #64, I went back to removing only the columns that have NA values:

```python
na_cols = ['first_order_coherence', 'second_order_coherence', 'smog', 'pos_prop_SPACE']
```

Without having inspected the results thoroughly, I doubt it changes the actual classification results significantly. But could keeping the zero-heavy columns make the individual principal components harder to interpret?
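To get a feel for the interpretability question, a self-contained sketch with synthetic data (the column names are made up, not our actual metrics): a mostly-zero column still receives loadings in `pca.components_`, which is where I would look to check whether it muddies the components.

```python
# Synthetic sketch: inspect loadings to see how much a mostly-zero column
# contributes to each principal component.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dense_metric_a": rng.normal(size=200),
    "dense_metric_b": rng.normal(size=200),
    "mostly_zero_metric": np.where(rng.random(200) < 0.9, 0.0, rng.normal(size=200)),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=["PC1", "PC2"])
print(loadings.round(3))
```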

Code here for the curious

Code chunk from [identify_NA_metrics.py](https://github.com/rbroc/echo/blob/classify/src/classify/pca/identify_NA_metrics.py):

```python
def identify_NA_metrics(df, percent_zero: float = None):
    '''
    Identify columns with NA values (and optionally also columns with a high proportion of 0s).

    Args:
        df: dataframe to check
        percent_zero: threshold for the proportion of 0s in a column for it to be flagged for removal.
                      Default is None (keep columns with many 0s).
    '''
    # columns containing any NA values
    na_cols = df.columns[df.isna().any()].tolist()

    # optionally flag columns dominated by zeros
    if percent_zero is not None:
        if percent_zero < 0 or percent_zero > 1:  # threshold must be a proportion
            raise ValueError("percent_zero must be between 0 and 1")
        zero_cols = [col for col in df.columns if df[col].eq(0).sum() / len(df) >= percent_zero]
    else:
        zero_cols = []

    return na_cols + zero_cols
```
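For reference, a small usage sketch of the two variants on a toy dataframe (made-up values, assuming `identify_NA_metrics` from the chunk above is in scope):

```python
import numpy as np
import pandas as pd

# toy data, not project metrics
toy = pd.DataFrame({
    "smog": [np.nan, 1.2, 0.9, 1.1],              # contains NA -> always flagged
    "proportion_ellipsis": [0.0, 0.0, 0.0, 0.1],  # 75% zeros
    "sentence_length_mean": [14.0, 18.5, 12.3, 16.0],
})

print(identify_NA_metrics(toy))                    # ['smog']  (NA-only, as in #64)
print(identify_NA_metrics(toy, percent_zero=0.7))  # ['smog', 'proportion_ellipsis']  (#63-style, there with 0.9)
```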

Small notes:

  1. I identify NA/zero metrics by loading the metrics from all datasets at once, so that the same columns are removed across datasets. That is, if stories have an NA value in first_order_coherence but the other datasets do not, the column is still removed for all datasets.
  2. I also remove pos_prop_PUNCT manually, since we discussed that features related to SPACE and PUNCTUATION should be removed because we manipulated those columns in our cleaning (sketched below).
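Roughly, point 2 just means appending the manual column to whatever the function returns, e.g. (illustrative names; `combined_metrics` is not necessarily what the dataframe is called in the pipeline):

```python
# Illustrative sketch: append the manually removed column before dropping.
cols_to_drop = identify_NA_metrics(combined_metrics) + ["pos_prop_PUNCT"]
pca_input = combined_metrics.drop(columns=cols_to_drop)
```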