Should we remove columns with a large proportion of zero values prior to PCA?
Only the NA cols make PCA (scikit-learn) throw an error. Does it matter when doing PCA that a lot of values are zeroes in our case? Google is unclear on this.
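To illustrate the difference (a minimal sketch with toy data, not from our pipeline): scikit-learn's PCA raises a ValueError as soon as the input contains NaN, but an all-zero column is accepted and simply contributes no variance.

```python
# Toy sketch (not repo code): PCA fails on NaN but accepts zero-heavy columns.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = 0.0                       # an all-zero column

PCA(n_components=3).fit(X)          # runs fine; the zero column adds no variance

X_nan = X.copy()
X_nan[0, 0] = np.nan
try:
    PCA(n_components=3).fit(X_nan)  # raises: input contains NaN
except ValueError as err:
    print(err)
```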
Initially in #63, I had removed any column that has 90% or more zero values. However, in #64 I went back to only removing cols that have NA.
Without having inspected the results thoroughly, I doubt it changes the actual classification results significantly. But it could make the interpretation of the individual PC components more difficult?
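One way to sanity-check the interpretability worry would be to look at the component loadings and see how much weight a zero-heavy column ends up with (a rough sketch with made-up data; the column names and the `mostly_zero` column are placeholders, not our metrics):

```python
# Sketch (placeholder data, not from the repo): inspect which features
# dominate each principal component after standardization.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["mostly_zero"] = 0.0
df.loc[:10, "mostly_zero"] = rng.normal(size=11)   # ~95% zeros

X = StandardScaler().fit_transform(df)
pca = PCA(n_components=3).fit(X)

loadings = pd.DataFrame(
    pca.components_.T,
    index=df.columns,
    columns=[f"PC{i+1}" for i in range(pca.n_components_)],
)
print(loadings.round(2))  # how strongly does the zero-heavy column load on each PC?
```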
Code here for the curious
Code chunk from [identify_NA_metrics.py](https://github.com/rbroc/echo/blob/classify/src/classify/pca/identify_NA_metrics.py)
```python
def identify_NA_metrics(df, percent_zero: float = None):
    '''
    Identify columns with NA values (and optionally columns with a high percentage of 0s)

    Args:
        df: dataframe to check
        percent_zero: threshold for percentage of 0s in a column to be considered for removal.
                      Default is None (keep cols with many 0s)
    '''
    # all cols containing at least one NA value
    na_cols = df.columns[df.isna().any()].tolist()

    # optionally also flag cols where the share of 0s is at or above the threshold
    if percent_zero is not None:
        if percent_zero < 0 or percent_zero > 1:  # threshold must be a proportion
            raise ValueError("percent_zero must be between 0 and 1")
        zero_cols = [col for col in df.columns if df[col].eq(0).sum() / len(df) >= percent_zero]
    else:
        zero_cols = []

    return na_cols + zero_cols
```
Small notes:
I identify NA/zero metrics by loading in metrics from all datasets at once, so that the columns removed are the same across datasets. That is, if the stories dataset has an NA value in `first_order_coherence` but the other datasets do not, the column is still removed from all datasets (see the sketch after these notes).
I also remove `pos_prop_PUNCT` manually, since we discussed that features related to SPACE and PUNCTUATION should be removed because we have manipulated those columns in our cleaning.
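For concreteness, a minimal sketch of the shared removal (the toy dataframes and the `n_tokens` column are made up for illustration; only `first_order_coherence` comes from the note above):

```python
# Sketch (toy data, not the actual pipeline): identify NA/zero-heavy columns
# on the concatenated metrics so the same columns are dropped from every dataset.
import numpy as np
import pandas as pd

metrics_by_dataset = {  # hypothetical stand-ins for the per-dataset metric files
    "stories": pd.DataFrame({"first_order_coherence": [np.nan, 0.2], "n_tokens": [10, 12]}),
    "other": pd.DataFrame({"first_order_coherence": [0.3, 0.4], "n_tokens": [8, 9]}),
}

combined = pd.concat(metrics_by_dataset.values(), ignore_index=True)
cols_to_drop = identify_NA_metrics(combined, percent_zero=0.9)  # -> ["first_order_coherence"]

metrics_by_dataset = {
    name: df.drop(columns=cols_to_drop) for name, df in metrics_by_dataset.items()
}
```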