somaksanyal97 / dissertationProject_Newcastle


Pre-processing #4

Open npw39 opened 1 year ago

npw39 commented 1 year ago

Here is how proper preprocessing looks, taking into account missing values, correlated attributes, categorical attributes that need encoding, and categorical attributes needed for imputation purposes.

# x is assumed to be a pandas DataFrame of the input attributes
import numpy

# drop attributes with more than 20% missing values
# (thresh = minimum number of non-missing values a column must have to be kept)
x.dropna(axis=1, thresh=0.8 * x.shape[0], inplace=True)

# compute Pearson's pairwise correlations between attributes
correlations = x.astype(float).corr()
# use only the upper triangle (without diagonal)
mask = numpy.triu(numpy.ones(correlations.shape), k=1).astype(bool)
correlations.where(mask, inplace=True)
# drop attributes highly correlated to others
columns = [i for i in correlations if any(abs(correlations[i]) >= 0.8)]
x.drop(columns, axis=1, inplace=True)

# drop attributes with single not null value
x = x.loc[:, x.apply(lambda column: column.nunique() > 1)]

# categorical without order (binary or nominal)
nominal = x.columns[x.dtypes == "category"]
binary = x.columns[x.nunique() == 2]
# encoding is not needed for binary categories
to_encode = list(nominal.difference(binary))

# boolean array indicating categorical status (nominal or ordinal)
categorical = ((x.dtypes == "category") | (x.dtypes == "Int64")).values
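Not part of the snippet above, but as a rough illustration of the two remaining steps that to_encode and categorical are meant for (encoding and imputation): a minimal sketch, assuming x is a pandas DataFrame and that most-frequent / median imputation is an acceptable choice.

import pandas

# impute missing values before encoding, using the categorical mask computed above:
# most frequent value for categorical attributes, median for numeric ones
for name, is_categorical in zip(x.columns, categorical):
    if is_categorical:
        x[name] = x[name].fillna(x[name].mode()[0])
    else:
        x[name] = x[name].fillna(x[name].median())

# one-hot encode the nominal (non-binary) categories identified in to_encode
x = pandas.get_dummies(x, columns=to_encode)

pandas.get_dummies is used here only as a placeholder; scikit-learn's OneHotEncoder would work equally well inside a pipeline.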
somaksanyal97 commented 1 year ago

Hi Pawel,

Thanks for sharing this. I have implemented all of these as part of the data pre-processing, in the order we discussed. I am still working on the last bit. I will try to finish it in a few hours and send it to you once it is done. If you have time today, kindly check it and let me know if this works.

Since there were gaps in my code, I could not properly add the methodology, results, discussion, conclusion, and abstract in the draft version. Once I finish the code, I will start on the writing, following your instructions for the necessary editing. My submission is on August 29th, 2023 at 16:30.

Regards, Somak Sanyal
