neuniversity / ALY6140


Dimensionality reduction #40

Open jingqilin opened 5 years ago

jingqilin commented 5 years ago

Hi, if you have a dataset with many features and not all of them are important, how can you filter out the unimportant variables (dimensionality reduction) to improve prediction accuracy? Regards, Jingqi Lin

CHENGYULIU1 commented 5 years ago

I am not sure of the details, but I think PCA (principal component analysis) would be a good choice. I just don't know how to apply it in practice.
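For anyone wondering what applying PCA looks like, here is a minimal sketch using only NumPy. The data is synthetic and the helper name `pca_reduce` is made up for illustration; it is not from the course material. In practice you would usually standardize the columns first, and scikit-learn's `sklearn.decomposition.PCA` does the same job with more options.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Center each column so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance captured by each component.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_ratio[:n_components]

# Synthetic data (assumption): 200 rows, 5 columns,
# where the last two columns are near-copies of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

X_reduced, ratio = pca_reduce(X, n_components=3)
print(X_reduced.shape)  # number of rows is unchanged, columns reduced to 3
print(ratio.sum())      # fraction of total variance kept by the 3 components
```

Note that PCA builds new combined columns rather than picking a subset of the original ones, so the reduced features are harder to interpret.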

xyz04 commented 5 years ago

Hi,

I think you can call the built-in function filter to filter elements. It takes a function and a sequence, drops the elements for which the function returns False, and returns only the elements that meet the criterion (in Python 3, as an iterator that you can pass to list()).
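A small sketch of how that could look, assuming you already have a list of column names and a set of names you want to exclude (both are invented here for illustration):

```python
# Hypothetical column names, used only for illustration.
columns = ["age", "income", "id", "notes", "score"]
unwanted = {"id", "notes"}

# filter() keeps the items for which the function returns True.
# In Python 3 it returns a lazy iterator, so wrap it in list() to see the result.
kept = list(filter(lambda name: name not in unwanted, columns))
print(kept)  # ['age', 'income', 'score']

# The equivalent list comprehension is often easier to read.
kept_lc = [name for name in columns if name not in unwanted]
```

Keep in mind that this only filters a Python list of names; it does not decide which variables are unimportant.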

Thanks, Xinyu Zhang

Iris0114 commented 5 years ago

Hi, for the columns, you can just drop the ones you don't need. For the rows, you can decide which ones you want to keep and filter out the rest. Hope this helps! ^_^

shahtrupt commented 5 years ago

I agree with Iris0114. To drop the columns, you can use the code below: df = df.drop(columns=['column1', 'column2'])

You can also apply conditional checks to filter the data as per your requirement.
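Here is a rough sketch of both steps on a toy DataFrame. The column names and values are made up for the example:

```python
import pandas as pd

# Toy data (assumption): three columns, one of which is just an identifier.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "income": [40000, 52000, 61000, 45000],
    "customer_id": [101, 102, 103, 104],
})

# Drop a column judged irrelevant to the prediction task.
df = df.drop(columns=["customer_id"])

# Keep only the rows that satisfy a condition (boolean mask).
over_30 = df[df["age"] > 30]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
subset = df[(df["age"] > 25) & (df["income"] < 60000)]
print(subset)
```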

kn1510 commented 5 years ago

Hi,

When you don't know in advance which columns to keep, I recommend plotting a correlation matrix for all variables after handling missing or null values. You can do that using the code below. Once you have the correlation coefficients, choose the variables that have the highest absolute correlation with your target variable.

This is not as rigorous as methods such as Ordinary Least Squares or PCA, because pairwise correlation only captures linear relationships one variable at a time, but it does provide basic insight into the data when a dataset is large and has many variables.

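Code for correlation plot:

    import matplotlib.pyplot as plt
    import seaborn as sns

    matrix = df.select_dtypes("number").corr()  # correlations between numeric columns only
    f, ax = plt.subplots(figsize=(9, 6))
    sns.heatmap(matrix, vmax=.8, square=True, cmap="BuPu", ax=ax)
    plt.show()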
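To go from the correlation matrix to an actual shortlist, something like the following sketch may help. The data is synthetic, and the column names (`x1`, `x2`, `noise`, `target`) and the cutoff of two features are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): the target depends on x1 and x2 but not on "noise".
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "noise": rng.normal(size=n),
    "target": 3 * x1 - 2 * x2 + 0.1 * rng.normal(size=n),
})

# Correlation of every numeric column with the target, excluding the target itself.
corr_with_target = df.select_dtypes("number").corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not discarded.
ranked = corr_with_target.abs().sort_values(ascending=False)
top_features = ranked.head(2).index.tolist()

print(ranked)
print(top_features)
```

Ranking by absolute value matters here because `x2` is negatively correlated with the target and would land at the bottom of a plain descending sort.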

Hope this helps you!

Best, Kalyani

pr24 commented 5 years ago

I think you should drop that column.