vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] #2269

Closed by san-vak 1 year ago

san-vak commented 1 year ago

Hi! Thanks for the great library.

Description
1. Get the explained variance of the components after applying PCA.
2. Export only a subset of columns, such as the virtual PCA columns.

Is your feature request related to a problem?
1. To determine how many components to extract via PCA.
2. To reduce the size of the dataset.

Regards

JovanVeljanoski commented 1 year ago

Hi,

I hope this example shows you what you want to know:

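import vaex
import vaex.ml

# Load the iris example dataset that ships with vaex
df = vaex.datasets.iris()

# Fit a PCA on the four numeric features; the components are added as virtual columns
pca = vaex.ml.PCA(features=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df = pca.fit_transform(df)

# 1. Explained variance of each component, absolute and as a fraction of the total
print(f'Explained variance: {pca.explained_variance_}')
print(f'Explained variance ratio: {pca.explained_variance_ratio_}')

# 2. Export only the selected (virtual) PCA columns
cols_to_export = ['PCA_0', 'PCA_1']
df[cols_to_export].export_hdf5('iris_pca.hdf5', progress='widget')
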
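
For the first point in the request (deciding how many components to keep), one common approach is to keep the smallest number of leading components whose cumulative explained variance ratio reaches a chosen threshold. The sketch below is plain Python and does not call vaex. The helper name `components_for_variance`, the 0.95 threshold, and the sample ratios are all assumptions made for illustration; they are not output from the iris example. The `PCA_<i>` column names follow the naming used in the example above.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold` (hypothetical helper)."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    ratios = list(ratios)
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        # Small tolerance so floating-point round-off does not add a component
        if cumulative >= threshold - 1e-12:
            return k
    # Threshold never reached: keep every component
    return len(ratios)


# Illustrative ratios only, NOT values computed from the iris dataset
example_ratios = [0.70, 0.20, 0.07, 0.03]
n_keep = components_for_variance(example_ratios, threshold=0.95)

# Build the list of PCA column names to export
cols_to_export = [f"PCA_{i}" for i in range(n_keep)]
print(n_keep, cols_to_export)
```

With the example above, `pca.explained_variance_ratio_` could be passed in place of `example_ratios`, and the resulting `cols_to_export` list used for the export step.
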
san-vak commented 1 year ago

Thank you so much!