se162xg / notes


pyspark ml #9

Open se162xg opened 4 years ago

se162xg commented 4 years ago

csv file on HDFS

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
df = spark.read.parquet("")

# save the DataFrame as CSV; coalesce(1) yields a single part file inside the output directory
df.coalesce(1).write.mode("overwrite").option("header", True).csv("hdfs://path/to/csv")

# restore the DataFrame from the CSV output; column types are inferred from the data
df = spark.read.csv("hdfs://path/to/csv", header=True, inferSchema=True)
df.dtypes

csv file on local disk

from pyspark.sql import SparkSession
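import pandas as pd  # not referenced directly, but toPandas() requires pandas to be installed
spark = SparkSession.builder.appName("test").getOrCreate()
df = spark.read.parquet("")
# convert the Spark DataFrame to a pandas DataFrame (this collects every row to the driver),
# then write it to local disk without the pandas index column
df.toPandas().to_csv("/path/to/csv", index=False)

se162xg commented 4 years ago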

VectorAssembler

VectorAssembler is a transformer that combines a given list of columns into a single vector column (the feature vector).

from pyspark.ml.feature import VectorAssembler
vec_assembler = VectorAssembler(inputCols=['age', 'height'], outputCol='features')
new_df = vec_assembler.transform(df)

DataFrame[age: bigint, height: bigint, name: string, features: vector]
age,height,name,features
5,80,Alice,[5.0,80.0]
10,80,Alice,[10.0,80.0]

* StringType columns are not supported as VectorAssembler inputs; the input columns must be numeric, boolean, or vector typed.