quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Explore DataFrames for possible serialization speed-ups #59

Open magsol opened 8 years ago

magsol commented 8 years ago

We need to examine Spark's DataFrame API as a possible alternative to RDDs for representing our data. DataFrames are structured abstractions: Spark knows the schema before execution and can therefore optimize the underlying binary representation, which may reduce serialization overhead.

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

magsol commented 8 years ago

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame

magsol commented 8 years ago

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.types.ArrayType