juanrh / sscheck

ScalaCheck for Spark
Apache License 2.0

New generators from Spark datasets using sampling #41

Closed mtelloz closed 8 years ago

mtelloz commented 8 years ago

In this milestone we implement a higher-order ScalaCheck generator (i.e. a function that returns a generator) from RDD[A] to Gen[A]. To implement this properly, the generator should take a sample buffer size bfs as an argument: on the first call the generator samples bfs records from the RDD and returns the first one, and on the subsequent bfs - 1 calls it returns the remaining sampled elements. Once the buffer is exhausted, the next call takes a new sample.
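A minimal sketch of the buffering idea, to make the call pattern concrete. The object name `BufferedRDDGen` is hypothetical and this is not the code in sscheck; it assumes Spark's `RDD.takeSample` for the sampling step and relies on `Gen.flatMap` re-running its function on every evaluation to carry state between calls.

```scala
import org.apache.spark.rdd.RDD
import org.scalacheck.Gen
import scala.collection.mutable

/** Hypothetical helper for illustration, not the sscheck implementation. */
object BufferedRDDGen {
  /** Builds a generator that serves elements of `rdd` from a local buffer of
    * up to `bfs` records, refilling the buffer with a new sample when empty. */
  def apply[A](rdd: RDD[A], bfs: Int): Gen[A] = {
    require(bfs > 0, "buffer size must be positive")
    val buffer = mutable.Queue.empty[A]
    // The function passed to flatMap runs on each evaluation of the
    // generator, so the queue acts as state shared between calls.
    Gen.const(()).flatMap { _ =>
      if (buffer.isEmpty) {
        // One Spark job per refill: sample bfs records to the driver.
        buffer ++= rdd.takeSample(withReplacement = true, num = bfs)
      }
      if (buffer.isEmpty) Gen.fail[A] // empty RDD: nothing to generate
      else Gen.const(buffer.dequeue())
    }
  }
}
```

With this shape, Spark is only contacted once every bfs evaluations. The generator holds mutable state, so it is not safe to share between threads.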

The first part of the milestone is implemented by the classes FromRDDGen and FromRDDGenTest, which read an Avro file using Spark SQL into a DataFrame and then convert it to an RDD. FromRDDGen takes a sample with withReplacement = true, saves it in a list, and returns a generator for the values stored in that list. The test class passes or fails depending on the value given for minTestOK and the value of the defaultToLast parameter.

Possible extensions:

add a parameter to FromRDDGen that selects sampling with or without replacement. To implement sampling without replacement, we store the input RDD in a var, and each time we take a sample to fill the buffer we compute a new RDD that doesn't contain the sampled elements and update the var to point to the new RDD (remember that RDDs are immutable, but we can still use a mutable reference to an immutable object to get mutable state).
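The "mutable reference to an immutable RDD" idea can be sketched as follows. The class name `WithoutReplacementSampler` and the method `draw` are hypothetical, and using `RDD.subtract` to remove the sampled elements is an assumption made here for brevity, not something the issue prescribes.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

/** Hypothetical sketch: draws successive samples without replacement. */
class WithoutReplacementSampler[A: ClassTag](input: RDD[A]) {
  // Mutable reference to an immutable RDD: each draw rebinds it to a new RDD.
  private var remaining: RDD[A] = input

  /** Returns up to `n` elements and removes them from the remaining data. */
  def draw(n: Int): Seq[A] = {
    val sampled = remaining.takeSample(withReplacement = false, num = n).toSeq
    if (sampled.nonEmpty) {
      // subtract drops every element equal to a sampled one, so this is
      // only exact when the RDD contains no duplicate values.
      val sampledRdd = remaining.sparkContext.parallelize(sampled)
      remaining = remaining.subtract(sampledRdd)
    }
    sampled
  }
}
```

Each call to `draw` extends the lineage of `remaining`, so for many refills it may be worth caching or checkpointing the intermediate RDDs.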

The second part of the milestone is almost complete. Two classes were created, FromRDDReplacement and FromRDDReplacementTest. To make it work, it was necessary to add the SparkContext as a parameter of the class, to add another parameter withRepl that determines whether the sample is generated with replacement or not, and to make FromRDDReplacement extend Serializable. The RDD is stored in a var so that it persists between calls, and when the sample is generated we use the broadcast function to distribute a collection of the sampled ids, which is then used to filter the RDD and keep only the rows that were not sampled.
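A rough sketch of the id-based filtering described above, under some assumptions: the class name `IdFilteringSampler` is hypothetical, ids are assigned with `RDD.zipWithUniqueId`, and the SparkContext is taken from the RDD rather than passed as a parameter. It is not the FromRDDReplacement class itself.

```scala
import org.apache.spark.rdd.RDD

/** Hypothetical sketch of sampling without replacement via broadcast ids. */
class IdFilteringSampler[A](input: RDD[A]) {
  // Pair each record with a unique id so duplicate values stay distinguishable.
  private var remaining: RDD[(A, Long)] = input.zipWithUniqueId().cache()

  /** Returns up to `n` records and filters them out of the remaining data. */
  def draw(n: Int): Seq[A] = {
    val sampled = remaining.takeSample(withReplacement = false, num = n).toSeq
    // Ship the sampled ids to the executors once, as a broadcast variable.
    val sampledIds = remaining.sparkContext.broadcast(sampled.map(_._2).toSet)
    // sampledIds is a local val, so the closure captures only the broadcast
    // handle and not `this`.
    remaining = remaining.filter { case (_, id) => !sampledIds.value.contains(id) }
    sampled.map(_._1)
  }
}
```

Because the filter closure here refers only to a local value, this sketch does not need to extend Serializable; a version whose closures refer to class fields would, which may be why the original class required it.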