juanrh / sscheck

ScalaCheck for Spark
Apache License 2.0

Improve asynchrony of property evaluation for streaming #22

Open juanrh opened 9 years ago

juanrh commented 9 years ago

Async systems gain performance by switching to another task when the current one blocks waiting for IO. The implementation of DStreamProp.forAllAlways has one large blocking wait: the wait for the next batch to complete. In local mode the local machine is still the one doing the work, but if sparkMaster pointed to a cluster, the driver process would just be waiting. On the other hand, when the new batch completes, the driver suddenly has a lot of work to do in the corresponding foreachRDD, updating the result of the property to account for the new batch. When DStreamProp.forAll is introduced in the future, that update will often be non-trivial if the formula gets complicated. We can fix this by recognizing that the current design of DStreamProp.forAllAlways started from the assumption that we would be sending data from Prop.forAll after each batch interval, because that is what we had in previous attempts based on actor receivers. We don't need to do that anymore, but otherwise we could:

I think this issue should be resolved only after #19, and with a solution compatible with #19.
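
For readers unfamiliar with the blocking-versus-asynchronous distinction described above, here is a minimal sketch in plain Scala. It is not sscheck code and it uses no Spark API: `Batch`, `holdsFor`, `batchCompleted` and `checkAlways` are hypothetical names made up for illustration, and a `Future` stands in for "the next batch has been completed". The only point it makes is that the property update can be attached to batch completion as a callback, so the calling thread waits once at the very end instead of once per batch.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncBatchSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for a completed batch: just the records it holds.
  final case class Batch(id: Int, records: Seq[Int])

  // Hypothetical "always" property: every record is non-negative.
  def holdsFor(batch: Batch): Boolean = batch.records.forall(_ >= 0)

  // Simulates a batch that completes some time after it is requested.
  def batchCompleted(id: Int, records: Seq[Int]): Future[Batch] =
    Future { Thread.sleep(50); Batch(id, records) }

  // The check for each batch runs as a callback when that batch completes;
  // the per-batch results are combined without blocking the caller.
  def checkAlways(batches: Seq[Future[Batch]]): Future[Boolean] =
    Future.sequence(batches.map(_.map(holdsFor))).map(_.forall(identity))

  def main(args: Array[String]): Unit = {
    val passing = checkAlways(Seq(
      batchCompleted(0, Seq(1, 2, 3)),
      batchCompleted(1, Seq(4, 5))))
    val failing = checkAlways(Seq(
      batchCompleted(0, Seq(1, 2, 3)),
      batchCompleted(1, Seq(-1, 5))))

    // The only blocking wait is here, once, at the end of the test case.
    println(Await.result(passing, 5.seconds))
    println(Await.result(failing, 5.seconds))
  }
}
```

Whether something like this fits sscheck depends on how batch completion is exposed to the driver, which this sketch does not model.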