Indu-sharma opened this issue 6 years ago
Here is one way to get it working with a streaming job. I haven't tried it with streaming myself yet, so let me know if this serves your purpose.
1. Start your application with `--packages qubole:sparklens:0.1.2-s_2.11`, but don't specify the extraListener config.
2. Create a listener (note that this is the Notebook listener, not the JobListener) and register it:

```scala
import com.qubole.sparklens.QuboleNotebookListener

val QNL = new QuboleNotebookListener(sc.getConf)
sc.addSparkListener(QNL)
```
3. Within your streaming function (whatever is called repeatedly), wrap your code in the following:

```scala
QNL.profileIt {
  // Your code here
}
```

Alternatively, if you need more control:

```scala
if (QNL.estimateSize() > QNL.getMaxDataSize()) {
  QNL.purgeJobsAndStages()
}

val startTime = System.currentTimeMillis
// <-- Your Scala code here -->
val endTime = System.currentTimeMillis

// wait for some time so that all events accumulate
Thread.sleep(QNL.getWaiTimeInSeconds())
println(QNL.getStats(startTime, endTime))
```
4. Check out https://github.com/qubole/sparklens/blob/master/src/main/scala/com/qubole/sparklens/QuboleNotebookListener.scala for more information. A combined end-to-end sketch follows below.
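Putting the steps together, here is a rough end-to-end sketch for Structured Streaming. Everything outside the four steps above is an assumption: it presumes Spark 2.4+ for `foreachBatch`, uses a placeholder socket source and a trivial batch body, and assumes `profileIt` prints the stats for the wrapped code, as described above.

```scala
import com.qubole.sparklens.QuboleNotebookListener
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("sparklens-streaming-demo").getOrCreate()

// Step 2: create and register the notebook listener
val QNL = new QuboleNotebookListener(spark.sparkContext.getConf)
spark.sparkContext.addSparkListener(QNL)

// Placeholder source; swap in your real stream
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Step 3: wrap the repeatedly-called per-batch work in profileIt
val query = lines.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    QNL.profileIt {
      println(s"batch $batchId processed ${batch.count()} rows")
    }
  }
  .start()

query.awaitTermination()
```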
thanks!
We have tried using QuboleJobListener for Structured Streaming, but as @Indu-sharma and others mentioned, it only provides a report after the streaming query terminates, and that report covers all the jobs together (not batch-wise).
In general, since these Structured Streaming applications run continuously, users/developers will want to see stats for every few batches.
A detailed proposal is attached below. Please review and provide your inputs.
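To make the every-k-batches idea concrete, here is a rough sketch (illustrative only, not the attached proposal): it reuses the `QNL` listener from the steps above and assumes the per-batch work runs through `foreachBatch`, whose callback executes serially on the driver. `K`, `streamingDF`, and the window bookkeeping are all made up for the example.

```scala
import org.apache.spark.sql.DataFrame

val K = 10                                  // emit a report every K micro-batches
var windowStart = System.currentTimeMillis  // safe: foreachBatch runs serially on the driver

streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.count()                           // placeholder for the real batch work
    if ((batchId + 1) % K == 0) {
      val windowEnd = System.currentTimeMillis
      Thread.sleep(QNL.getWaiTimeInSeconds())        // let listener events settle
      println(QNL.getStats(windowStart, windowEnd))  // stats covering the last K batches
      // keep the listener's memory bounded before starting the next window
      if (QNL.estimateSize() > QNL.getMaxDataSize()) QNL.purgeJobsAndStages()
      windowStart = windowEnd
    }
  }
  .start()
```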
@akumarb2010: I took a look at the proposal. Yes, you are perfectly right that we need to emit stats every k batches, and that helps identify a streaming job's bottlenecks over time (across batches). However, I'm not sure whether you are considering the sizing estimates after every stats emission.
@akumarb2010 This is a good start! I believe one of the key questions for streaming jobs would be: if my input data rate increases, can I still meet the SLA by adding more compute resources?

One question: how much overlap/difference is there between SparkListener and StreamingListener? Do we need to combine information from both, or is just one of them sufficient? It would be nice to capture what data is available, what needs to be aggregated and when, a bit of detail on the main computations, and how to control the frequency of output.

Another concern is how much data we need to keep in memory for all this to work. We need to keep a bound, or provide a tradeoff between memory and accuracy.

Once again, thanks for taking this up. CC: @itsvikramagr
@iamrohit, @akumarb2010 - Yes, you are right. Apart from making better use of cluster resources, the most important task is managing the streaming pipeline. One of the key metrics to consider is batch latency and its correlation with input rows. We cannot afford to have batch latency consistently exceed the batch trigger duration; when it does, some corrective measure is required.
StreamingQueryListener is very specific to Structured Streaming, and there is no overlap with SparkListener. We would need to combine information from both.
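For reference, a minimal sketch of the Structured Streaming side of that combination: a StreamingQueryListener that surfaces the batch-latency and input-rate metrics mentioned above. The fields used are standard StreamingQueryProgress metrics; the log line format is just illustrative.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // per micro-batch: input volume, input rate, and batch latency
    // (compare triggerExecution against the trigger interval to spot a pipeline falling behind)
    println(s"batch=${p.batchId} inputRows=${p.numInputRows} " +
            s"rowsPerSec=${p.inputRowsPerSecond} " +
            s"triggerExecutionMs=${p.durationMs.get("triggerExecution")}")
  }
})
```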
Awesome! Thanks for taking this up. Let's discuss more once you are ready with the PR.
@iamrohit, @itsvikramagr & @Indu-sharma Thanks for your inputs.
Started working on this feature. Will update you once we are done with the initial PR.
@akumarb2010: Could you please share an update on this, if any? Thank you very much.
@Indu-sharma @akumarb2010 You can check out our new project Streaminglens if you plan to use Sparklens for streaming applications.
Sparklens seems to wait for the job to complete; however, it doesn't seem to work with Spark streaming jobs. Do you have any plans to fix this?