qubole / sparklens

Qubole Sparklens tool for performance tuning Apache Spark
http://sparklens.qubole.com
Apache License 2.0

Sparklens for streaming #6

Open dominikabasaj opened 6 years ago

dominikabasaj commented 6 years ago

Hi,

Are there any plans to adapt Sparklens for stream processing? I assume that right now it is suitable only for batch jobs?

Best, Dominika

iamrohit commented 6 years ago

Thanks for bringing this up, @dominikabasaj. This is definitely on our radar, and we will be adding support for streaming. I would encourage you to put on a PM hat and help us define the requirements and use cases for this feature. That will help us validate what we have in mind and make sure you get what you are looking for. CC: @itsvikramagr

iamrohit commented 6 years ago

@dominikabasaj

Here is one way to get it working with a streaming job. I haven't tried it with streaming yet, so let me know if this serves your purpose.

  1. Start your application with --packages qubole:sparklens:0.1.2-s_2.11, but don't specify the spark.extraListeners config.

  2. As part of your application, do the following:

    import com.qubole.sparklens.QuboleNotebookListener
    val QNL = new QuboleNotebookListener(sc.getConf)
    sc.addSparkListener(QNL)

    Basically, create a listener (note that this is the notebook listener, not the job listener) and register it.

  3. Within your streaming function (whatever is called repeatedly), wrap your code in the following:

    QNL.profileIt {
      // Your code here
    }

    Alternatively, if you need more control:

    if (QNL.estimateSize() > QNL.getMaxDataSize()) {
      QNL.purgeJobsAndStages()
    }
    val startTime = System.currentTimeMillis
    // Your Scala code here
    val endTime = System.currentTimeMillis
    // Wait for some time to let all events accumulate
    Thread.sleep(QNL.getWaiTimeInSeconds())
    println(QNL.getStats(startTime, endTime))

  4. Check out https://github.com/qubole/sparklens/blob/master/src/main/scala/com/qubole/sparklens/QuboleNotebookListener.scala for more information.
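
To make the wrap-and-report pattern in the steps above concrete, here is a minimal plain-Scala sketch that runs without Spark or Sparklens. It is not the Sparklens implementation: `ProfileWindowSketch`, `profiled`, `Report`, and `settleMs` are hypothetical names, and the `report` function only stands in for a call such as `QNL.getStats(startTime, endTime)`.

```scala
object ProfileWindowSketch {
  // Hypothetical stand-in for a listener's report call; not part of Sparklens.
  type Report = (Long, Long) => String

  // Runs `block`, records wall-clock bounds around it, waits `settleMs`
  // so that late events could arrive, then prints the report for that window.
  def profiled[R](report: Report, settleMs: Long)(block: => R): R = {
    val startTime = System.currentTimeMillis
    val result = block // the by-name argument is evaluated here, exactly once
    val endTime = System.currentTimeMillis
    Thread.sleep(settleMs)
    println(report(startTime, endTime))
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumed report: just describes the length of the measured window.
    val report: Report = (from, to) => s"window of ${to - from} ms"
    val total = profiled(report, settleMs = 10L) { (1 to 1000).sum }
    println(total) // 500500
  }
}
```

Because the block is passed by name, the wrapper returns whatever the wrapped code returns, so it can be dropped around existing logic without changing it. Note that the sleep runs on the calling thread, so in a streaming job it would add to the time spent on each batch.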

thanks!

akumarb2010 commented 6 years ago

Sorry for the duplicate, but this issue also concerns streaming, so I thought I would add an update here.

We have tried using QuboleJobListener for Structured Streaming, but it only produces a report after the streaming query terminates, and that report covers all jobs together rather than batch by batch.

In general, though, because Structured Streaming applications run continuously, users and developers will want to see stats for every few batches.

A detailed proposal is attached below. Please review it and share your input.

Structured_streaming_sparklens.pdf
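
As a rough illustration of the "stats for every few batches" idea in this comment, and not a summary of the attached proposal, the following plain-Scala sketch groups finished batches into windows and calls a report function once per window. `BatchWindowReporter`, `onBatch`, and the `report` callback are hypothetical names. How the batch start and end times would be obtained from the streaming engine is left out, and the callback is assumed to delegate to something like a time-ranged stats call.

```scala
// Hypothetical helper: calls `report(from, to)` once every `every` batches.
final class BatchWindowReporter(every: Int, report: (Long, Long) => Unit) {
  require(every > 0, "every must be positive")

  private var windowStart: Option[Long] = None
  private var seen = 0

  // Record one finished batch, given its start and end times in epoch millis.
  def onBatch(startMs: Long, endMs: Long): Unit = {
    if (windowStart.isEmpty) windowStart = Some(startMs)
    seen += 1
    if (seen == every) {
      // The window spans from the first batch's start to the last batch's end.
      report(windowStart.get, endMs)
      windowStart = None
      seen = 0
    }
  }
}

object BatchWindowReporterDemo {
  def main(args: Array[String]): Unit = {
    val reporter = new BatchWindowReporter(2, (from, to) => println(s"report for $from..$to"))
    // Four made-up batches of 100 ms each produce two reports.
    reporter.onBatch(0L, 100L)
    reporter.onBatch(100L, 200L) // prints "report for 0..200"
    reporter.onBatch(200L, 300L)
    reporter.onBatch(300L, 400L) // prints "report for 200..400"
  }
}
```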

abhishekd0907 commented 4 years ago

@dominikabasaj @akumarb2010 You can check out our new project, Streaminglens, if you plan to use Sparklens for streaming applications.