spark-jobserver / spark-jobserver

REST job server for Apache Spark

Spark Job Server + Spark Streaming + Zeppelin #821

Closed leizhanggit closed 5 years ago

leizhanggit commented 7 years ago

Hello, I have a project that needs to analyze real-time data from a Spark Streaming job.

My plan is:

  1. Use Spark Streaming to collect real-time data from Kafka.
  2. Use Spark Job Server to expose the real-time data to other Spark jobs (shared RDD).
  3. Use Zeppelin to launch jobs and do real-time analysis.

I have several questions:

  1. How can I expose real-time data through Spark Job Server?
  2. Does Spark Job Server support Zeppelin, so that I can launch Spark jobs from Zeppelin to get the shared RDD?

velvia commented 7 years ago

Hmm, how real-time does the data you want to expose need to be? What kind of queries do you need to run on it? It sounds like you need more of a real-time data store to run queries against.

-Evan

leizhanggit commented 7 years ago

I just want one Spark app to store the real-time data in a Spark DataFrame, and another Spark app to read that DataFrame.

Maybe that was a little confusing. Essentially I am asking: can the shared RDD feature of SJS work with Spark Streaming?

velvia commented 7 years ago

So here’s the thing. Spark Streaming works on micro-batches, and each batch is a separate RDD. There is a feature in Spark DStreams that will persist RDDs (though I haven’t used it before, so I’m not sure how many RDDs get persisted). You could have some other job in SJS that shares the same StreamingContext and queries these persisted RDDs, though usually these would be streaming computations and not strictly data frames. Spark 2.x has something called “Structured Streaming”, which is like streaming data frames. I’m not aware of an easy way to do non-streaming computations on the RDDs.
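
To make the micro-batch point concrete, here is a minimal plain-Python model of the idea. It is not Spark or SJS code: `BatchStore`, `micro_batches`, and `ad_hoc_sum` are hypothetical names made up for this sketch, each tuple stands in for one batch RDD, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval. The sketch only shows why a separate query sees a bounded window of recent batches rather than the whole stream.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence, Tuple

# One immutable micro-batch; a stand-in for a single batch RDD (assumption).
Batch = Tuple[int, ...]


class BatchStore:
    """Hypothetical holder that retains only the most recent micro-batches."""

    def __init__(self, max_batches: int) -> None:
        # A bounded deque drops the oldest batch once the limit is reached.
        self._batches: Deque[Batch] = deque(maxlen=max_batches)

    def add_batch(self, records: Sequence[int]) -> None:
        self._batches.append(tuple(records))

    def snapshot(self) -> List[Batch]:
        # Copy the retained batches so a reader is unaffected by later appends.
        return list(self._batches)


def micro_batches(records: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Cut the input into fixed-size batches (Spark cuts by time instead)."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def ad_hoc_sum(store: BatchStore) -> int:
    """A separate, non-streaming query over whatever batches are retained."""
    return sum(value for batch in store.snapshot() for value in batch)


store = BatchStore(max_batches=2)
for batch in micro_batches(list(range(1, 8)), batch_size=3):
    store.add_batch(batch)

print(store.snapshot())   # [(4, 5, 6), (7,)]
print(ad_hoc_sum(store))  # 22
```

The first batch `(1, 2, 3)` has already been dropped by the time the query runs, which is the practical limit of this approach: an ad hoc job only ever sees the batches that are still retained.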

bsikander commented 5 years ago

Closing due to inactivity.