mdrakiburrahman / rakirahman.me

💻Personal blog powered by Gatsby
https://www.rakirahman.me
MIT License

Hi Raki, Thank you for your Spark Cert study Guide! #9

Closed saryeHaddadi closed 2 years ago

saryeHaddadi commented 2 years ago

Hi Raki,

I wanted to thank you for your Spark Cert study guide! I have just read Part 1, and I thought I had to reach out to thank you! Since this GitHub issue gives me the opportunity, if I may, I would like to ask you a Spark architecture question. Could you confirm that the relation between a Spark Action and a Spark Job is always 1:1, meaning:

Thank you Raki, and have a nice week-end!

Best,

mdrakiburrahman commented 2 years ago

Hi @saryeHaddadi - thanks for the kind words!

My understanding is that the answer is yes to both.

The way I like to think about it is that an Action is self-contained: say you have a DataFrame, and the Action is asking Spark to count it. That is one Action.

To respond to that single Action, Spark spins up a Job. Now, your DataFrame is actually a massive partitioned Dataset (let's say on Azure Storage), so Spark spins up 200 Tasks, which are distributed across Executors. When all 200 Tasks finish, the single Job finishes, and you get your answer to that single Action (e.g. 5 million rows).

Now of course a single Spark script/pipeline is going to have several such Actions, and for each of those Actions you'd have a corresponding Job, Tasks, and so on.
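For intuition only, here's a toy Python sketch of that mapping (this is not real Spark and these are not Spark APIs; all names like `run_job` are made up). Each call to an "Action" triggers exactly one "Job", which fans out into one "Task" per partition:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(partitions):
    """Toy 'Job': one 'Task' per partition, run on an 'executor' pool."""
    with ThreadPoolExecutor(max_workers=4) as executors:
        # each len() call stands in for one Task counting its partition
        task_results = list(executors.map(len, partitions))
    # partial counts are combined into the single Job result for the driver
    return sum(task_results)

# a "DataFrame" split into 8 partitions of rows
partitions = [list(range(1000)) for _ in range(8)]

# Action #1: "count" -> Job #1 (8 Tasks)
total = run_job(partitions)
print(total)  # 8000

# Action #2 on the same data -> a second, separate Job with its own Tasks
total_again = run_job(partitions)
```

Two calls, two Jobs: the Action-to-Job relation stays 1:1 no matter how many Tasks each Job fans out into.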

Hope that makes sense!