gorillalabs / sparkling-getting-started

A companion repo to the sparkling getting started guide
http://gorillalabs.github.io/sparkling/articles/getting_started.html

how to submit tf-idf example to cluster #3

Open BenMacKenzie opened 8 years ago

BenMacKenzie commented 8 years ago

This doesn't work:

Edit tf_idf/core.clj and delete the line (conf/master "local"), then run:

lein uberjar
spark-submit --class tf_idf.core --master yarn target/sparkling-getting-started-1.0.0-SNAPSHOT-standalone.jar

chrisbetz commented 8 years ago

Hi,

Sorry, I don't quite understand what you are trying to do and what is not working. To avoid guesswork, could you please a) tell me what you want to achieve and b) attach an error message or a few more details?

Thanks,

Chris

BenMacKenzie commented 8 years ago

Hi Chris,

I just want to submit the sparkling example tf-idf to an actual cluster. The cluster in question is just standard AWS EMR with Spark. I’ve been able to do it with Flambo but not sparkling. I believe the problem relates to AOT compiling. Do you have an example project file for using an actual cluster?

Thanks!

chrisbetz commented 8 years ago

Ah, now I get it :)

no problem, I’ve always pushed my stuff to yarn, so I know it’s doable.

I do not have my cluster at hand right now, but I think that’s a pretty easy one:

You must call the main class directly, not the $_main "sub"class (which was in the mail I got from GitHub but is not visible in the GitHub web UI; I’m really puzzled now).

So

lein uberjar
spark-submit --class tf_idf.core --master yarn target/sparkling-getting-started-1.0.0-SNAPSHOT-standalone.jar

should do the trick instead of

… --class tf_idf.core$_main …
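
For context on where the two class names come from: Clojure compiles every function to its own JVM class, named after the munged namespace and function name, which is how -main ends up as tf_idf.core$_main. The class with the static main entry point is the AOT-compiled namespace class itself, tf_idf.core, assuming the namespace declares (:gen-class) as is usual for an uberjar entry point. A minimal REPL sketch, using a stand-in -main body rather than the real tf-idf code:

```clojure
;; Hypothetical stand-in for the example namespace; only the names matter here.
(ns tf-idf.core
  (:gen-class)) ; only takes effect under AOT compilation (e.g. lein uberjar)

(defn -main [& args]
  (println "tf-idf job would start here"))

;; Dashes in a namespace name become underscores in the JVM class name.
(def main-class-name (namespace-munge 'tf-idf.core))

;; Each function is compiled to its own class: <namespace>$<munged fn name>.
(def fn-class-name (.getName (class -main)))

(println main-class-name) ; the class to pass to spark-submit --class
(println fn-class-name)   ; the function's class, which has no static main method
```

Since spark-submit looks for a static main method on the class it is given, the function class is not a usable entry point.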

Hope this helps, and if it does, you may close the ticket. Otherwise I’ll have to try on Monday morning.

Cheers,

Chris

BenMacKenzie commented 8 years ago

Funny… I had tried that (not calling $_main) but it didn’t seem to work, so I just started from scratch and it worked fine!

Thanks for your help! Look forward to using sparkling.

Ben

chrisbetz commented 8 years ago

Great. Love those bugs, too.

Would love to hear from you again with a successful (sparkling) project, or otherwise I will try to help you again :)

Happy hacking!

Chris

BenMacKenzie commented 8 years ago

I’ll definitely let you know about any successes.

Any plans to support spark sql?

chrisbetz commented 8 years ago

> Any plans to support spark sql?

… I should definitely have a look into that sometime soon. Same for Spark Streaming… I’ll check in the next month and keep everyone updated via Twitter or the mailing list.

Cheers,

Chris

BenMacKenzie commented 8 years ago

Hi Chris,

Sparkling looks like it might work well for us as a tool for data profiling and data cleansing. Here is some of the work I’ve done over the last couple of days. I’m far from a Clojure expert, so the code is probably a bit rough. https://github.com/gatineausoftware/data-profile

Everything has been going very well. I have run into one problem, however: a stack overflow. It’s related to sparkling/reduce, possibly because I am calling a map inside the reduce (not a Spark map, just a regular Clojure map). The big picture is to profile a collection of CSV data to determine whether columns are integers, strings, dates, etc., and, for each data type, what the range is. It would be very simple to do in Clojure because the reduce function can take two parameters; it is more complicated in Spark (but not that complicated). Anyhow, if you are interested or can offer any advice, the problem is with the profile-rdd function.

https://github.com/gatineausoftware/data-profile/blob/master/src/data_profile/profile.clj

Best regards,

Ben

chrisbetz commented 8 years ago

Hi Ben,

Sorry, I'm a little busy today. I will look into your problem, just not right now. I'm sorry I don't have better news at the moment.

Bye

Chris

BenMacKenzie commented 8 years ago

Thanks Chris,

Any insights would be appreciated.

I have somewhat solved the problem. Instead of calling spark/reduce with a function that uses a Clojure map, I call spark/reduce with a function that uses a loop. It seems to work. The problem seems to have something to do with lazy evaluation and thunking.
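
One plausible reading of the stack overflow, offered as a guess since the failing code is not shown in this thread: when a reduce function returns a lazy seq built with map over the previous result, each step wraps the previous lazy seq in another unrealized layer, and realizing the final value has to descend through every layer at once. The sketch below reproduces that pattern in plain Clojure (no Spark involved) next to an eager variant using mapv. The names lazy-sum and eager-sum are made up for illustration.

```clojure
;; Element-wise sum of two rows, returned as a LAZY seq.
(defn lazy-sum [acc row]
  (map + acc row))

;; Same computation, but mapv realizes the result into a vector at each step.
(defn eager-sum [acc row]
  (mapv + acc row))

(def rows (repeat 100000 [1 2 3]))

;; Eager version: every step produces a fully realized vector.
(def eager-result (reduce eager-sum rows))

;; Lazy version: the reduce itself finishes, but it returns a deep nest of
;; unrealized lazy seqs. Realizing them recurses once per layer, which can
;; exhaust the stack depending on the JVM's stack size.
(def lazy-outcome
  (try
    (doall (reduce lazy-sum rows))
    :completed
    (catch StackOverflowError _ :stack-overflow)))

(println eager-result)
(println lazy-outcome)
```

A loop, as described above, avoids the problem for the same reason mapv does: nothing is left unrealized between steps.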

I also somewhat misspoke in my previous email. Profiling columns of data across a collection of rows is awkward in Spark, because the reduce function must be both commutative and associative: commutativity implies that the two arguments must be of the same type. If it only had to be associative, one argument could be a ‘row’ of data and the other an ‘accumulator’ containing all the stats (e.g., min, max). In Spark, I need to associate the ‘accumulator’ with the data during the map phase.
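
The map-then-reduce shape described here can be sketched in plain Clojure as follows. This is an illustration under assumptions, not the code from the linked repository: row->stats and merge-stats are hypothetical names, rows are assumed to be vectors of numbers, and the only statistics tracked are min and max per column. The same two functions could in principle be handed to a Spark map and reduce, because merge-stats takes two values of the same shape and is both associative and commutative.

```clojure
;; Map phase: turn one row into a per-column "accumulator" with the same
;; shape that the reduce phase produces.
(defn row->stats [row]
  (mapv (fn [v] {:min v :max v}) row))

;; Reduce phase: merge two accumulators column by column. Both arguments
;; have the same shape, and min/max are associative and commutative.
(defn merge-stats [a b]
  (mapv (fn [x y]
          {:min (min (:min x) (:min y))
           :max (max (:max x) (:max y))})
        a b))

(def rows [[3 10] [1 40] [7 25]])

(def profile (reduce merge-stats (map row->stats rows)))

;; Merging in a different order gives the same result.
(def profile-reversed (reduce merge-stats (map row->stats (reverse rows))))

(println profile)
```

Extending the per-column map with further fields (for example a count, or a set of observed types) keeps the same structure as long as each field is merged with an associative, commutative operation.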
