nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
636 stars 116 forks

How to catch error when running flintrock in a bash script? #309

Closed gogolaylago closed 4 years ago

gogolaylago commented 4 years ago

So I don't know if I'm being dense, but I can't figure out how to run flintrock in a bash script and catch errors. This is my script.sh file:

$ cat script.sh

flintrock --config config.yaml launch test_cluster
flintrock --config config.yaml run-command --master-only test_cluster "spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py $1
 ret_code=$?
   if [ $ret_code -ne 0 ]; then 
      exit $ret_code
   fi"
flintrock --config config.yaml destroy --assume-yes test_cluster

Even when an error occurs in my_python_script.py, the .sh script exits as if nothing happened. What should I do?

nchammas commented 4 years ago

Have you tried set -e at the start of script.sh?

gogolaylago commented 4 years ago

Have you tried set -e at the start of script.sh?

I did, so I added

set -x
set -e
...

to print each command as it runs and stop the script when a command fails, but no luck.

nchammas commented 4 years ago

So basically flintrock run-command returns a success even if the run command fails on the cluster?

gogolaylago commented 4 years ago

So basically flintrock run-command returns a success even if the run command fails on the cluster?

Correct. The weird thing is this: when I run this command against the cluster directly, flintrock --config config.yaml run-command --master-only test_cluster 'spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py', I get this error (I purposely referenced a variable, TODAY_STR, that isn't defined in my .py script):

Traceback (most recent call last):
  File "/home/ec2-user/my_python_script.py", line 22, in <module>
    print('Running files on %s' % TODAY_STR)
NameError: name 'TODAY_STR' is not defined

But when I execute the .sh file with the above command in it, nothing happens. Flintrock just went ahead and destroyed the cluster instead of stopping the app:

...
++ flintrock --config config.yaml run-command --master-only test_cluster "spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py $1
 ret_code=$?
   if [ $ret_code -ne 0 ]; then 
      exit $ret_code
   fi"
Running command on master only...
[54.204.180.54] Running command...
[54.204.180.54] Command complete.
run_command finished in 0:00:31.
++ flintrock --config config.yaml destroy --assume-yes test_cluster
Destroying test_cluster...
nchammas commented 4 years ago

What happens if you capture your commands in a script, upload that to the cluster with copy-file, and then execute it with run-command?

i.e.

cat << EOM > le-script.sh
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py $1
ret_code=$?
if [ $ret_code -ne 0 ]; then 
    exit $ret_code
fi
EOM

flintrock copy-file ...
flintrock run-command ... "chmod u+x le-script.sh"
flintrock run-command ... "le-script.sh"  # should correctly reflect exit code of le-script.sh

I'm wondering if the string of commands jammed into a single string is somehow suppressing the return value.

gogolaylago commented 4 years ago

What happens if you capture your commands in a script, upload that to the cluster with copy-file, and then execute it with run-command?

i.e.

cat << EOM > le-script.sh
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py $1
ret_code=$?
if [ $ret_code -ne 0 ]; then 
    exit $ret_code
fi
EOM

flintrock copy-file ...
flintrock run-command ... "chmod u+x le-script.sh"
flintrock run-command ... "le-script.sh"  # should correctly reflect exit code of le-script.sh

I'm wondering if the string of commands jammed into a single string is somehow suppressing the return value.

I think you might be right. I did what you suggested, and the exit code was correctly reported. So I removed this part from "le-script.sh"

> ret_code=$?
> if [ $ret_code -ne 0 ]; then 
>     exit $ret_code
> fi

and then ran flintrock run-command ... "le-script.sh", and the script was correctly interrupted. I wonder what it is about those lines that collides with the command, though. Hmmmm.

Anyways, thank you!!
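One plausible explanation (a guess, not confirmed in the thread): the run-command argument in the original script.sh is enclosed in double quotes, so the local shell expands $? and $ret_code before flintrock ever sends the command to the cluster. A quick way to see the difference:

```shell
true
echo "double quotes: ret_code=$?"   # $? is expanded by the local shell first
echo 'single quotes: ret_code=$?'   # prints the literal text $?
```

Inside double quotes, the remote side would receive a command with the local exit status already baked in, which would explain why the check never triggered remotely.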

nchammas commented 4 years ago

I wonder what about the above lines are colliding with the command though, hmmmm.

It's probably something about how shell return codes are interpreted for a single command vs. a series of commands. Perhaps some fiddling with sub-shells or other Bash/shell constructs might give you more of the behavior you're looking for, but I'm not sure.

e.g.

flintrock --config config.yaml run-command --master-only test_cluster "(
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py $1
ret_code=$?
    if [ $ret_code -ne 0 ]; then 
        exit $ret_code
    fi
)"

(Note the enclosing parentheses.)

But certainly, the most reliable/understandable approach is to copy up a script to the cluster and execute it there, making sure to include set -e in the script.
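A sketch of that approach, pieced together from the commands in this thread (the copy-file destination, the run-command argument, and the || exit handling are assumptions, not verbatim flintrock usage):

```shell
# Build the script that will run on the cluster. Quoting 'EOM' prevents the
# local shell from expanding $1 before the script is written out.
cat << 'EOM' > le-script.sh
set -e   # abort at the first failing command
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 my_python_script.py "$1"
EOM

# Upload and run it, and only destroy the cluster if the job succeeded
# (flintrock invocations sketched from the thread, commented out here):
# flintrock --config config.yaml copy-file test_cluster le-script.sh le-script.sh
# flintrock --config config.yaml run-command --master-only test_cluster "bash le-script.sh my_arg" || exit 1
# flintrock --config config.yaml destroy --assume-yes test_cluster
```

With set -e inside le-script.sh, any failing command makes the script (and thus, presumably, run-command) exit non-zero, so the caller can skip the destroy step.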