nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Automatically restart Spark and HDFS master a few times if there is an issue starting them up #204

Closed · nchammas closed this 7 years ago

nchammas commented 7 years ago

This PR band-aids a couple of related issues that have plagued Flintrock for a while: Spark and HDFS sometimes get stuck on master startup, especially after a cluster restart.

The issue is somehow related to EC2's network stack, but I don't have any solid leads on the root cause. Instead, I've implemented a good-enough workaround that finally gets the full acceptance suite passing on first try. It's been a while since I've been able to do that.
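In rough terms, the workaround is just a bounded retry around the master startup step. A minimal sketch of the shape of it (the helper callables, attempt count, and wait time below are hypothetical placeholders, not Flintrock's actual code):

```python
import time


def start_master_with_retries(start_master, master_is_up,
                              max_attempts=3, wait_seconds=30):
    """Try to start a master service (Spark or HDFS), retrying on failure.

    `start_master` and `master_is_up` stand in for the real SSH-based
    start and health-check steps; the counts and delays are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        start_master()
        if master_is_up():
            return
        if attempt < max_attempts:
            print("Master did not come up. Trying again...")
            time.sleep(wait_seconds)
    raise Exception(
        "Master failed to start after {} attempts.".format(max_attempts))
```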

TODO:

This is probably as good a solution as I'm going to find for now, so I'm marking this PR as a fix for these two issues:

Fixes #129. Fixes #157.

nchammas commented 7 years ago

@kruhly - Following on from the discussion in #129, you may want to test out this PR and see if it fixes the issues you were seeing. It seems to do so for me, though I will try to tweak things a bit more before merging this in.

kruhly commented 7 years ago

Having trouble installing the dev version to test.

Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found

Will look into it tomorrow.

nchammas commented 7 years ago

Can you share:

kruhly commented 7 years ago

Following the instructions in CONTRIBUTING.md under the Setup heading, but with the virtual environment name changed from venv to fvenv.

OS: Ubuntu 14.04
Python: 3.4.3

Installing collected packages: cryptography, Flintrock, idna, asn1crypto, six, cffi, pycparser
  Running setup.py install for cryptography
    Running command /home/rjk/odesk/flintrock/fvenv/bin/python3 -c "import setuptools, tokenize;__file__='/home/rjk/odesk/flintrock/fvenv/build/cryptography/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-8qfmnyqj-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/rjk/odesk/flintrock/fvenv/include/site/python3.4

Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
[the same pkg-config error repeats several more times]

c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory
 #include <ffi.h>
                 ^
compilation terminated.
Traceback (most recent call last):
  File "/usr/lib/python3.4/distutils/unixccompiler.py", line 116, in _compile
    extra_postargs)
  File "/usr/lib/python3.4/distutils/ccompiler.py", line 909, in spawn
    spawn(cmd, dry_run=self.dry_run)
  File "/usr/lib/python3.4/distutils/spawn.py", line 36, in spawn
    _spawn_posix(cmd, search_path, dry_run=dry_run)
  File "/usr/lib/python3.4/distutils/spawn.py", line 162, in _spawn_posix
    % (cmd, exit_status))
distutils.errors.DistutilsExecError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.4/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.4/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/bdist_egg.py", line 161, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/bdist_egg.py", line 147, in call_command
    self.run_command(cmdname)
  File "/usr/lib/python3.4/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.4/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/usr/lib/python3.4/distutils/command/install_lib.py", line 109, in build
    self.run_command('build_ext')
  File "/usr/lib/python3.4/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.4/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/build_ext.py", line 75, in run
    _build_ext.run(self)
  File "/usr/lib/python3.4/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/usr/lib/python3.4/distutils/command/build_ext.py", line 448, in build_extensions
    self.build_extension(ext)
  File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/lib/python3.4/distutils/command/build_ext.py", line 503, in build_extension
    depends=ext.depends)
  File "/usr/lib/python3.4/distutils/ccompiler.py", line 574, in compile
    self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
  File "/usr/lib/python3.4/distutils/unixccompiler.py", line 118, in _compile
    raise CompileError(msg)
distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1
nchammas commented 7 years ago

According to Stack Overflow (https://stackoverflow.com/questions/38109637/package-libffi-was-not-found-in-the-pkg-config-search-path-redhat6-5), you need to install libffi-dev, and perhaps some related packages too.
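(On Ubuntu 14.04 that would be something like `sudo apt-get install libffi-dev`; building cryptography from source typically also wants build-essential, python3-dev, and libssl-dev if they aren't already installed.)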

kruhly commented 7 years ago

Installing libffi-dev fixed the problem. I will do some testing tomorrow. Thanks


kruhly commented 7 years ago

I did 5 launches today with 1 master and 1 worker on t2.micro instances, with short time intervals in between.

2 launches finished in ~2.3 minutes with a health report of 0 workers; the logs showed successful configuration and a registered worker, and the Spark UI showed the registered worker.

1 launch finished in ~8 minutes. The health report showed 1 worker, as expected. It looks like it recovered as expected.

1 launch stuck on configuring master. Timed out waiting for spark master to come up and asks to terminate cluster. AWS console shows master and worker running. Two spark log files .out and .out.1. .out.1 log file shows binding exception. .out log file shows successful configuration of master and register of worker.

1 launch timed out waiting for spark master to come up and asks to terminate cluster. AWS console shows master and worker running. Three log files .out, .out.1, .out.2 all have the binding exception. spark-ui can't be reached. Manual call to start-all.sh starts master and returns that worker is running and to stop first. spark-ui comes up with master only, then several minutes later the worker appears.

Hope this helps.

nchammas commented 7 years ago

> 2 launches finished in ~2.3 minutes with a health report of 0 workers

Yeah, sometimes the health report runs before all the workers are fully registered. It's an annoying issue but relatively harmless.

> 1 launch finished in ~8 minutes. The health report showed 1 worker, as expected. It looks like it recovered as expected.

So Flintrock showed the "Trying again..." message a few times followed by success? That's good.

> 1 launch stuck on configuring master. Timed out waiting for spark master to come up and asks to terminate cluster. AWS console shows master and worker running. Two spark log files .out and .out.1. .out.1 log file shows binding exception. .out log file shows successful configuration of master and register of worker.
>
> 1 launch timed out waiting for spark master to come up and asks to terminate cluster. AWS console shows master and worker running. Three log files .out, .out.1, .out.2 all have the binding exception. spark-ui can't be reached. Manual call to start-all.sh starts master and returns that worker is running and to stop first. spark-ui comes up with master only, then several minutes later the worker appears.

That's a bummer. This can definitely be addressed by waiting longer or trying more times, but at this point it feels like a very bad problem. And you're seeing this on launches, not restarts. My issues have typically been with restarts and only rarely with launches.

What region are you working in?

Are all the destroys/launches targeted at a cluster with the same name? What about if you try altering the name every time? I wonder if this problem is related to some kind of lease expiration/reacquisition on named resources in AWS.

From your sample of 5 launches you seem to be experiencing the problem much worse than even I am. :(

kruhly commented 7 years ago

Replying inline:

> So Flintrock showed the "Trying again..." message a few times followed by success? That's good.

I did not think to note the details as it was successful.

> 1 launch stuck on configuring master. Timed out waiting for spark master to come up and asks to terminate cluster. AWS console shows master and worker running. Two spark log files .out and .out.1. .out.1 log file shows binding exception. .out log file shows successful configuration of master and register of worker.
>
> 1 launch timed out waiting for spark master to come up and asks to terminate cluster. AWS console shows master and worker running. Three log files .out, .out.1, .out.2 all have the binding exception. spark-ui can't be reached. Manual call to start-all.sh starts master and returns that worker is running and to stop first. spark-ui comes up with master only, then several minutes later the worker appears.
>
> That's a bummer. This can definitely be addressed by waiting longer or trying more times, but at this point it feels like a very bad problem. And you're seeing this on launches, not restarts. My issues have typically been with restarts and only rarely with launches.

Seeing it on launches and restarts, but I tested the PR using launches because we are currently using launch in our pipeline.

> What region are you working in?

us-east-1

> Are all the destroys/launches targeted at a cluster with the same name? What about if you try altering the name every time? I wonder if this problem is related to some kind of lease expiration/reacquisition on named resources in AWS.

Launches are with different names each time, but the cluster parameters are the same, which I don't think rules out the reuse of AWS resources.

> From your sample of 5 launches you seem to be experiencing the problem much worse than even I am. :(

The test was not a typical workflow, because our batches vary in size and arrival time, which changes the cluster parameters and the time between launches. Fortunately, we don't normally see performance this poor.


nchammas commented 7 years ago

@kruhly - I upped the number of attempts Flintrock will make to start the Spark and HDFS masters. Could you repeat your test of this branch and let me know how it goes?

I'm curious as to how bad the problem is for you, and whether with enough attempts things work out in the end. If they do, perhaps we should add an option for cases like these where the user is okay with waiting a while for Flintrock to retry starting the masters. But it's nonetheless maddening that this happens and I don't really have good leads as to why...
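For illustration only, such a user-facing knob might look something like this (a hypothetical sketch assuming a click-based CLI; the option name, default, and wiring are not Flintrock's actual code):

```python
import click


@click.command()
@click.option('--master-start-attempts', default=3, show_default=True,
              help='How many times to try starting the Spark and HDFS '
                   'masters before giving up.')
def launch(master_start_attempts):
    """Hypothetical launch command that exposes the master retry limit."""
    click.echo(
        'Will try starting the masters up to {} times.'
        .format(master_start_attempts))


if __name__ == '__main__':
    launch()
```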

I have one additional thing I may try (and I'll ask you to repeat your test again, if that's OK 🙏), but I'll wait to see what you find first regarding the additional attempts.

kruhly commented 7 years ago

Will do.

kruhly commented 7 years ago

After pulling the changes, I ran 5 tests using the same settings as the previous tests, without an issue :-)

2 launches finished in around 2 minutes, with no sign of the "Trying again..." message.

3 launches finished in around 3:30, with no sign of the "Trying again..." message.

I will try another round tomorrow.

nchammas commented 7 years ago

Huh... I'm baffled as to why the latest changes would have any impact then, since I just upped the number of retries and it sounds like you're not needing to use them. Maybe the bad run from a couple of days ago was a fluke?

nchammas commented 7 years ago

I've set the attempt limit back down to where I'd ideally want to keep it. If this branch still works for you (which it seems like it should), then I think we have a good-enough solution for now. Let me know, and I'll make some final cleanups and merge this in.

On my side, I'm able to go through the full acceptance test suite without any errors, which is good.

kruhly commented 7 years ago

Maybe the phase of the moon or pressure on AWS resources?

I ran 6 more launches with the same configuration as before. All successful.

3 launches succeeded on the first attempt.

3 launches succeeded on the second attempt.

I'm fine with the lower attempt limit. If AWS is having a bad day, we will just have to take a break... Thanks.

nchammas commented 7 years ago

That's great news. I'll clean up this PR a bit and merge it in. Thanks for testing it out @kruhly! Your feedback was helpful in confirming that I've addressed the issue.