nchammas closed this pull request 7 years ago.
@kruhly - Following on to the discussion in #129, you may want to test out this PR and see if it fixes the issues you were seeing. It seems to do so for me, though I will try to tweak things a bit more before merging this in.
Having trouble installing the dev version to test.
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc' to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
Will look into it tomorrow.
Can you share your OS, Python version, and the install steps you followed?
I followed the instructions in CONTRIBUTING.md under the Setup heading, but replaced the virtual environment name venv -> fvenv.
OS: Ubuntu 14.04, Python: 3.4.3
Installing collected packages: cryptography, Flintrock, idna, asn1crypto, six, cffi, pycparser
  Running setup.py install for cryptography
    Running command /home/rjk/odesk/flintrock/fvenv/bin/python3 -c "import setuptools, tokenize; __file__='/home/rjk/odesk/flintrock/fvenv/build/cryptography/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-8qfmnyqj-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/rjk/odesk/flintrock/fvenv/include/site/python3.4
    Package libffi was not found in the pkg-config search path.
    Perhaps you should add the directory containing `libffi.pc' to the PKG_CONFIG_PATH environment variable
    No package 'libffi' found
    (the same pkg-config error repeats several times)
    c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory
     #include <ffi.h>
                     ^
    compilation terminated.
    Traceback (most recent call last):
      File "/usr/lib/python3.4/distutils/unixccompiler.py", line 116, in _compile
        extra_postargs)
      File "/usr/lib/python3.4/distutils/ccompiler.py", line 909, in spawn
        spawn(cmd, dry_run=self.dry_run)
      File "/usr/lib/python3.4/distutils/spawn.py", line 36, in spawn
        _spawn_posix(cmd, search_path, dry_run=dry_run)
      File "/usr/lib/python3.4/distutils/spawn.py", line 162, in _spawn_posix
        % (cmd, exit_status))
    distutils.errors.DistutilsExecError: command 'x86_64-linux-gnu-gcc' failed with exit status 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.4/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.4/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.4/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/bdist_egg.py", line 161, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/bdist_egg.py", line 147, in call_command
self.run_command(cmdname)
File "/usr/lib/python3.4/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.4/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/install_lib.py", line 11, in run
self.build()
File "/usr/lib/python3.4/distutils/command/install_lib.py", line 109, in build
self.run_command('build_ext')
File "/usr/lib/python3.4/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.4/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/build_ext.py", line 75, in run
_build_ext.run(self)
File "/usr/lib/python3.4/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/usr/lib/python3.4/distutils/command/build_ext.py", line 448, in build_extensions
self.build_extension(ext)
File "/home/rjk/odesk/flintrock/fvenv/lib/python3.4/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/lib/python3.4/distutils/command/build_ext.py", line 503, in build_extension
depends=ext.depends)
File "/usr/lib/python3.4/distutils/ccompiler.py", line 574, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python3.4/distutils/unixccompiler.py", line 118, in _compile
raise CompileError(msg)
distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1
According to Stack Overflow (https://stackoverflow.com/questions/38109637/package-libffi-was-not-found-in-the-pkg-config-search-path-redhat6-5), you need to install libffi-dev, and perhaps some related packages too.
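The Stack Overflow fix amounts to installing the libffi development headers before pip compiles cffi. A sketch for Debian/Ubuntu (package names differ on other distros, e.g. libffi-devel on RHEL/CentOS):

```shell
# Install the libffi headers that cffi needs to compile (Debian/Ubuntu).
sudo apt-get update
sudo apt-get install -y libffi-dev

# Verify that pkg-config can now find libffi before retrying the pip install.
pkg-config --exists libffi && echo "libffi found"
```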
Installing libffi-dev fixed the problem. I will do some testing tomorrow. Thanks
Today I did 5 launches with 1 master, 1 worker, and t2.micro instances, with short time intervals in between.
2 launches finished in ~2.3 minutes with a health report of 0 workers; the logs showed successful configuration and a registered worker, and the Spark UI showed the registered worker.
1 launch finished in ~8 minutes. The health report showed 1 worker as expected. Looks like it recovered as expected.
1 launch got stuck configuring the master: it timed out waiting for the Spark master to come up and asked to terminate the cluster. The AWS console showed the master and worker running. There were two Spark log files, .out and .out.1; the .out.1 log showed a binding exception, while the .out log showed successful configuration of the master and registration of the worker.
1 launch timed out waiting for the Spark master to come up and asked to terminate the cluster. The AWS console showed the master and worker running. All three log files (.out, .out.1, .out.2) had the binding exception, and the Spark UI couldn't be reached. A manual call to start-all.sh started the master and reported that the worker was already running and should be stopped first. The Spark UI came up with the master only; several minutes later the worker appeared.
Hope this helps.
2 launches finished in ~2.3 minutes with a health report of 0 workers
Yeah, sometimes the health report runs before all the workers are fully registered. It's an annoying issue but relatively harmless.
1 launch finished in ~8 minutes. The health report showed 1 worker as expected. Looks like it recovered as expected.
So Flintrock showed the "Trying again..." message a few times followed by success? That's good.
1 launch got stuck configuring the master: it timed out waiting for the Spark master to come up and asked to terminate the cluster. The AWS console showed the master and worker running. There were two Spark log files, .out and .out.1; the .out.1 log showed a binding exception, while the .out log showed successful configuration of the master and registration of the worker.
1 launch timed out waiting for the Spark master to come up and asked to terminate the cluster. The AWS console showed the master and worker running. All three log files (.out, .out.1, .out.2) had the binding exception, and the Spark UI couldn't be reached. A manual call to start-all.sh started the master and reported that the worker was already running and should be stopped first. The Spark UI came up with the master only; several minutes later the worker appeared.
That's a bummer. This can definitely be addressed by waiting longer or trying more times, but at this point it feels like a very bad problem. And you're seeing this on launches, not restarts. My issues have typically been with restarts and only rarely with launches.
What region are you working in?
Are all the destroys/launches targeted at a cluster with the same name? What about if you try altering the name every time? I wonder if this problem is related to some kind of lease expiration/reacquisition on named resources in AWS.
From your sample of 5 launches you seem to be experiencing the problem much worse than even I am. :(
Replying inline:
So Flintrock showed the "Trying again..." message a few times followed by success? That's good.
I did not think to note the details as it was successful.
1 launch got stuck configuring the master: it timed out waiting for the Spark master to come up and asked to terminate the cluster. The AWS console showed the master and worker running. There were two Spark log files, .out and .out.1; the .out.1 log showed a binding exception, while the .out log showed successful configuration of the master and registration of the worker.
1 launch timed out waiting for the Spark master to come up and asked to terminate the cluster. The AWS console showed the master and worker running. All three log files (.out, .out.1, .out.2) had the binding exception, and the Spark UI couldn't be reached. A manual call to start-all.sh started the master and reported that the worker was already running and should be stopped first. The Spark UI came up with the master only; several minutes later the worker appeared.
That's a bummer. This can definitely be addressed by waiting longer or trying more times, but at this point it feels like a very bad problem. And you're seeing this on launches, not restarts. My issues have typically been with restarts and only rarely with launches.
Seeing it on launch and restarts but tested the PR using launches because we are currently using launch in our pipeline.
What region are you working in?
us-east-1
Are all the destroys/launches targeted at a cluster with the same name? What about if you try altering the name every time? I wonder if this problem is related to some kind of lease expiration/reacquisition on named resources in AWS.
Launches use different names each time, but the cluster parameters are the same, which I don't think rules out the reuse of AWS resources.
From your sample of 5 launches you seem to be experiencing the problem much worse than even I am. :(
The test was not a typical workflow because our batches vary in size and arrival time, which changes the cluster parameters and the time between launches. Fortunately, we don't normally see performance this poor.
@kruhly - I upped the number of attempts Flintrock will make to start the Spark and HDFS masters. Could you repeat your test of this branch and let me know how it goes?
I'm curious as to how bad the problem is for you, and whether with enough attempts things work out in the end. If they do, perhaps we should add an option for cases like these where the user is okay with waiting a while for Flintrock to retry starting the masters. But it's nonetheless maddening that this happens and I don't really have good leads as to why...
I have one additional thing I may try (and I'll ask you to repeat your test again, if that's OK 🙏), but I'll wait to see what you find first regarding the additional attempts.
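For context, "upping the number of attempts" boils down to a bounded retry loop around master startup. A minimal sketch of the idea (hypothetical names: start_with_retries, start_master, and master_is_up are illustrations, not Flintrock's actual functions):

```python
import time

def start_with_retries(start_master, master_is_up, max_attempts=3, wait_seconds=5):
    """Try to start the master up to max_attempts times.

    start_master and master_is_up are caller-supplied callables.
    Returns the number of attempts used, or raises if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        start_master()
        if master_is_up():
            return attempt
        # Master didn't come up (e.g. a port-binding exception); wait and retry.
        time.sleep(wait_seconds)
    raise RuntimeError("Master failed to start after {} attempts".format(max_attempts))
```

The trade-off discussed here is exactly the max_attempts value: a higher limit masks transient EC2 networking flakiness at the cost of longer waits before Flintrock gives up.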
Will do.
After pulling the changes, I ran 5 tests using the same settings as the previous tests without an issue :-)
2 launches finished in around 2 minutes with no signs of trying again
3 launches finished in around 3.5 minutes with no signs of trying again
I will try another round tomorrow.
Huh... I'm baffled as to why the latest changes would have any impact then, since I just upped the number of retries and it sounds like you're not needing to use them. Maybe the bad run from a couple of days ago was a fluke?
I've set the attempt limit back down to where I'd ideally want to keep it. If this branch still works for you (which it seems like it should), then I think we have a good-enough solution for now. Let me know, and I'll make some final cleanups and merge this in.
On my side, I'm able to go through the full acceptance test suite without any errors, which is good.
Maybe the phase of the moon or pressure on AWS resources?
I ran 6 more launches with the same configuration as before. All successful.
3 launches succeeded on the first attempt
3 launches succeeded on the second attempt
I'm fine with the lower attempt limit. If AWS is having a bad day, we will just have to take a break ... Thanks
That's great news. I'll clean up this PR a bit and merge it in. Thanks for testing it out @kruhly! Your feedback was helpful in confirming that I've addressed the issue.
This PR band-aids a couple of related issues that have plagued Flintrock for a while: Spark and HDFS sometimes get stuck on master startup, especially after a cluster restart.
The issue is somehow related to EC2's network stack, but I don't have any solid leads on the root cause. Instead, I've implemented a good-enough workaround that finally gets the full acceptance suite passing on the first try. It's been a while since I've been able to do that.
TODO:
This is probably as good a solution as I'm going to find for now, so I'm marking this PR as a fix for these two issues:
Fixes #129. Fixes #157.