thecodeteam / mesos-module-dvdi

Mesos Docker Volume Driver Isolator module
Apache License 2.0

0.4.3 on mesos-0.28.2 crashes immediately on startup #107

Closed justinclayton closed 8 years ago

justinclayton commented 8 years ago
ABORT: (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:114): Result::get() but state == NONE
Jun 13 19:17:13 slave-1 mesos-slave[22560]: *** Aborted at 1465845433 (unix time) try "date -d @1465845433" if you are using GNU date ***
Jun 13 19:17:13 slave-1 mesos-slave[22560]: PC: @     0x7f5180a0f5f7 __GI_raise
Jun 13 19:17:13 slave-1 mesos-slave[22560]: *** SIGABRT (@0x57f8) received by PID 22520 (TID 0x7f517a2dd700) from PID 22520; stack trace: ***
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f51812c8100 (unknown)
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f5180a0f5f7 __GI_raise
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f5180a10ce8 __GI_abort
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @           0x40b71c _Abort()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @           0x40b75c _Abort()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f518218a2db Result<>::get()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f517aaf724f mesos::slave::DockerVolumeDriverIsolator::recover()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f51822a6a55 mesos::internal::slave::MesosContainerizerProcess::recoverIsolators()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f51822b0207 mesos::internal::slave::MesosContainerizerProcess::_recover()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f51822cd797 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKSt4listINS6_5slave14ContainerStateESaISC_EERK7hashsetINS6_11ContainerIDESt4hashISI_ESt8equal_toISI_EESE_SN_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSU_FSS_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f518276e8a1 process::ProcessManager::resume()
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f518276eba7 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f5181066220 (unknown)
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f51812c0dc5 start_thread
Jun 13 19:17:13 slave-1 mesos-slave[22560]:     @     0x7f5180ad021d __clone
Jun 13 19:17:13 slave-1 systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT

Command used:

Jun 13 19:28:13 slave-1 mesos-slave[27764]: I0613 19:28:13.036092 27723 slave.cpp:194] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --attributes="flavor:m1-slave;java:1.8.0;os:centos7" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" --docker_store_dir="/tmp/mesos/store/docker" --enforce_container_disk_quota="true" --executor_environment_variables="{"DATACENTER":"foo","JAVA_HOME":"\/usr\/jdk1.8.0_31"}" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size=
Jun 13 19:28:13 slave-1 mesos-slave[27764]: "2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="/usr/hadoop-2.6.3" --help="false" --hostname="slave-1.redacted.fqdn" --hostname_lookup="true" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="com_emccode_mesos_DockerVolumeDriverIsolator" --launcher_dir="/usr/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://10.211.194.118:2181,10.211.194.121:2181,10.211.194.122:2181/mesos" --modules="libraries {
Jun 13 19:28:13 slave-1 mesos-slave[27764]:   file: "/usr/lib/libmesos_dvdi_isolator-0.28.2.so"
Jun 13 19:28:13 slave-1 mesos-slave[27764]:   modules {
Jun 13 19:28:13 slave-1 mesos-slave[27764]:     name: "com_emccode_mesos_DockerVolumeDriverIsolator"
Jun 13 19:28:13 slave-1 mesos-slave[27764]:   }
Jun 13 19:28:13 slave-1 mesos-slave[27764]: }
Jun 13 19:28:13 slave-1 mesos-slave[27764]: " --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5050" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resources="ports:[1025-8999,9011-65535]" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"

Works great on 0.28.1.

dvonthenen commented 8 years ago

I will take a look at this.

dvonthenen commented 8 years ago

Hi @justinclayton, I have been testing the binary in the 0.4.3 release and I haven't run into any issues. I am using a minimal configuration to eliminate any possible conflicts between the 0.28.1 and 0.28.2 versions. I have tried running the agent in both service and command-line modes, and everything seems to work OK.

This is my minimal configuration:

nohup /usr/sbin/mesos-slave \
--master=zk://<replace with your zookeeper config>/mesos \
--containerizers=docker,mesos --work_dir=/tmp/mesos \
--modules=file:///usr/lib/dvdi-mod.json \
--isolation="com_emccode_mesos_DockerVolumeDriverIsolator" &
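For anyone reproducing this, here is a minimal sketch of what the modules file referenced by `--modules` could contain. The library path and module name are taken from the flags dump in the original report; the JSON layout follows the usual Mesos modules format. The `MODULES_JSON` variable and its default location are assumptions made for illustration, so the file can be written without root and then copied to `/usr/lib/dvdi-mod.json`.

```shell
# MODULES_JSON is a hypothetical variable; the command above expects the
# file at /usr/lib/dvdi-mod.json, so copy it there once it looks right.
MODULES_JSON="${MODULES_JSON:-./dvdi-mod.json}"

# Library path and module name come from the flags dump in the report.
cat > "$MODULES_JSON" <<'EOF'
{
  "libraries": [
    {
      "file": "/usr/lib/libmesos_dvdi_isolator-0.28.2.so",
      "modules": [
        { "name": "com_emccode_mesos_DockerVolumeDriverIsolator" }
      ]
    }
  ]
}
EOF
```

The name under `modules` has to match the value passed to `--isolation`, otherwise the agent will not find the isolator.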

Would it be possible to stop all your agents and run a single agent on one of your agent nodes in command-line mode (above) to see if this works for you? If it does, keep running in command-line mode but add back your other command-line options until we hit an issue. Running in command-line mode will also create a nohup.out file in your current working directory, which makes it easy to capture the crash. I have a feeling that the issue is caused by a behavior change in Mesos in one of the command-line options you are using.
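The "add options back until it breaks" approach described above can be roughly automated. The following is only a sketch under stated assumptions: `first_failing_flag`, `AGENT_BIN`, `BASE_FLAGS`, and `WAIT_SECS` are hypothetical names, and "still alive after a few seconds" is used as a stand-in for "did not crash on startup", which seems reasonable here because the reported abort happens immediately during recovery.

```shell
# Hypothetical helper: add candidate flags one at a time (cumulatively) and
# print the first flag after which the agent dies shortly after startup.
# Limitation: flags are joined into one string, so values containing
# whitespace are not supported by this sketch.
AGENT_BIN="${AGENT_BIN:-/usr/sbin/mesos-slave}"
WAIT_SECS="${WAIT_SECS:-10}"

first_failing_flag() {
  flags=""
  for flag in "$@"; do
    flags="$flags $flag"
    # Word splitting of BASE_FLAGS and flags is intentional here.
    "$AGENT_BIN" $BASE_FLAGS $flags > agent.out 2>&1 &
    pid=$!
    sleep "$WAIT_SECS"
    if kill -0 "$pid" 2>/dev/null; then
      # Agent survived startup: stop it and try the next flag.
      kill "$pid"
      wait "$pid" 2>/dev/null || true
    else
      # Agent is gone: report the flag that was added last.
      echo "$flag"
      return 0
    fi
  done
  return 0
}
```

A possible invocation would set `BASE_FLAGS` to the minimal configuration shown earlier and pass the remaining options as arguments, for example `first_failing_flag --strict=true --recover=reconnect`. The output of the last attempt is left in `agent.out` for inspection.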

dvonthenen commented 8 years ago

I have been looking at the diffs between 0.28.1 and 0.28.2, and it looks like there were significant changes, specifically in the Mesos Linux filesystem isolator. Continuing to look into this.

dvonthenen commented 8 years ago

I think I see the issue, but I need to verify it. If it is what I think it is, I don't see how the 0.28.2 binary is working on your 0.28.1 configuration. It should be failing there with the same issue, provided the configuration flags on your 0.28.1 and 0.28.2 setups are the same.

dvonthenen commented 8 years ago

@justinclayton I have a test binary that I would like you to take a look at. I believe this should fix the issue. I think the issue stems from one of two scenarios:

I believe that changes to the linux filesystem isolator and the order in which the collection of isolators are being called is causing the working directory not to have the necessary checkpoint data.

libmesos_dvdi_isolator-0.28.2.so.zip

justinclayton commented 8 years ago

@dvonthenen Your test binary worked perfectly. Thanks!

And just to be clear: When I said earlier that 0.28.1 did work, I meant dvdi 0.4.2-0.28.1 running on mesos 0.28.1. I never tried to mix and match them.

dvonthenen commented 8 years ago

@justinclayton thanks again for your help. I will be spinning up another release containing this fix. Will close this issue out when the release is available.

dvonthenen commented 8 years ago

Release 0.4.4 has been published with this fix

https://github.com/emccode/mesos-module-dvdi/releases/tag/v0.4.4