mesosphere / net-modules

Apache Mesos modules for network isolation.

debugging Mesos and net-modules #90

Open unclejack opened 8 years ago

unclejack commented 8 years ago

The logging provided by the Mesos slave, net-modules, and Marathon offers few hints about why an application fails to start.

A setup which uses Marathon 0.14.0, Mesos 0.26 and the latest net-modules master exhibits the following behaviour:

Observations, things I've tried so far and things I've investigated:

Given the issue described above, I have a few questions

I've read all the documentation I could find, and I've also read some of the code, to try to figure out whether I'm missing something obvious.

Please let me know if I should also provide some logs.

karya0 commented 8 years ago

@unclejack: Can you provide the slave logs corresponding to the failed launch? If possible, I'd like to take a look at the sequence of events in there. We can then talk about adding more logging to the module and see what would help streamline debugging even further.

unclejack commented 8 years ago

@karya0 Of course, you'll have the logs, the full JSON config of the application and all the relevant information in a few hours.

unclejack commented 8 years ago

The full log of the mesos-slave can be found below:

I0201 11:44:00.499222 18350 main.cpp:190] Build: 2016-01-20 09:54:18 by root
I0201 11:44:00.499286 18350 main.cpp:192] Version: 0.26.0
I0201 11:44:00.499372 18350 containerizer.cpp:142] Using isolation: com_mesosphere_mesos_NetworkIsolator,filesystem/posix
I0201 11:44:00.499851 18350 network_isolator.cpp:431] Loading Network Isolator module
I0201 11:44:00.530853 18350 linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0201 11:44:00.537849 18350 main.cpp:277] Starting Mesos slave
I0201 11:44:00.538010 18350 slave.cpp:191] Slave started on 1)@10.0.2.15:5051
I0201 11:44:00.538019 18350 slave.cpp:192] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_auth_server="auth.docker.io" --docker_auth_server_port="443" --docker_kill_orphans="true" --docker_local_archives_dir="/tmp/mesos/images/docker" --docker_puller="local" --docker_puller_timeout="60" --docker_registry="registry-1.docker.io" --docker_registry_port="443" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hooks="com_mesosphere_mesos_NetworkHook" --hostname_lookup="true" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="com_mesosphere_mesos_NetworkIsolator" --launcher_dir="/mesos/build/src" --logbufsecs="0" --logging_level="INFO" --master="zk://localhost:2181/mesos" --modules="libraries {
  file: "/usr/local/lib/mesos/libmesos_network_isolator.so"
  modules {
    name: "com_mesosphere_mesos_NetworkIsolator"
    parameters {
      key: "isolator_command"
      value: "/calico/calico_mesos"
    }
    parameters {
      key: "ipam_command"
      value: "/calico/calico_mesos"
    }
  }
  modules {
    name: "com_mesosphere_mesos_NetworkHook"
  }
}
" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/tmp/mesos"
I0201 11:44:00.538449 18350 slave.cpp:392] Slave resources: cpus(*):1; mem(*):1985; disk(*):35164; ports(*):[31000-32000]
I0201 11:44:00.538480 18350 slave.cpp:400] Slave attributes: [  ]
I0201 11:44:00.538485 18350 slave.cpp:405] Slave hostname: vagrant-ubuntu-trusty-64
I0201 11:44:00.538487 18350 slave.cpp:410] Slave checkpoint: true
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@716: Client environment:host.name=vagrant-ubuntu-trusty-64
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-76-generic
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@725: Client environment:os.version=#120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@733: Client environment:user.name=vagrant
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@log_env@753: Client environment:user.dir=/root/net-modules
2016-02-01 11:44:00,539:18350(0x7febf9f72700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=localhost:2181 sessionTimeout=10000 watcher=0x7fec03589b10 sessionId=0 sessionPasswd=<null> context=0x7febf0003d50 flags=0
I0201 11:44:00.544509 18414 state.cpp:54] Recovering state from '/tmp/mesos/meta'
2016-02-01 11:44:00,581:18350(0x7febf572e700):ZOO_INFO@check_events@1703: initiated connection to server [127.0.0.1:2181]
I0201 11:44:00.581164 18417 status_update_manager.cpp:202] Recovering status update manager
I0201 11:44:00.581164 18417 containerizer.cpp:384] Recovering containerizer
I0201 11:44:00.581691 18417 slave.cpp:4230] Finished recovery
2016-02-01 11:44:00,601:18350(0x7febf572e700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2181], sessionId=0x1529ca3f0700004, negotiated timeout=10000
I0201 11:44:00.601882 18417 group.cpp:331] Group process (group(1)@10.0.2.15:5051) connected to ZooKeeper
I0201 11:44:00.601898 18417 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0201 11:44:00.601904 18417 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I0201 11:44:00.629608 18419 detector.cpp:156] Detected a new leader: (id='0')
I0201 11:44:00.629708 18419 group.cpp:674] Trying to get '/mesos/json.info_0000000000' in ZooKeeper
I0201 11:44:00.635550 18419 detector.cpp:482] A new leading master (UPID=master@10.0.2.15:5050) is detected
I0201 11:44:00.635598 18419 slave.cpp:729] New master detected at master@10.0.2.15:5050
I0201 11:44:00.635702 18419 slave.cpp:754] No credentials provided. Attempting to register without authentication
I0201 11:44:00.635721 18419 slave.cpp:765] Detecting new master
I0201 11:44:00.635737 18419 status_update_manager.cpp:176] Pausing sending status updates
I0201 11:44:01.947412 18419 slave.cpp:904] Registered with master master@10.0.2.15:5050; given slave ID 16fa4ecd-2025-48b0-9031-74f08c512b2d-S0
I0201 11:44:01.947597 18403 status_update_manager.cpp:183] Resuming sending status updates
I0201 11:44:01.947721 18419 slave.cpp:963] Forwarding total oversubscribed resources 
I0201 11:45:00.540904 18403 slave.cpp:4039] Current disk usage 13.53%. Max allowed age: 5.352910639842731days
I0201 11:46:00.378325 18403 slave.cpp:1294] Got assigned task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe for framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:00.380084 18403 slave.cpp:1410] Launching task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe for framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:00.380933 18403 paths.cpp:436] Trying to chown '/tmp/mesos/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe/runs/b16d1bb2-dc6e-4acc-a520-074957065147' to user 'root'
I0201 11:46:00.385494 18403 slave.cpp:4999] Launching executor testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe/runs/b16d1bb2-dc6e-4acc-a520-074957065147'
I0201 11:46:00.385627 18403 slave.cpp:1628] Queuing task 'testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' for executor 'testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:00.385772 18403 containerizer.cpp:618] Starting container 'b16d1bb2-dc6e-4acc-a520-074957065147' for executor 'testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' of framework '16fa4ecd-2025-48b0-9031-74f08c512b2d-0000'
I0201 11:46:00.386060 18403 network_isolator.cpp:211] NetworkIsolator::prepare for container: b16d1bb2-dc6e-4acc-a520-074957065147
I0201 11:46:00.386616 18403 network_isolator.cpp:314] Sending IP request command to IPAM
I0201 11:46:00.387308 18403 network_isolator.cpp:104] Sending command to /calico/calico_mesos: {"args":{"hostname":"vagrant-ubuntu-trusty-64","netgroups":["public"],"num_ipv4":1,"num_ipv6":0,"uid":"38b00d50-9459-4a38-8b3c-f20d73c2a2c7"},"command":"allocate"}
I0201 11:46:00.401631 18403 network_isolator.cpp:125] Got response from /calico/calico_mesos: {"ipv4":"192.168.11.1"}
I0201 11:46:00.401752 18403 network_isolator.cpp:329] IP(s) 192.168.11.1 allocated with IPAM.
I0201 11:46:00.403147 18403 linux_launcher.cpp:365] Cloning child process with flags = CLONE_NEWNET
I0201 11:46:00.407737 18403 containerizer.cpp:851] Checkpointing executor's forked pid 19018 to '/tmp/mesos/meta/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe/runs/b16d1bb2-dc6e-4acc-a520-074957065147/pids/forked.pid'
I0201 11:46:00.408087 18416 network_isolator.cpp:378] Sending isolate command to Isolator
I0201 11:46:00.408524 18416 network_isolator.cpp:104] Sending command to /calico/calico_mesos: {"args":{"container_id":"b16d1bb2-dc6e-4acc-a520-074957065147","hostname":"vagrant-ubuntu-trusty-64","ipv4_addrs":["192.168.11.1"],"netgroups":["public"],"pid":19018},"command":"isolate"}
I0201 11:46:00.485188 18416 network_isolator.cpp:125] Got response from /calico/calico_mesos: {}
I0201 11:46:00.541923 18416 slave.cpp:4039] Current disk usage 13.49%. Max allowed age: 5.356046553481574days
I0201 11:46:05.731503 18417 containerizer.cpp:1257] Executor for container 'b16d1bb2-dc6e-4acc-a520-074957065147' has exited
I0201 11:46:05.731542 18417 containerizer.cpp:1074] Destroying container 'b16d1bb2-dc6e-4acc-a520-074957065147'
I0201 11:46:05.733041 18417 cgroups.cpp:2429] Freezing cgroup /sys/fs/cgroup/freezer/mesos/b16d1bb2-dc6e-4acc-a520-074957065147
I0201 11:46:05.733933 18417 cgroups.cpp:1411] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/b16d1bb2-dc6e-4acc-a520-074957065147 after 868096ns
I0201 11:46:05.738904 18413 cgroups.cpp:2447] Thawing cgroup /sys/fs/cgroup/freezer/mesos/b16d1bb2-dc6e-4acc-a520-074957065147
I0201 11:46:05.739802 18413 cgroups.cpp:1440] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/b16d1bb2-dc6e-4acc-a520-074957065147 after 878080ns
I0201 11:46:05.745223 18419 network_isolator.cpp:407] Requesting IPAM to release IPs: 192.168.11.1 
I0201 11:46:05.745654 18419 network_isolator.cpp:104] Sending command to /calico/calico_mesos: {"args":{"ips":["192.168.11.1"]},"command":"release"}
I0201 11:46:05.758720 18419 network_isolator.cpp:125] Got response from /calico/calico_mesos: {}
I0201 11:46:05.759049 18419 network_isolator.cpp:104] Sending command to /calico/calico_mesos: {"args":{"container_id":"b16d1bb2-dc6e-4acc-a520-074957065147","hostname":"vagrant-ubuntu-trusty-64"},"command":"cleanup"}
I0201 11:46:05.770232 18419 network_isolator.cpp:125] Got response from /calico/calico_mesos: {}
I0201 11:46:05.770485 18419 slave.cpp:3553] Executor 'testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000 terminated with signal Killed
I0201 11:46:05.771064 18419 slave.cpp:2762] Handling status update TASK_FAILED (UUID: 9ba6e7dc-22d1-477c-80ff-dc2f17b1bb2a) for task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000 from @0.0.0.0:0
I0201 11:46:05.771103 18419 network_isolator.cpp:480] NetworkHook::task status label decorator
I0201 11:46:05.771112 18419 network_isolator.cpp:512] NetworkHook:: added ip address(es) 192.168.11.1 
W0201 11:46:05.771200 18419 containerizer.cpp:966] Ignoring update for unknown container: b16d1bb2-dc6e-4acc-a520-074957065147
I0201 11:46:05.771255 18419 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 9ba6e7dc-22d1-477c-80ff-dc2f17b1bb2a) for task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:05.771409 18419 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: 9ba6e7dc-22d1-477c-80ff-dc2f17b1bb2a) for task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:05.772874 18415 slave.cpp:3087] Forwarding the update TASK_FAILED (UUID: 9ba6e7dc-22d1-477c-80ff-dc2f17b1bb2a) for task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000 to master@10.0.2.15:5050
I0201 11:46:05.835114 18418 status_update_manager.cpp:394] Received status update acknowledgement (UUID: 9ba6e7dc-22d1-477c-80ff-dc2f17b1bb2a) for task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:05.835155 18418 status_update_manager.cpp:826] Checkpointing ACK for status update TASK_FAILED (UUID: 9ba6e7dc-22d1-477c-80ff-dc2f17b1bb2a) for task testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:05.836791 18418 slave.cpp:3657] Cleaning up executor 'testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' of framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:05.836951 18418 slave.cpp:3745] Cleaning up framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:46:05.836983 18418 gc.cpp:56] Scheduling '/tmp/mesos/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe/runs/b16d1bb2-dc6e-4acc-a520-074957065147' for gc 6.9999903137037days in the future
I0201 11:46:05.837008 18418 gc.cpp:56] Scheduling '/tmp/mesos/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' for gc 6.99999031342518days in the future
I0201 11:46:05.837018 18418 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe/runs/b16d1bb2-dc6e-4acc-a520-074957065147' for gc 6.99999031328889days in the future
I0201 11:46:05.837025 18418 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000/executors/testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe' for gc 6.99999031319407days in the future
I0201 11:46:05.837034 18418 gc.cpp:56] Scheduling '/tmp/mesos/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000' for gc 6.99999031284741days in the future
I0201 11:46:05.837041 18418 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/16fa4ecd-2025-48b0-9031-74f08c512b2d-S0/frameworks/16fa4ecd-2025-48b0-9031-74f08c512b2d-0000' for gc 6.99999031276741days in the future
I0201 11:46:05.837047 18418 status_update_manager.cpp:284] Closing status update streams for framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000
I0201 11:47:00.386456 18415 slave.cpp:3947] Framework 16fa4ecd-2025-48b0-9031-74f08c512b2d-0000 seems to have exited. Ignoring registration timeout for executor 'testfoo_probe.5d441717-c8d9-11e5-983a-024293d08dbe'
I0201 11:47:00.543108 18418 slave.cpp:4039] Current disk usage 13.08%. Max allowed age: 5.384249413155556days
I0201 11:48:00.545049 18413 slave.cpp:4039] Current disk usage 13.08%. Max allowed age: 5.384248734386365days
I0201 11:49:00.546118 18414 slave.cpp:4039] Current disk usage 13.08%. Max allowed age: 5.384237874079386days
I0201 11:50:00.546681 18403 slave.cpp:4039] Current disk usage 13.08%. Max allowed age: 5.384237195310209days

The JSON configuration I was using is:

{
   "apps" : [
      {
         "backoffSeconds" : 10000,
         "mem" : 32,
         "id" : "probe",
         "uris" : [
            "file:///vagrant/foo"
         ],
         "cpus" : 0.1,
         "cmd" : "./foo",
         "ipAddress" : {
            "groups" : [
               "public"
            ]
         }
      }
   ],
   "id" : "/testfoo"
}

I've set backoffSeconds to a large value to avoid spamming the logs with repeated launch attempts.

Mesos: 0.26 Marathon: 0.14.0 (using the official Docker image from mesosphere) net-modules: current master

Please let me know if there are any other bits of information I should provide.

update: /calico/calico_mesos isn't the real Calico isolator; it's just a dummy isolator used for testing. Renaming that path didn't seem important while testing with a dummy isolator.
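For reference, a dummy isolator along these lines can be reconstructed from the request/response pairs visible in the slave log above (`allocate` answered with `{"ipv4":"192.168.11.1"}`; `isolate`, `release`, and `cleanup` answered with `{}`). This is only a sketch: the exact stdin/stdout contract, and the field names, are assumptions based on the log, not the documented net-modules API.

```python
#!/usr/bin/env python
# Hypothetical dummy isolator/IPAM sketch, reconstructed from the
# "Sending command" / "Got response" pairs in the slave log above.
# Assumption: net-modules writes one JSON object to the command's stdin
# and reads one JSON object back from stdout (check docs/api.md).
import json
import sys

def handle(request):
    command = request.get("command")
    if command == "allocate":
        # Hand back one fixed IPv4 address; a real IPAM would manage a pool.
        # The log shows the response shape {"ipv4": "..."}.
        return {"ipv4": "192.168.11.1"}
    if command in ("isolate", "release", "cleanup"):
        # A real isolator would plumb the network here (veth pair, routes).
        # The log shows these commands answered with an empty object.
        return {}
    return {"error": "unknown command: %s" % command}

def main():
    # What the isolator_command binary would do when invoked by net-modules:
    # read one JSON request from stdin, write one JSON response to stdout.
    sys.stdout.write(json.dumps(handle(json.load(sys.stdin))))
```

A dummy like this satisfies the logged protocol, but, as the discussion below shows, it is not enough for the task to actually run.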

unclejack commented 8 years ago

@karya0 Have you had a chance to take a look at this yet?

karya0 commented 8 years ago

@unclejack: Sorry for the delay in getting back to you. Can you tell me a bit more about your dummy isolator implementation? Is it going to create veth pairs for the container? Without that, the executor won't be able to bind to the given IP address, so it will fail to connect to the agent and eventually exit.

unclejack commented 8 years ago

@karya0 My dummy isolator doesn't set up veth pairs, and this requirement isn't documented in the net-modules API. Could you tell me more about this and any other required side effects, please?

karya0 commented 8 years ago

Okay, so once the isolator is active, it asks the Mesos agent to create a network namespace for the container being launched. Someone then needs to set up the network inside that new namespace. In the case of Calico, the calico binary does all of that as part of the "isolate" command. It also creates the appropriate routes to make it possible for the executor to talk to the agent and any other nodes as needed.
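The plumbing described above can be sketched roughly as follows. This is not what Calico actually does internally; the interface names, the `/32` addressing, and the route layout are illustrative assumptions. The agent clones the executor with `CLONE_NEWNET` (visible in the log above), so the isolator receives the pid and must populate that empty namespace with a usable interface and routes, or the executor can never reach the agent.

```python
# Hedged sketch of the network setup an "isolate" handler must perform.
# Interface names (veth-host/veth-cont) and the exact routes are
# illustrative, not Calico's real implementation.
import subprocess

def isolation_commands(pid, container_ip):
    """Return the ip(8)/nsenter(1) invocations, in order, without running them."""
    return [
        # Create a veth pair on the host.
        "ip link add veth-host type veth peer name veth-cont",
        # Move one end into the container's network namespace (by pid).
        "ip link set veth-cont netns %d" % pid,
        # Inside the namespace: address the interface and bring links up.
        "nsenter -t %d -n ip addr add %s/32 dev veth-cont" % (pid, container_ip),
        "nsenter -t %d -n ip link set lo up" % pid,
        "nsenter -t %d -n ip link set veth-cont up" % pid,
        # Default route out through the veth so the executor can reach the agent.
        "nsenter -t %d -n ip route add default dev veth-cont" % pid,
        # Host side: bring the peer up and route the container IP to it.
        "ip link set veth-host up",
        "ip route add %s/32 dev veth-host" % container_ip,
    ]

def isolate(pid, container_ip):
    # Requires root; executes the commands above in order.
    for cmd in isolation_commands(pid, container_ip):
        subprocess.check_call(cmd.split())
```

A dummy isolator that skips these steps leaves the namespace with only a down loopback interface, which matches the observed symptom: the executor exits a few seconds after launch and the task fails with no module-level error.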

karya0 commented 8 years ago

We can work together on getting a README.md that is more catered towards writing a new isolation-service-provider. Does that sound reasonable?

ozdanborne commented 8 years ago

There's some info on it here, but it's a tad outdated: https://github.com/mesosphere/net-modules/blob/master/docs/api.md#network-virtualizer-api

unclejack commented 8 years ago

@karya0 That sounds good. I'll write my implementation and send some PRs to update the documentation.

@djosborne It's unfortunate that the API documentation is a bit outdated. I've been using that documentation to implement an IPAM and isolator, and up-to-date documentation would have saved me many hours of debugging.

karya0 commented 8 years ago

SGTM.

sanfern commented 8 years ago

I am facing a similar issue. Can you point me to the latest docs?