intelsdi-x / snap-plugin-collector-mesos

Collects Apache Mesos cluster metrics
http://snap-telemetry.io/
Apache License 2.0

Unpredictable behavior if a metric is requested for a feature that isn't enabled on a Mesos agent #19

Closed (ghost closed this issue 8 years ago)

ghost commented 8 years ago

Currently, this Snap plugin, when running on a Mesos agent, produces metric types for every known metric in the protobuf, regardless of whether the agent is actually capable of producing those metrics. Up to this point, we have expected users not to collect metrics for features that aren't enabled, but this is fragile and surprising. For example, the following net_ metrics are returned by snapctl metric list even though the network isolator is not enabled:

$ ./snapctl metric list | egrep 'net_(tx|rx)' | awk '{print $1}'
/intel/mesos/agent/*/*/net_rx_bytes
/intel/mesos/agent/*/*/net_rx_dropped
/intel/mesos/agent/*/*/net_rx_errors
/intel/mesos/agent/*/*/net_rx_packets
/intel/mesos/agent/*/*/net_tx_bytes
/intel/mesos/agent/*/*/net_tx_dropped
/intel/mesos/agent/*/*/net_tx_errors
/intel/mesos/agent/*/*/net_tx_packets

Because these metrics are available in Snap's metrics catalog, a user can successfully load a task that attempts to do the following:

  "workflow": {
    "collect": {
      "metrics": {
        "/intel/mesos/agent/*/*/net_rx_bytes": {},
        "/intel/mesos/agent/*/*/net_rx_dropped": {},
        "/intel/mesos/agent/*/*/net_rx_errors": {},
        "/intel/mesos/agent/*/*/net_rx_packets": {},
        "/intel/mesos/agent/*/*/net_tx_bytes": {},
        "/intel/mesos/agent/*/*/net_tx_dropped": {},
        "/intel/mesos/agent/*/*/net_tx_errors": {},
        "/intel/mesos/agent/*/*/net_tx_packets": {},
        ...

The task will start successfully, but will eventually be disabled. The logs produce the following message:

gob: gob: cannot encode nil pointer of type *uint64 inside interface
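
The message appears to come from Go's encoding/gob package, which refuses to encode a typed nil pointer stored in an interface value. The following is a minimal sketch that reproduces that behavior in isolation; the sample type and the encodeSample helper are hypothetical names for illustration and are not taken from the plugin or from Snap.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// sample is a hypothetical stand-in for a collected metric whose value is
// carried in an interface{} field.
type sample struct {
	Namespace string
	Data      interface{}
}

// encodeSample gob-encodes a sample and returns any encoding error.
func encodeSample(s sample) error {
	var buf bytes.Buffer
	return gob.NewEncoder(&buf).Encode(s)
}

func main() {
	// Assumption: an optional field that was never populated, such as a
	// statistic for a feature the agent does not have enabled.
	var unset *uint64
	value := uint64(1024)

	// A typed nil pointer inside an interface cannot be gob-encoded,
	// so this call returns an error.
	fmt.Println(encodeSample(sample{Namespace: "net_rx_bytes", Data: unset}))

	// A plain uint64 value encodes without error.
	fmt.Println(encodeSample(sample{Namespace: "net_rx_bytes", Data: value}))
}
```

If this is indeed the failure mode, checking each optional value for nil before handing it to the encoder (or not advertising the metric at all) would avoid the error.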

I have an outstanding TODO item to handle this case here: https://github.com/intelsdi-x/snap-plugin-collector-mesos/blob/af87431dbbd225e5d4f8ebbaedec25b995911ff2/mesos/agent/agent.go#L86-L87

The configuration flags for a given agent can be determined like so:

$ curl -s 'http://10.134.26.13:5051/slave(1)/flags' | python -m json.tool
{
    "flags": {
        "appc_simple_discovery_uri_prefix": "http://",
        "appc_store_dir": "/tmp/mesos/store/appc",
        "authenticatee": "crammd5",
        "cgroups_cpu_enable_pids_and_tids_count": "false",
        "cgroups_enable_cfs": "false",
        "cgroups_hierarchy": "/sys/fs/cgroup",
        "cgroups_limit_swap": "false",
        "cgroups_root": "mesos",
        "container_disk_watch_interval": "15secs",
        "containerizers": "mesos",
        "default_role": "*",
        "disk_watch_interval": "1mins",
        "docker": "docker",
        "docker_kill_orphans": "true",
        "docker_registry": "https://registry-1.docker.io",
        "docker_remove_delay": "6hrs",
        "docker_socket": "/var/run/docker.sock",
        "docker_stop_timeout": "0ns",
        "docker_store_dir": "/tmp/mesos/store/docker",
        "egress_flow_classifier_parent": "root",
        "egress_unique_flow_per_container": "false",
        "enforce_container_disk_quota": "false",
        "ephemeral_ports_per_container": "1024",
        "executor_registration_timeout": "1mins",
        "executor_shutdown_grace_period": "5secs",
        "fetcher_cache_dir": "/tmp/mesos/fetch",
        "fetcher_cache_size": "2GB",
        "frameworks_home": "",
        "gc_delay": "1weeks",
        "gc_disk_headroom": "0.1",
        "hadoop_home": "",
        "help": "false",
        "hostname": "10.134.26.13",
        "hostname_lookup": "true",
        "image_provisioner_backend": "copy",
        "initialize_driver_logging": "true",
        "ip": "10.134.26.13",
        "isolation": "cgroups/cpu,cgroups/mem",
        "launcher_dir": "/usr/local/mesos/libexec/mesos",
        "logbufsecs": "0",
        "logging_level": "INFO",
        "master": "zk://10.134.17.70:2181/mesos",
        "network_enable_snmp_statistics": "false",
        "network_enable_socket_statistics_details": "false",
        "network_enable_socket_statistics_summary": "false",
        "oversubscribed_resources_interval": "15secs",
        "perf_duration": "10secs",
        "perf_interval": "1mins",
        "port": "5051",
        "qos_correction_interval_min": "0ns",
        "quiet": "false",
        "recover": "reconnect",
        "recovery_timeout": "15mins",
        "registration_backoff_factor": "1secs",
        "revocable_cpu_low_priority": "true",
        "sandbox_directory": "/mnt/mesos/sandbox",
        "strict": "true",
        "switch_user": "true",
        "systemd_enable_support": "true",
        "systemd_runtime_directory": "/run/systemd/system",
        "version": "false",
        "work_dir": "/var/lib/mesos"
    }
}

In this particular case, net_ metrics shouldn't be collected because network/port_mapping isn't listed under the isolation key. Additional cases, such as perf_events, also need to be considered.
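
One way to act on the flags output is to parse the isolation value into a set and drop net_ metrics when network/port_mapping is absent. The sketch below assumes the JSON shape shown above; flagsResponse, enabledIsolators, and supportedMetrics are hypothetical names, and the metric names passed in main are only illustrative. It encodes just the one rule described here, so other features would need their own checks.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// flagsResponse mirrors the shape of the agent's flags endpoint output.
type flagsResponse struct {
	Flags map[string]string `json:"flags"`
}

// enabledIsolators returns the set of isolators named in the "isolation" flag.
func enabledIsolators(body []byte) (map[string]bool, error) {
	var resp flagsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	isolators := make(map[string]bool)
	for _, name := range strings.Split(resp.Flags["isolation"], ",") {
		if name = strings.TrimSpace(name); name != "" {
			isolators[name] = true
		}
	}
	return isolators, nil
}

// supportedMetrics drops net_ metrics unless network/port_mapping is enabled.
// Only this rule comes from the issue text; it is not a complete mapping.
func supportedMetrics(names []string, isolators map[string]bool) []string {
	var kept []string
	for _, n := range names {
		if strings.HasPrefix(n, "net_") && !isolators["network/port_mapping"] {
			continue
		}
		kept = append(kept, n)
	}
	return kept
}

func main() {
	// Trimmed copy of the flags output above, keeping only the isolation key.
	body := []byte(`{"flags": {"isolation": "cgroups/cpu,cgroups/mem"}}`)
	isolators, err := enabledIsolators(body)
	if err != nil {
		panic(err)
	}
	names := []string{"cpus_limit", "net_rx_bytes", "net_tx_bytes"}
	fmt.Println(supportedMetrics(names, isolators))
}
```

Keeping the parsing separate from the filtering means the same isolator set could later drive checks for other optional features.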

ghost commented 8 years ago

Started a branch where we can work on this: https://github.com/intelsdi-x/snap-plugin-collector-mesos/tree/get-features. I'll be traveling for the next few days, so I probably won't get to this until later next week.

marcin-krolik commented 8 years ago

Such a "discovery" mechanism would need to be implemented in GetMetricTypes. I recently started a discussion on configuration items in GetMetricTypes. If there is consensus to go down that path, I think we would need to find a different way to handle the case this enhancement describes. Please take a look at Snap#936 and share your thoughts on it.

My view is that there will be "exceptions" that require some kind of discovery mechanism at an early stage, but most plugins should work without it.

ghost commented 8 years ago

Thanks @marcin-krolik, I'll comment on the Snap proposal. I'm going to continue down this path for now and open a PR, and we can revisit later if needed.

ghost commented 8 years ago

This was resolved in #20.