nebula-orchestrator / worker

The worker node manager container which manages nebula nodes
https://nebula-orchestrator.github.io/
GNU General Public License v3.0

How to check if edge device is updated successfully? #63

Closed Sharvin26 closed 5 years ago

Sharvin26 commented 5 years ago

I have configured Nebula Worker on the Raspberry Pi and Mongo, Nebula Manager, Reporter, Kafka and Zookeeper on the VPS ( Which is a Ubuntu 18.04 machine )

Expected/Wanted Behavior

Status if the Remote device is updated or the update failed using an API call.

Actual Behavior

I am referring to the Nebula documentation: https://nebula.readthedocs.io/en/latest/api/general/

I have tried the "List a filtered paginated view of the optional reports system" section from this documentation.

I get the following information when I call this API, http://<my_vps_url>/api/v2/reports?page_size=1& =>

{
    "data": [
        {
            "_id": {
                "$oid": "5d20785900bb37cdd5352c5c"
            },
            "memory_usage": {
                "total": 926,
                "used": 159,
                "free": 91,
                "available": 680
            },
            "root_disk_usage": {
                "total": 14890,
                "used": 2140,
                "free": 12115
            },
            "cpu_usage": {
                "cores": 4,
                "used_percent": 0.6
            },
            "cron_jobs_containers": [],
            "apps_containers": [
                {
                    "read": "0001-01-01T00:00:00Z",
                    "preread": "0001-01-01T00:00:00Z",
                    "pids_stats": {},
                    "blkio_stats": {
                        "io_service_bytes_recursive": null,
                        "io_serviced_recursive": null,
                        "io_queue_recursive": null,
                        "io_service_time_recursive": null,
                        "io_wait_time_recursive": null,
                        "io_merged_recursive": null,
                        "io_time_recursive": null,
                        "sectors_recursive": null
                    },
                    "num_procs": 0,
                    "storage_stats": {},
                    "cpu_stats": {
                        "cpu_usage": {
                            "total_usage": 0,
                            "usage_in_kernelmode": 0,
                            "usage_in_usermode": 0
                        },
                        "throttling_data": {
                            "periods": 0,
                            "throttled_periods": 0,
                            "throttled_time": 0
                        }
                    },
                    "precpu_stats": {
                        "cpu_usage": {
                            "total_usage": 0,
                            "usage_in_kernelmode": 0,
                            "usage_in_usermode": 0
                        },
                        "throttling_data": {
                            "periods": 0,
                            "throttled_periods": 0,
                            "throttled_time": 0
                        }
                    },
                    "memory_stats": {},
                    "name": "/example-1",
                    "id": "dafc6f075726d61a6b2bc3feffe0cecb738bd43d04eca89c6f3fa72dd9d50193"
                }
            ],
            "current_device_group_config": {
                "status_code": 200,
                "reply": {
                    "apps": [
                        {
                            "app_id": 1,
                            "app_name": "example",
                            "starting_ports": [
                                8080
                            ],
                            "containers_per": {
                                "server": 1
                            },
                            "env_vars": {},
                            "docker_image": "<my_registry_url>/flask",
                            "running": true,
                            "networks": [
                                "nebula"
                            ],
                            "volumes": [
                                "/tmp:/tmp/1",
                                "/var/tmp/:/var/tmp/1:ro"
                            ],
                            "devices": [],
                            "privileged": false,
                            "rolling_restart": false
                        }
                    ],
                    "apps_list": [
                        "example"
                    ],
                    "prune_id": 1,
                    "cron_jobs": [],
                    "cron_jobs_list": [],
                    "device_group_id": 1
                }
            },
            "device_group": "example",
            "report_creation_time": 1562409049,
            "hostname": "worker",
            "report_insert_date": {
                "$date": 1562409049716
            }
        }
    ],
    "last_id": {
        "$oid": "5d20785900bb37cdd5352c5c"
    }
}

I am unable to find which key in the above response tells me whether the device was updated or the update failed. Is there another API for finding this? ( I am unable to find any other API for this purpose. )

I also checked the database and got the following results =>

# mongo
> use nebula
switched to db nebula
> show collections
nebula_apps
nebula_cron_jobs
nebula_device_groups
nebula_reports
nebula_user_groups
nebula_users

I have checked the nebula_reports collection and got the same output that I got from the above API call.

What am I doing wrong here?

naorlivne commented 5 years ago

The only reporting API is the /reports API. You can filter it by the worker hostname (note this will be the worker container hostname, not the server hostname), device_group and Unix timestamp, as described in https://nebula.readthedocs.io/en/latest/api/general/#list-a-filtered-paginated-view-of-the-optional-reports-system
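As a rough illustration of that filtering, the sketch below builds a filtered /reports URL. The query parameter names (`hostname`, `device_group`, `report_creation_time_filter`, `report_creation_time`) are assumptions taken from a reading of the linked documentation and should be checked against it; `build_reports_url` and the example base URL are hypothetical names used only here.

```python
from urllib.parse import urlencode


def build_reports_url(base_url, hostname=None, device_group=None,
                      newer_than=None, page_size=10):
    """Build a filtered /reports URL.

    The query parameter names are assumptions based on the linked
    Nebula API documentation, not verified against a running manager.
    """
    params = {"page_size": page_size}
    if hostname is not None:
        # The worker *container* hostname, not the server hostname.
        params["hostname"] = hostname
    if device_group is not None:
        params["device_group"] = device_group
    if newer_than is not None:
        # Assumed: "gt" keeps reports created after the given Unix timestamp.
        params["report_creation_time_filter"] = "gt"
        params["report_creation_time"] = int(newer_than)
    return f"{base_url.rstrip('/')}/api/v2/reports?{urlencode(params)}"


if __name__ == "__main__":
    print(build_reports_url("http://manager.example", hostname="worker",
                            device_group="example", newer_than=1562409049,
                            page_size=1))
```

The resulting URL can then be requested with whatever HTTP client and authentication the manager is configured for.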

There are two things in the report that will help ensure that the status updated successfully:

  1. The apps field under `current_device_group_config` will give you the current app ID of each app; when it matches the one you get from the /app/app_name you know that the device has pulled the latest needed config from Nebula.
  2. The apps_containers are the current containers on said device as of the report_creation_time. The id of each is unique per container, so if it changes you know the container crashed and a new one started in its place.

So basically if the current_device_group_config includes the correct info and the apps_containers shows a container up of said app you know it's running.
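The two checks above can be sketched as a small client-side helper, assuming a report shaped like the one posted earlier in this thread. The function names and status strings are hypothetical, and matching containers by a `/<app_name>-<n>` name is an assumption inferred from the `/example-1` entry in that sample report. The expected app ID is passed in and has to be fetched separately from the manager.

```python
def reported_app_ids(report):
    """Map app_name -> app_id as seen by the worker in this report."""
    reply = report["current_device_group_config"]["reply"]
    return {app["app_name"]: app["app_id"] for app in reply["apps"]}


def running_container_ids(report, app_name):
    """Return container ids whose name looks like '/<app_name>-<n>'.

    The naming pattern is an assumption based on the sample report.
    """
    wanted = f"/{app_name}"
    return [
        c["id"]
        for c in report.get("apps_containers", [])
        if c.get("name", "").rsplit("-", 1)[0] == wanted
    ]


def update_status(report, app_name, expected_app_id):
    """Classify one report against the app ID the manager currently holds."""
    if reported_app_ids(report).get(app_name) != expected_app_id:
        return "config_pending"      # worker has not pulled the latest config
    if not running_container_ids(report, app_name):
        return "container_missing"   # config is current but no container is up
    return "updated"
```

A container id that differs between two consecutive reports with the same app ID would point to a crash and restart rather than an update.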

If there's a specific bit of info you think would be helpful to add that can be gleaned off the worker, feel free to open a ticket suggesting it (aside from the worker logs: there are many preexisting ways to centralize those that can be combined with Nebula, for example an ELK stack with filebeat running as a Nebula app).

Sharvin26 commented 5 years ago

Hello @naorlivne, thanks for the response.

Scenario =>

I have configured MongoDB on the host ( so that the database persists whenever the manager or worker changes ) and I have connected the manager and reporter to the host MongoDB. All the connections and updates are working properly.

Once an update is completed, I get a new app ID and a new report creation time from the http://<my_vps_url>/api/v2/reports API.

Problem =>

However, I noticed that the records in the nebula_reports collection of the nebula database get deleted automatically after some time ( approximately 2 to 3 hours ). After that, whenever I send a request to http://<my_vps_url>/api/v2/reports I get the following response =>

{
    "data": null,
    "last_id": null
}

Is it the intended behavior that records in the nebula_reports collection are cleaned up automatically after some time, or am I doing something wrong?

Expected Behavior =>

I was expecting the records to persist in the database unless cleared or deleted manually.

References =>

I have attached both docker-compose files for reference =>

docker-compose.yml for Manager, zookeeper, kafka and reporter

version: '3'
services:
  manager:
    container_name: manager
    hostname: manager
    image: nebulaorchestrator/manager
    ports:
      - "80:80"
    restart: unless-stopped
    network_mode: host
    environment:
      MONGO_URL: mongodb://localhost:27017/nebula?authSource=admin
      SCHEMA_NAME: nebula
      BASIC_AUTH_PASSWORD: nebula
      BASIC_AUTH_USER: nebula
      AUTH_TOKEN: nebula

  zookeeper:
    container_name: zookeeper
    hostname: zookeeper
    image: zookeeper:3.4.13
    ports:
      - 2181:2181
    restart: unless-stopped
    environment:
      ZOO_MY_ID: 1

  kafka:
    container_name: kafka
    hostname: kafka
    image: confluentinc/cp-kafka:5.1.2
    ports:
      - 9092:9092
    restart: unless-stopped
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://<my_vps_url>:9092
      KAFKA_BROKER_ID: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  reporter:
    container_name: reporter
    hostname: reporter
    depends_on:
      - kafka
    image: nebulaorchestrator/reporter
    restart: unless-stopped
    network_mode: host
    environment:
      MONGO_URL: mongodb://localhost:27017/nebula?authSource=admin
      SCHEMA_NAME: nebula
      BASIC_AUTH_PASSWORD: nebula
      BASIC_AUTH_USER: nebula
      KAFKA_BOOTSTRAP_SERVERS: localhost:9092
      KAFKA_TOPIC: nebula-reports

docker-compose.yml for worker =>

version: '3'
services:
  worker:
    container_name: worker
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped
    hostname: worker
    environment:
      REGISTRY_HOST: <my_registry_url>
      REGISTRY_AUTH_USER: <my_registry_user>
      REGISTRY_AUTH_PASSWORD: <my_registry_password>
      MAX_RESTART_WAIT_IN_SECONDS: 0
      NEBULA_MANAGER_AUTH_USER: nebula
      NEBULA_MANAGER_AUTH_PASSWORD: nebula
      NEBULA_MANAGER_HOST: <my_vps_url>
      NEBULA_MANAGER_PORT: <my_manager_port>
      NEBULA_MANAGER_PROTOCOL: http
      NEBULA_MANAGER_CHECK_IN_TIME: 5
      DEVICE_GROUP: example
      KAFKA_BOOTSTRAP_SERVERS: <my_vps_url>:9092
      KAFKA_TOPIC: nebula-reports

naorlivne commented 5 years ago

You are correct about the reporting data being purged after some time has passed. By default this happens after 3600 seconds (1 hour), but it can be changed by configuring mongo_report_ttl on your reporter as described in https://nebula.readthedocs.io/en/latest/config/reporter/

The logic behind that default is that Nebula was created to allow a large number of workers, which can quickly create a very large volume of data for the DB to handle, and that most people won't care what happened to their workers an hour ago - new data is still being sent and only data older than an hour is purged.

I'm guessing that the reason why you're seeing null data on the DB is that either the worker is powered off so it's not sending new data or that you're filtering the report by timestamp so reports newer than a given timestamp aren't shown (or both)?

Sharvin26 commented 5 years ago

> I'm guessing that the reason why you're seeing null data on the DB is that either the worker is powered off so it's not sending new data or that you're filtering the report by timestamp so reports newer than a given timestamp aren't shown (or both)?

Yes, you're correct. I had turned off the device because I wanted to confirm my understanding of the purge, since the device was reporting continuously at every NEBULA_MANAGER_CHECK_IN_TIME interval ( in my case, roughly every 5 seconds ).

> You are correct about the reporting data being purged after some time has passed, by default this is a value of 3600 seconds (1 hour)

Do you mean that a report whose report_creation_time is more than 1 hour old will be purged?

> The logic behind it being the default is that Nebula was created to allow a large volume of workers which can quickly create a very large volume of data for the DB to handle & that most people won't care what happened to their workers an hour ago - as new data is still being sent and only data older than an hour is being purged out.

Yes, you are right; continuous reporting produces a large volume of data for the DB to handle.

I have a question: what is the intention behind the worker reporting data continuously to the reporter?

I was expecting data to be reported only when the device is updated successfully or the update fails. That would help maintain the whole update history of the device without a large volume of data accumulating in the database. ( This data could optionally be purged after 6 months or a year. )

naorlivne commented 5 years ago

> > You are correct about the reporting data being purged after some time has passed, by default this is a value of 3600 seconds (1 hour)

> Do you mean that a report whose report_creation_time is more than 1 hour old will be purged?

No, the purge happens based on the report_insert_date timestamp so each report will be deleted after mongo_report_ttl time has passed since it was inserted into MongoDB.
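To make the timing concrete, here is a small sketch that computes when a given report becomes eligible for purging, using the `report_insert_date` value in the form returned by the /reports API (`{"$date": <milliseconds>}`). The helper names are hypothetical, the default of 3600 seconds mirrors the default mentioned above, and how the deletion is actually carried out on the MongoDB side is not covered here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def report_expiry(report, ttl_seconds=3600):
    """Earliest time at which the report may be purged.

    Uses report_insert_date (milliseconds since the epoch), not
    report_creation_time. ttl_seconds stands in for mongo_report_ttl.
    """
    inserted_ms = report["report_insert_date"]["$date"]
    inserted_at = EPOCH + timedelta(milliseconds=inserted_ms)
    return inserted_at + timedelta(seconds=ttl_seconds)


def is_expired(report, now, ttl_seconds=3600):
    """True once `now` (a timezone-aware datetime) has passed the expiry time."""
    return now >= report_expiry(report, ttl_seconds)


if __name__ == "__main__":
    sample = {"report_insert_date": {"$date": 1562409049716}}
    print(report_expiry(sample).isoformat())
```

The actual removal may lag slightly behind the computed expiry time, so this is a lower bound rather than an exact deletion time.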

> > The logic behind it being the default is that Nebula was created to allow a large volume of workers which can quickly create a very large volume of data for the DB to handle & that most people won't care what happened to their workers an hour ago - as new data is still being sent and only data older than an hour is being purged out.

> Yes, you are right; continuous reporting produces a large volume of data for the DB to handle.

> I have a question: what is the intention behind the worker reporting data continuously to the reporter?

This is mostly done to give users a constant report on the memory/CPU/etc. status of the workers.

> I was expecting data to be reported only when the device is updated successfully or the update fails. That would help maintain the whole update history of the device without a large volume of data accumulating in the database. ( This data could optionally be purged after 6 months or a year. )

It might be possible to add a filter to the /reports endpoint that returns only changes of IDs etc. to assist with that. Please open a ticket for that feature request with your exact needs/suggestion.
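In the meantime, something similar can be approximated on the client side. The sketch below is not a Nebula feature: it takes a list of reports already fetched from /reports and keeps only those where a worker's app IDs or container ids differ from its previous report. The function names are hypothetical, and the field names follow the sample report posted earlier in this thread.

```python
def state_of(report):
    """Reduce a report to the parts that change on an update or restart."""
    reply = report["current_device_group_config"]["reply"]
    app_ids = tuple(sorted((a["app_name"], a["app_id"]) for a in reply["apps"]))
    container_ids = tuple(sorted(c["id"] for c in report.get("apps_containers", [])))
    return app_ids, container_ids


def change_events(reports):
    """Keep only reports whose state differs from the same host's previous one."""
    ordered = sorted(reports, key=lambda r: r["report_creation_time"])
    events, previous = [], {}
    for report in ordered:
        host = report["hostname"]
        state = state_of(report)
        if previous.get(host) != state:
            events.append(report)  # first report or a real change for this host
            previous[host] = state
    return events
```

Storing the returned events in a separate collection or file before the reports are purged would give a compact update history per device.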