opensearch-project / dashboards-observability

Visualize and explore your logs, traces and metrics data in OpenSearch Dashboards
https://opensearch.org/docs/latest/observability-plugin/index/
Apache License 2.0

[FEATURE] OpenSearch Synthetics #75

Open ps48 opened 2 years ago

ps48 commented 2 years ago

Synthetics Design Document

  1. Overview
  2. Motivation
  3. How is it different from other plugins?
  4. Requirements
  5. UI Mockups
  6. Architecture
  7. Miscellaneous
  8. Data Model
  9. OpenSearch/Plugins REST endpoints
  10. Appendix
  11. References

Code: https://github.com/opensearch-project/observability/tree/uptime

1. Overview

Synthetics is a new module in observability that enables users to monitor availability and response times of applications and services in real time. This tool provides the ability to understand availability and response time components of services and applications. Users can detect problems before they affect their end customers.

2. Motivation

Synthetics systems are useful for measuring the stability and reliability of live systems and for analyzing their health. Continuous monitoring of microservice-based software systems is an essential component of observability. Synthetics opens the door to infrastructure visibility by proactively pinging API endpoints. It can be considered an autopilot extension to human observation, especially when combined with other OpenSearch capabilities, namely reporting, anomaly detection and alerting.

Synthetics can be utilized for the following use-cases:

  1. Measure availability of apps or services from a public endpoint, as well as of specific components
  2. Show response times from different locations around the world
  3. Get a historical view of logs over a period of time
  4. Get a live status view of apps and services

3. How is it different from other plugins?

4. Requirements

4.1 Functional Requirement

  1. Users must be able to make endpoint API requests as scheduled jobs.
  2. Users must be able to make endpoint API requests on demand.
  3. Users should be able to create synthetics test-suites for public endpoints.
  4. Users should be able to create test-suites for OpenSearch, OpenSearch Dashboards and its plugins using REST endpoints.

4.1 Dashboards Observability

4.1.1 Synthetics Home

  1. Users should be able to view summary graphs of test-suites availability and pings over time.
  2. Users should see a history table of all recent requests made from all test-suites.
  3. Users should be able to filter the graphs and table on home page with a PPL search bar.
  4. Users should be able to filter home page with a date filter.
  5. Users should be able to investigate deeper into a test-suite with links to related events, visualizations, panels and App Analytics page.

4.1.2 Test-Suite View

  1. Users should see test-suite summary table with locations, URL, protocol and tags.
  2. Users should be able to view 3 graphs: test-suite duration, pings over time and location(table/map) based test-suites.
  3. Users should be able to see a history table with recent pings (1000); each row should expand into a detailed request view with response headers, body and timing chart.
  4. Users should be able to filter the history table by location and sort it by response duration, error message or response code.

4.1.3 Add & Configure Test-Suite

  1. Users should be able to add new test-suites to the synthetics plugin.
  2. Users should be able to process response with PPL queries, to validate the app and service being up.
  3. Users should be able to configure a test-suite with the options below:
    1. Id
    2. Name
    3. Protocol/Type of endpoint: [http, tcp, icmp]
    4. App Id
    5. Notebook Id
    6. Saved Query Id
    7. Saved Visualization Id
    8. Panel Id
    9. Enabled bool
    10. Scheduler Job Id
    11. ipv4 bool
    12. ipv6 bool
    13. ip resolver mode
    14. timeout
    15. tags
    16. keep_null
    17. Type specific configuration:
      1. ICMP:
        1. Hosts Array
        2. Wait duration
      2. TCP:
        1. hosts
        2. ports
        3. check
        4. proxy url
        5. proxy resolver
        6. ssl
          1. certificate authorities
          2. certificate [for client SSL auth]
          3. key [for client SSL auth]
          4. supported protocols e.g ["TLSv1.0", "TLSv1.1", "TLSv1.2"]
          5. key-passphrase
          6. verification-mode: [full, strict, verification, none]
      3. HTTP
        1. hosts
        2. max redirects
        3. proxy url
        4. proxy headers
        5. username
        6. password
        7. ssl [similar to tcp]
        8. index headers
        9. index response: [on_error, never, always] controls when to index the response
        10. check: [method/status, headers, body] applies for both request and response

Note: Username and passwords are stored for accessing an endpoint with HTTP basic authentication.
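
As an illustration of the note above: HTTP basic authentication sends the credentials as a Base64-encoded `username:password` pair in the `Authorization` header. The following is a minimal Python sketch; the function name `basic_auth_header` is hypothetical and not taken from the synthetics client.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic `Authorization` header (RFC 7617).

    Hypothetical helper for illustration only.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because Base64 is an encoding and not encryption, stored credentials like these should only be sent over SSL-protected connections.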

More detailed HTTP request:

# Synthetics Client in Opensearch-Observability

## Test-Suite Configuration:

Taking a look at `sample_testsuite.yml`, we can see:
---
  name: "Sample Test Suite"
  type: "http"
  appId: ""
  notebookId: ""
  savedQueryId: ""
  savedVisualizationId: ""
  operationalPanelId: ""
  ipv4: true
  ipv6: true
  resolverMode: "all"
  timeoutSeconds: 16
  tags:
  - "news"
  - "apis"
  keepNull: true
  hosts:
  - "https://opensearch.org"
  - "https://opensearch.org/synthetics"
  - "https://github.com/opensearch-project"
  maxRedirects: 1
  proxyURL: ""
  proxyHeaders: {}
  username: ""
  password: ""
  ssl:
    enabled: false
    default: true
    certificateAuthorities: CAINFO
    certificate: ""
    key: ""
    supportedProtocols: "TLSv1.1" # https://curl.se/libcurl/c/CURLOPT_SSLVERSION.html
    keyPassphrase: ""
    ecAlgorithmCurves: ""
    falseStart: false # https://curl.se/libcurl/c/CURLOPT_SSL_FALSESTART.html
    cipherList: "" # https://curl.se/docs/ssl-ciphers.html
    verifyHost: true # https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html
    verifyPeer: true # https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html
  indexHeaders: true
  indexResponse: "always"
  request:
    method: "GET"
    headers: {'Accept-Encoding': None, 'Content-Encoding':'gzip'}
    body: ""
    json: {}
  response:
    status:
    - 200
    - 301
    headers: {}
    body: {}
  scheduler:
    scheduleType: "interval"
    schedule:
      period: 20
      unit: "seconds"

* The `name` field is the name you specify for this particular Test-Suite; it has to be alphanumeric.
* `type` is the protocol for this Test-Suite; you can choose from http, tcp, or icmp.
* `appId`, `notebookId`, `savedQueryId`, `savedVisualizationId`, and `operationalPanelId` are the ids needed if you want to connect this Test-Suite with other OpenSearch components.
* `ipv4` is a boolean value specifying if the request should be handled through IPv4
* `ipv6` is a boolean value specifying if the request should be handled through IPv6
* `resolverMode` takes a string that specifies the resolver mode
* `timeoutSeconds` is the amount of time (in seconds) before any connection should time out, takes a number
* `keepNull` is about whether fields with no values should be kept null, takes a boolean
* `hosts` can be a list of endpoints that are in the Test-Suite and will be pinged based on the schedule
* `maxRedirects` is the max number of redirects the connection should go through before being cut off.
* `proxyURL` takes the URL of the proxy to use
* `proxyHeaders` are the headers sent along with requests to the proxy URL
* `username` is a string attached to the request as the username
* `password` is a string attached to the request as the password
* `ssl` is where the paths of the various SSL certificates can be specified:
  - `enabled` is false by default (also when `ssl` is not specified), in which case SSL is not attempted. If true, SSL is required for all communication; otherwise an error is thrown.
  - `default`, if true, attempts to find and use the CA bundles contained in the `certifi` library. If false, a CA path is needed in `certificateAuthorities`.
  - `certificateAuthorities` is a required file path to a valid CA cert bundle for SSL, unless `verifyPeer` is false, in which case an empty file can be given.
  - `certificate` should be a file path leading to a certificate of type "PEM".
  - `key` should be a file path to a private key of type "PEM".
  - `keyPassphrase` is the password required to use `key`.
  - `supportedProtocols` is the protocol version range for the SSL/TLS handshake to use. The given protocol version is the minimum version used; later versions can be used as well. For example, if the value here is "TLSv1.1", then TLSv1.1, TLSv1.2, and TLSv1.3 are available for use. The default value is "TLSv1.0". This input method is used to keep the behavior of TLS libraries consistent.
  - `ecAlgorithmCurves` takes a colon-delimited list of EC algorithms.
  - `falseStart` is a boolean with a default value of false; when true, false start is used. False start is a mode where a TLS client saves a round trip on a full handshake by sending data before verifying the server message.
  - `cipherList` takes a colon-delimited string of one or more ciphers. The default follows an internal configuration.
  - `verifyHost`, when true, verifies that the name in the server's certificate matches the host name, and fails otherwise. Default is true.
  - `verifyPeer` checks whether the server certificate is authentic. Authenticity is based on the CA certificates supplied. Default is true. Disabling it makes communication insecure and allows man-in-the-middle attacks.
* `indexHeaders` is a boolean; the headers are indexed if true
* `indexResponse` specifies when the response should be indexed: `on_error`, `never` or `always`
* `request.method` is the protocol's method that should be used
* `request.headers` are the request headers that will be sent along with the request. Must be in JSON format.
* `request.body` is the body that will be sent along with the request
* `request.json` is a JSON payload that can be sent along with the request
* `response.status` lists the status codes that result in an 'UP' status for a host.
* `response.headers` are headers that the response is checked to contain
* `response.body` is a body that the response is checked to contain
* `scheduler.scheduleType` is the type of schedule that should be used. Possible types are 'interval' and 'cron'
* `scheduler.schedule` takes one of two forms, depending on the schedule type:
  - example interval:

    period: 20
    unit: 

The period specifies the length of each interval (it has to be a number) and the unit specifies the unit of that time (it has to be one of "weeks", "days", "hours", "minutes", and "seconds"). The job triggers once and then again each time the interval elapses. Documentation
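
Some of the field semantics described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names (`allowed_tls_versions`, `host_status`, `schedule_interval`) are hypothetical, and the 'DOWN' value is assumed as the counterpart of 'UP'; none of this is taken from the actual client code.

```python
from datetime import timedelta

# Ordered from oldest to newest; `supportedProtocols` names the minimum.
TLS_VERSIONS = ["TLSv1.0", "TLSv1.1", "TLSv1.2", "TLSv1.3"]
INTERVAL_UNITS = {"weeks", "days", "hours", "minutes", "seconds"}


def allowed_tls_versions(minimum: str = "TLSv1.0") -> list:
    """Return the minimum version and every later version."""
    if minimum not in TLS_VERSIONS:
        raise ValueError(f"unsupported protocol: {minimum!r}")
    return TLS_VERSIONS[TLS_VERSIONS.index(minimum):]


def host_status(status_code: int, accepted_statuses: list) -> str:
    """A host is 'UP' when its status code is listed in `response.status`.

    'DOWN' is an assumed name for the opposite state.
    """
    return "UP" if status_code in accepted_statuses else "DOWN"


def schedule_interval(period: int, unit: str) -> timedelta:
    """Convert an interval schedule (period + unit) into a timedelta."""
    if unit not in INTERVAL_UNITS:
        raise ValueError(f"unit must be one of {sorted(INTERVAL_UNITS)}")
    return timedelta(**{unit: period})
```

With the sample configuration, `schedule_interval(20, "seconds")` gives a 20-second interval, and a 301 response counts as 'UP' because 301 is listed under `response.status`.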

4.1.4 Certificates

  1. Users should be able to store certificates for TLS/SSL requests.
  2. Users should be able to view a table of all certificates with their name, age, expiration date, issuer and SHA fingerprints.
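
The SHA fingerprints shown in such a table are usually hashes over the DER-encoded certificate. A minimal sketch, assuming the certificate bytes are already available in DER form; the function name `sha_fingerprints` is hypothetical:

```python
import hashlib


def sha_fingerprints(der_bytes: bytes) -> dict:
    """Return colon-separated SHA-1 and SHA-256 fingerprints of DER bytes."""

    def fmt(hex_digest: str) -> str:
        # Split the hex digest into byte pairs: "ab12..." -> "AB:12:..."
        pairs = [hex_digest[i:i + 2] for i in range(0, len(hex_digest), 2)]
        return ":".join(pairs).upper()

    return {
        "sha1": fmt(hashlib.sha1(der_bytes).hexdigest()),
        "sha256": fmt(hashlib.sha256(der_bytes).hexdigest()),
    }
```

A PEM certificate can be converted to DER first with the standard library's `ssl.PEM_cert_to_DER_cert`.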

4.1.5 Settings

  1. Users should be able to select a certificate age threshold for displaying a certificate warning.
  2. Users should be able to configure the auto-delete interval for deleting old endpoint responses stored in the observability index.

4.2 OpenSearch Observability

4.2.1 Observability Scheduler

  1. Register scheduled jobs for running synthetics test-suites in the jobs-definition.
    1. Based on intervals
    2. Based on cron jobs
  2. Use Endpoint Client APIs to run scheduled jobs at given interval/cron time.
  3. Callback Endpoint Client after a scheduled job is run, to store the response and timings.

4.2.2 Endpoint Client

  1. Allow users to make calls to ICMP, TCP and HTTP endpoints.
  2. Endpoint client should be able to make auth-based requests with certificates and other SSL configurations.
  3. Endpoint client should be able to send a payload in the request and validate the payload received in the response.
  4. Endpoint client should be able to support SOCKS5 proxy endpoints.

4.2.3 Indexing Client

  1. Query and store monitor configuration including endpoint details in .opensearch-observability index.
  2. Query and store endpoint certificates in .opensearch-observability index.
  3. Query and store all request information (headers, responses and timings) in a new observability-synthetics-logs index
  4. Use the scheduler to auto-delete old responses at regular intervals.

4.3 Optional requirements:

4.3.1 Plugin Integrations

  1. Reporting needs a direct integration with the synthetics component; unlike alerting and anomaly detection, there is no “custom” way to use it.
  2. For Alerting & Anomaly Detection, users can configure alerts and detectors on the observability-synthetics-logs index. This index will store the logs generated by synthetics test-suites.

4.3.2 Reporting

  1. Users should be able to generate reports on summary of synthetics stats for a given time period.
  2. Users should be able to generate report for a particular test-suite.

4.3.3 Anomaly Detection

  1. Users should be able to add a detector to find anomalies in response time over a given period for a test-suite.
  2. Users should be able to view/delete/edit added anomaly detectors.

4.3.4 Alerting

  1. Users should be able to add and configure alerting monitors for:
    1. A down status of an endpoint
    2. A down status over a period of time
    3. Increase in response time over a threshold
  2. Users should be able to view all the alerting monitors configured for synthetics
  3. Users should be able to delete an alerting monitor.
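
The three alert conditions above can be sketched as plain checks over log entries. This is only an illustration of the logic, not how Alerting monitors are defined: the entries are assumed to carry the `status`, `startTime` and `endTime` fields of the syntheticsLog data model, 'DOWN' is an assumed status value, and the function names are hypothetical.

```python
def is_down(log: dict) -> bool:
    """Condition 1: a single ping reported a down status."""
    return log["status"] == "DOWN"


def down_for_period(logs: list, period_ms: int, now_ms: int) -> bool:
    """Condition 2: every ping within the last `period_ms` was down."""
    recent = [log for log in logs if now_ms - log["endTime"] <= period_ms]
    # Require at least one ping, so an empty window does not raise an alert.
    return bool(recent) and all(is_down(log) for log in recent)


def slow_responses(logs: list, threshold_ms: int) -> list:
    """Condition 3: pings whose duration exceeded the threshold."""
    return [log for log in logs
            if log["endTime"] - log["startTime"] > threshold_ms]
```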

6. Architecture

6.1 Architecture 1

[Image: Architecture 1 diagram]

6.1.1 OpenSearch-Observability

6.1.2 Dashboards-Observability

Pros

Cons

6.2 Architecture 2

[Image: Architecture 2 diagram]

6.2.1 OpenSearch-Observability

6.2.2 Dashboards-Observability

Pros

Cons

6.3 Architecture 3 [Preferred]

[Image: Architecture 3 diagram]

6.3.1 OpenSearch-Observability

6.3.2 Dashboards-Observability

Pros

Cons

7. Miscellaneous

7.1 FGAC for OpenSearch

7.2 Options for having location in Synthetics:

Some potential solutions:

  1. Require users to provide their own location-lookup service with their own API key, so that they are responsible for how many requests they make
  2. Find some way to keep a database for location lookups in our own service, so that users can make unlimited calls
  3. Pay for it ourselves (not recommended)
  4. Have a location table index, where IP addresses are prepopulated and the customer can just grab the index and do the IP-location conversion themselves. This can then be incorporated into any information/maps in Synthetics UI [preferred]
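
A minimal sketch of the preferred option, assuming the location table maps CIDR blocks to location names. The table contents, the location names and the `lookup_location` function are hypothetical; the IP ranges are reserved documentation ranges.

```python
import ipaddress

# Hypothetical prepopulated table: CIDR block -> location.
LOCATION_TABLE = {
    "192.0.2.0/24": "us-west",
    "198.51.100.0/24": "eu-central",
}


def lookup_location(ip: str, table: dict = LOCATION_TABLE):
    """Return the location of the first block containing `ip`, else None."""
    addr = ipaddress.ip_address(ip)
    for cidr, location in table.items():
        if addr in ipaddress.ip_network(cidr):
            return location
    return None
```

A linear scan is enough for a sketch; a real table with many ranges would likely need a sorted or indexed lookup.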

8. Data Model

8.1 syntheticsTestSuite (Architecture-1 [6.1])

{

    "syntheticsTestSuite": {
        "name": "Sample Test-Suite",
        "type": "http",
        "appId": "",
        "notebookId": "",
        "savedQueryId": "",
        "savedVisualizationId": "",
        "operationalPanelId": "",
        "enabled": true,
        "ipv4": true,
        "ipv6": true,
        "resolverMode": "all",
        "timeoutSeconds": 16,
        "tags": ["news", "apis"],
        "keep_null": true,
        "hosts": ["http://samplehost:8000"],
        "maxRedirects": 0,
        "proxyURL": "http://proxy.mydomain.com:3128",
        "proxyHeaders": {},
        "username": "",
        "password": "",
        "ssl": {},
        "indexHeaders": true,
        "indexResponse": "always",
        "check": {
            "request": {
                "method": "GET",
                "headers": {},
                "body": ""
            },
            "response": {
                "status": 200,
                "headers": {},
                "body": {}
            }
        }
    }
}
// Information passed to observability job scheduler, 
// while configuring a test-suite.
// Returned Job Scheduler Id is stored in the test-suite index. 
{ 
    "scheduler": {
        "schedule_type": "recurring",
        "schedule": {
            "interval": {
                "period": 12,
                "unit": "HOURS",
                "start_time": 1635463349333
            },
            "enabled": true,
            "enabled_time": 1635463349332
        }
    }
}
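
For illustration, the next trigger time of such a recurring interval schedule could be derived as follows. This sketch assumes `start_time` is in epoch milliseconds and that runs happen at whole multiples of the interval after `start_time`; the actual job scheduler may behave differently, and `next_run_ms` is a hypothetical name.

```python
from datetime import timedelta


def next_run_ms(schedule: dict, now_ms: int) -> int:
    """Return the next trigger time in epoch milliseconds."""
    interval = schedule["interval"]
    # "HOURS" -> "hours", which timedelta accepts as a keyword argument.
    step = timedelta(**{interval["unit"].lower(): interval["period"]})
    step_ms = int(step.total_seconds() * 1000)
    start_ms = interval["start_time"]
    if now_ms < start_ms:
        return start_ms
    elapsed_intervals = (now_ms - start_ms) // step_ms
    return start_ms + (elapsed_intervals + 1) * step_ms
```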

8.2 syntheticsTestSuite (Architecture-2 6.3)

---
  name: "Sample Test Suite"
  type: "http"
  appId: ""
  notebookId: ""
  savedQueryId: ""
  savedVisualizationId: ""
  operationalPanelId: ""
  enabled: true
  ipv4: true
  ipv6: true
  resolverMode: "all"
  timeoutSeconds: 16
  tags:
  - "news"
  - "apis"
  keep_null: true
  hosts:
  - "http://opensearch.org"
  - "http://opensearch.org/synthetics"
  - "http://github.com/opensearch-project"
  maxRedirects: 0
  proxyURL: ""
  proxyHeaders: {}
  username: ""
  password: ""
  ssl: {}
  indexHeaders: true
  indexResponse: "always"
  request:
    method: "GET"
    headers: {'Accept-Encoding': None, 'Content-Encoding':'gzip'}
    body: ""
    json: {}
  response:
    status:
    - 200
    headers: {}
    body: {}
  scheduler:
    schedule-type: "interval"
    schedule:
      period: 20
      unit: "seconds"

8.3 syntheticsLogs

{
    "syntheticsLog": {
        "syntheticsSuiteId": "",
        "status": "UP",
        "type": "http",
        "URL": "http://samplehost:8000",
        "request": {
            "method": "GET",
            "headers": {},
            "body": ""
        },
        "response": {
            "status": 200,
            "headers": {},
            "body": {}
        },
        "startTime": 1635463349345,
        "endTime": 1635463349398,
        "dnsTimeMs": 100,
        "ConnectionTimeMs": 500,
        "sslTimeMs": 0, 
        "ttfbMs": 121,
        "downloadTimeMs": 2, 
        "contentSizeKB": 3.5
    }
}
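
A short sketch of how values for the summary views could be derived from such log documents. The function names are hypothetical, and computing availability as the percentage of 'UP' pings is an assumption, not a definition from this document.

```python
def summarize_log(doc: dict) -> dict:
    """Extract the fields a history table row would need."""
    log = doc["syntheticsLog"]
    return {
        "url": log["URL"],
        "status": log["status"],
        "statusCode": log["response"]["status"],
        "durationMs": log["endTime"] - log["startTime"],
    }


def availability(logs: list):
    """Percentage of pings with status 'UP', or None if there are no pings."""
    if not logs:
        return None
    up = sum(1 for log in logs if log["status"] == "UP")
    return 100.0 * up / len(logs)
```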

[Optional] 8.4 syntheticsSettings

{
    "syntheticsSettings": {
        "certificateThresholdDays": 30, //UI will show a warning if a certificate is under the mentioned threshold  
        "logAutoDeletePeriod": 30, // Logs above this age will be auto deleted
        "logAutoDeleteUnit": "DAYS"
    }
}
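
These settings could be applied roughly as sketched below. The sketch assumes that the certificate threshold refers to the days remaining until expiry, and that log age is measured from the log's `endTime` in epoch milliseconds; both are interpretations, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SETTINGS = {
    "certificateThresholdDays": 30,
    "logAutoDeletePeriod": 30,
    "logAutoDeleteUnit": "DAYS",
}


def certificate_warning(expires_at: datetime, now: datetime,
                        settings: dict = SETTINGS) -> bool:
    """True if the certificate expires within the threshold."""
    threshold = timedelta(days=settings["certificateThresholdDays"])
    return expires_at - now < threshold


def log_expired(log_end_ms: int, now: datetime,
                settings: dict = SETTINGS) -> bool:
    """True if the log is older than the auto-delete period."""
    unit = settings["logAutoDeleteUnit"].lower()  # "DAYS" -> "days"
    max_age = timedelta(**{unit: settings["logAutoDeletePeriod"]})
    logged_at = datetime.fromtimestamp(log_end_ms / 1000, tz=timezone.utc)
    return now - logged_at > max_age
```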

9. OpenSearch/Plugins REST endpoints

9.1 Using pre-existing REST endpoints

9.1.1 Stats APIs

9.1.2 Other miscellaneous APIs

9.2 Health Check endpoints

10. Appendix

10.1 Alerting & Anomaly Detection

10.2 Reporting

10.3 Synthetic monitoring

11. References

11.1 https://geekflare.com/monitor-website-uptime/

11.2 https://www.datadoghq.com/uptime-monitoring-tools/

11.3 https://cabotapp.com/

11.4 https://github.com/arachnys/cabot#single-service-overview

11.5 https://alyvix.com/learn/introduction.html

ps48 commented 2 years ago

Synthetics Demo Video from @paulstn

https://user-images.githubusercontent.com/4348487/166390233-11dbe004-3408-4174-905b-e0fef43fb035.mov

ps48 commented 2 years ago

Issues associated:

Synthetics client

  1. opensearch-project/dashboards-observability#74
  2. opensearch-project/dashboards-observability#73
  3. opensearch-project/dashboards-observability#72
  4. opensearch-project/dashboards-observability#71
  5. opensearch-project/dashboards-observability#70
  6. opensearch-project/dashboards-observability#69
  7. opensearch-project/dashboards-observability#68
  8. opensearch-project/dashboards-observability#67
  9. opensearch-project/dashboards-observability#66
  10. opensearch-project/dashboards-observability#65
  11. opensearch-project/dashboards-observability#64

Synthetics UI

TODO: add work breakdown/ pending issues

elfisher commented 2 years ago

hey is this still tracking for 2.3?

rafael-gumiero commented 1 year ago

Any plan in which version this functionality will be included?

ps48 commented 1 year ago

@rafael-gumiero we don't have this feature in our priority right now. We are open to guiding community contributions. cc: @anirudha @paulstn