megaease / easeprobe

A simple, standalone, and lightweight tool that can do health/status checking, written in Go.
Apache License 2.0

Notifications via DingTalk service sometimes unsuccessful #467

Closed zhaoyou closed 2 months ago

zhaoyou commented 10 months ago

Environment (please complete the following information):

Describe the bug: When using the DingTalk notification service, some alert notifications are sent successfully and others are not. The log message is as follows:

WARN [2024-01-07T10:12:38+08:00] [dingtalk / Dingtalk alert service / Notification] Retried to send 1/3 - Error response from Dingtalk [%!d(float64= 40035)] - [{"errcode":40035, "errmsg": "Missing parameter json"}]

Expected behavior: Notifications should either all succeed or all fail. The response from the DingTalk service indicates that the request is malformed, but I'm not sure what is wrong.
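A side note on the `%!d(float64= 40035)` fragment in the log line above: that is how Go's `fmt` package reports a `%d` verb applied to a `float64`, which is the type `encoding/json` produces for numbers decoded into a generic map. The sketch below reproduces the effect in isolation. It is not EaseProbe's code; the `dingtalkReply` type and both function names are made up for illustration, and only the two JSON fields visible in the log are assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dingtalkReply is a hypothetical type covering only the two fields
// that appear in the logged response body.
type dingtalkReply struct {
	ErrCode int    `json:"errcode"`
	ErrMsg  string `json:"errmsg"`
}

// describeLoose decodes into a generic map, so errcode becomes a float64.
// Formatting a float64 with %d makes fmt emit a bad-verb marker.
func describeLoose(body []byte) string {
	var m map[string]interface{}
	if err := json.Unmarshal(body, &m); err != nil {
		return "invalid JSON: " + err.Error()
	}
	return fmt.Sprintf("[%d]", m["errcode"])
}

// describeTyped decodes into a struct with an int field, so %d is valid.
func describeTyped(body []byte) string {
	var r dingtalkReply
	if err := json.Unmarshal(body, &r); err != nil {
		return "invalid JSON: " + err.Error()
	}
	return fmt.Sprintf("[%d] %s", r.ErrCode, r.ErrMsg)
}

func main() {
	body := []byte(`{"errcode":40035, "errmsg": "Missing parameter json"}`)
	fmt.Println(describeLoose(body)) // [%!d(float64=40035)]
	fmt.Println(describeTyped(body)) // [40035] Missing parameter json
}
```

The marker is cosmetic and does not explain why DingTalk rejected the request; the error code 40035 itself is still readable inside it.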

samanhappy commented 10 months ago

Thank you for submitting this issue. The error message "Missing parameter json" may be caused by special characters that are incompatible with the JSON format.

Could you check if there are any such characters in your configuration file?

Alternatively, please provide the content of your configuration file (anonymized if needed) so that we can identify and resolve the issue.

suchen-sci commented 10 months ago

@samanhappy is right, and more information would help. I will also run some tests to check for potential bugs or compatibility issues.

zhaoyou commented 10 months ago

  1. My configuration file does not contain any such characters.
  2. The notify part of the config:

    
    notify:
      dry: true # dry notification, print the Discord JSON in log(STDOUT)
      timeout: 20s # the timeout send out notification, default: 30s
      retry: # somehow the network is not good and needs to retry.
        times: 3 # default: 3
        interval: 5s # default: 5s

      wecom:
        - name: "Wecom alert service"
          webhook: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
      dingtalk:
        - name: "Dingtalk alert service"
          webhook: "https://oapi.dingtalk.com/robot/send?access_token=xxx"

      log:
        - name: "Local Log"
          file: "/mnt/logs/easeprobe_notify.log"
          dry: true

samanhappy commented 10 months ago

Could you please provide the complete configuration so that we can test it in our local environment?

zhaoyou commented 10 months ago

# Global settings for all probes and notifiers.
settings:

  # The customized name and icon
  name: "EaseProbe" # the name of the probe: default: "EaseProbe"
  #icon: "https://path/to/icon.png" # the icon of the probe. default: "https://megaease.com/favicon.png"
  # Daemon settings

  # pid file path,  default: $CWD/easeprobe.pid,
  # if set to "", will not create pid file.
  #pid: /var/run/easeprobe.pid
  #timeformat: "2006-01-02 15:04:05"

  # A HTTP Server configuration
  http:
    ip: 127.0.0.1 # the IP address of the server. default:"0.0.0.0"
    port: 8181 # the port of the server. default: 8181
    refresh: 5s # the auto-refresh interval of the server. default: the minimum value of the probes' interval.
    log:
      file: /mnt/logs/easeprobe_http_access.log # access log file. default: Stdout
      # Log Rotate Configuration (optional)
      self_rotate: true # true: self rotate log file. default: true
                        # false: managed by outside  (e.g logrotate)
                        #        the settings below will be ignored.
      size: 10 # max of access log file size. default: 10m
      age: 7 #  max of access log file age. default: 7 days
      backups: 5 # max of access log file backups. default: 5
      compress: true # compress the access log file. default: true
    # SLA Report schedule
    sla:
       #  daily, weekly (Sunday), monthly (Last Day), none
      schedule : "daily"
      # UTC time, the format is 'hour:min:sec'
      time: "23:59"
      # debug mode
      # - true: send the SLA report every minute
      # - false: send the SLA report in schedule
      debug: false
      # SLA data persistence file path.
      # The default location is `$CWD/data/data.yaml`
      # Use the following to disable SLA data persistence
      # data: "-"
      backups: 5 # max of SLA data file backups. default: 5
               # if set to a negative value, keep all backup files

# HTTP Probe configuration
http:
  - name: 通知-example.com
    url: https://example.com
    timeout: 20s
    interval: 15m
    # HTTP SUCCESS response code range, default is [0, 499]
    success_code:
      - [200,206]
      - [300,308]

  - name: 通知-数据接口1
    url: http://111.26.70.40:8085
    timeout: 20s
    interval: 15m
    # HTTP SUCCESS response code range, default is [0, 499]
    success_code:
      - [200,206]
      - [300,308]

  - name: 通知-数据接口2
    url: http://111.26.70.40:8083
    timeout: 20s
    interval: 5m
    # HTTP SUCCESS response code range, default is [0, 499]
    success_code:
      - [200,200]

  - name: 通知-数据接口3
    url: http://111.26.70.40:8084
    timeout: 20s
    interval: 5m
    # HTTP SUCCESS response code range, default is [0, 499]
    success_code:
      - [200,200]

# Notification Configuration
notify:
  dry: true # dry notification, print the Discord JSON in log(STDOUT)
  timeout: 20s # the timeout send out notification, default: 30s
  retry: # somehow the network is not good and needs to retry.
    times: 3 # default: 3
    interval: 5s # default: 5s

  wecom:
    - name: "Wecom alert service"
      webhook: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
  dingtalk:
    - name: "Dingtalk alert service"
      webhook: "https://oapi.dingtalk.com/robot/send?access_token=xxx"

  log:
    - name: "Local Log"
      file: "/mnt/logs/easeprobe_notify.log"
      dry: true

samanhappy commented 10 months ago

Thank you for the thorough response. As you mentioned, the configuration does not include any special characters; I apologize for my hasty speculation.

After checking the code, I found that the notification content incorporates the error message from the probe, which is logged in the following format:

ERRO[2024-01-08T16:12:33+08:00] [http / 通知-example.com] error making get request: Get "https://example.com1": dial tcp: lookup example.com1: no such host

Could you please check your environment for a similar log entry? I've tested the configuration, but I was unable to reproduce the Dingtalk error.
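To illustrate why a probe error message embedded in a notification could matter: the sample error contains double quotes, which break a JSON body if the text is pasted into a string template without escaping, while `encoding/json` escapes them automatically. The sketch below only demonstrates that general point and makes no claim about how EaseProbe actually builds its payload. The `markdownMessage` shape is an assumption about a DingTalk markdown robot message, and all function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// markdownMessage is an assumed shape for a DingTalk markdown message.
type markdownMessage struct {
	MsgType  string `json:"msgtype"`
	Markdown struct {
		Title string `json:"title"`
		Text  string `json:"text"`
	} `json:"markdown"`
}

// buildByConcat pastes the text into a template without any escaping.
func buildByConcat(title, text string) []byte {
	const tmpl = `{"msgtype":"markdown","markdown":{"title":"%s","text":"%s"}}`
	return []byte(fmt.Sprintf(tmpl, title, text))
}

// buildByMarshal lets encoding/json escape quotes and control characters.
func buildByMarshal(title, text string) ([]byte, error) {
	var m markdownMessage
	m.MsgType = "markdown"
	m.Markdown.Title = title
	m.Markdown.Text = text
	return json.Marshal(m)
}

// compare reports whether each approach yields syntactically valid JSON.
func compare(title, text string) (concatOK, marshalOK bool) {
	concatOK = json.Valid(buildByConcat(title, text))
	b, err := buildByMarshal(title, text)
	marshalOK = err == nil && json.Valid(b)
	return concatOK, marshalOK
}

func main() {
	probeErr := `error making get request: Get "https://example.com1": dial tcp: lookup example.com1: no such host`
	concatOK, marshalOK := compare("example.com probe", probeErr)
	fmt.Println(concatOK)  // false: the embedded quotes end the string early
	fmt.Println(marshalOK) // true
}
```

If only failure notifications carrying such quoted text were rejected, that would point toward an escaping problem; if the payload is always produced by a JSON encoder, this cause can be ruled out.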

suchen-sci commented 10 months ago

Some online blogs say that error 40035 is related to an invalid JSON payload. But in my tests, easeprobe produces a valid JSON payload that matches the request format in the official DingTalk documentation. It is also weird that some exception notifications can be sent and some cannot (perhaps a problem with the DingTalk API server?).

If possible, could you add a log statement to the SendDingtalkNotification function in notify/dingtalk/dingtalk.go and deploy it on your local machine? Then, when the error occurs again, you will know what message easeprobe sent.
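As a rough idea of what such a log statement could look like, here is a self-contained sketch that logs the outgoing body and the raw response around an HTTP POST. It is not the real SendDingtalkNotification function: `sendWithDebugLog` and `newStubServer` are hypothetical names, the payload is a placeholder, and a local stub server stands in for the DingTalk webhook so the example runs without network access.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// sendWithDebugLog posts payload to webhook and logs both the request
// body and the raw response, so a rejected message can be inspected.
func sendWithDebugLog(webhook string, payload []byte, timeout time.Duration) (string, error) {
	log.Printf("[debug] request body: %s", payload)

	client := &http.Client{Timeout: timeout}
	resp, err := client.Post(webhook, "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	log.Printf("[debug] response %d: %s", resp.StatusCode, body)
	return string(body), nil
}

// newStubServer starts a local server that always answers with reply.
// It stands in for the real webhook in this example.
func newStubServer(reply string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, reply)
	}))
}

func main() {
	srv := newStubServer(`{"errcode":0,"errmsg":"ok"}`)
	defer srv.Close()

	payload := []byte(`{"msgtype":"text","text":{"content":"probe down"}}`)
	reply, err := sendWithDebugLog(srv.URL, payload, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply)
}
```

When logging a real webhook call, the access_token in the URL should be kept out of the log output.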

zhaoyou commented 10 months ago

Thank you for your reply. It's strange: I don't know what's wrong with the logging settings in my configuration file, but I found neither the program's run log nor the notification logs in the corresponding directory.

file: "/mnt/logs/easeprobe_notify.log"
log: /mnt/logs/easeprobe_http_access.log

zhaoyou commented 10 months ago

The console displays the following logs, but no log files appear in the corresponding log directory:

INFO[0000] Clean data file: data/data.yaml-2023-09-11T10_58_25.259281271Z 
INFO[0000] Load the configuration file successfully!    
INFO[0000] Successfully created the PID file: /home/tbcc/release/easeprobe/easeprobe.pid 
INFO[0000] Application Log File [Stdout] - Self-Rotate  
INFO[0000] Web Access Log File [/mnt/logs/easeprobe_http_access.log] - Self-Rotate 
INFO[2024-01-09T09:38:52+08:00] [Web] Access Log output file: /mnt/logs/easeprobe_http_access.log 
INFO[2024-01-09T09:38:52+08:00] [Web] HTTP server is listening on 127.0.0.1:8181 
INFO[2024-01-09T09:38:52+08:00] Probe [http] - [通知-Thermoberg CCDCC] base options are configured! 
INFO[2024-01-09T09:38:52+08:00] [Metric] Counter <EaseProbe_http_total> is created! 
INFO[2024-01-09T09:38:52+08:00] [Metric] Gauge <EaseProbe_http_duration> is created! 
INFO[2024-01-09T09:38:52+08:00] [Metric] Gauge <EaseProbe_http_status> is created! 
INFO[2024-01-09T09:38:52+08:00] [Metric] Gauge <EaseProbe_http_sla> is created! 
INFO[2024-01-09T09:38:52+08:00] [Metric] Counter <EaseProbe_http_status_code> is created! 
INFO[2024-01-09T09:38:52+08:00] [Metric] Gauge <EaseProbe_http_content_len> is created!

suchen-sci commented 10 months ago

Please change dry to false. In a dry run, the notification is only printed and is not actually sent.

suchen-sci commented 10 months ago

/mnt/logs/easeprobe_http_access.log is only written when easeprobe is called via port 8181, for example, curl http://127.0.0.1:8181/metrics. It records requests made to easeprobe by other clients.

As for /mnt/logs/easeprobe_notify.log, you need to change it to dry: false.

zhaoyou commented 10 months ago

I have set both of these parameters to false. I'll keep an eye on it for a while.

Screenshot 2024-01-10 12 46 52

suchen-sci commented 10 months ago

notify:
  dry: true # dry notification, print the Discord JSON in log(STDOUT)
  timeout: 20s # the timeout send out notification, default: 30s
  retry: # somehow the network is not good and needs to retry.
    times: 3 # default: 3
    interval: 5s # default: 5s

According to the manual at https://github.com/megaease/easeprobe/blob/main/docs/Manual.md#72-notification-configuration, notify itself does not have a dry field. The manual says that each notification under notify has a dry parameter, not that notify itself does.

If you want to set a field like dry for all notifications, you should set it in settings, as described at https://github.com/megaease/easeprobe/blob/main/docs/Manual.md#73-global-setting-configuration.

zhaoyou commented 10 months ago

The error still exists. I set the log level to info, but the logs don't show anything useful. Is it possible to see the request body of the push message in debug mode?

image

suchen-sci commented 10 months ago

Hi, I will make a PR to do that. Please wait.

suchen-sci commented 10 months ago

Hi, could you download the latest code from GitHub, compile it with the make command, and try again? I am sure the error message will provide more information this time.

suchen-sci commented 2 months ago

A new version of easeprobe has been released. Please reopen this issue if the problem persists.