openeuler-riscv / oerv-team

OERV Intern Work Center

A-Ops framework service verification #104

Closed jiewu9823 closed 8 months ago

jiewu9823 commented 10 months ago

Verify against https://docs.openeuler.org/zh/docs/22.03_LTS/docs/A-Ops/overview.html:

1. Verify every feature mentioned in the openEuler official documentation.
   1. If a package is unsupported and manual compilation is needed to locate the problem, record the problem in this issue.
   2. If the problem is in an A-Ops package, also sync it to the issues of https://gitee.com/openeuler/A-Ops
menmazqj commented 9 months ago

Verification performed against https://docs.openeuler.org/zh/docs/22.03_LTS/docs/A-Ops/overview.html

The host information for this verification is as follows: (screenshot)

The VM network is set to bridged mode; details:

[root@localhost ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:c1:58:bd brd ff:ff:ff:ff:ff:ff
    altname enp2s1
    inet 192.168.1.7/24 brd 192.168.1.255 scope global dynamic noprefixroute ens33
       valid_lft 57342sec preferred_lft 57342sec
    inet6 fe80::f444:1db:3bb3:caaa/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
menmazqj commented 9 months ago

Since the document involves many commands and config-file changes, the details of this verification are not listed here. Sections 4 (installation and download) through 7 (starting and stopping services) of the document were all verified successfully.

Verification results for section 8 (starting the web service):

  1. The web service works normally (screenshot)

  2. After entering the default username (admin) and password (changeme), an error is shown (screenshot). Still investigating the problem.

menmazqj commented 9 months ago

Troubleshooting the error above: inspect the nginx log file

[root@localhost ~]# less /var/log/nginx/debug.log
......
Referer: http://192.168.1.7/user/login?redirect=%2F
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6

"
2024/02/19 13:29:00 [debug] 2009#0: *3 http cleanup add: 000055CAED351858
2024/02/19 13:29:00 [debug] 2009#0: *3 get rr peer, try: 1
2024/02/19 13:29:00 [debug] 2009#0: *3 stream socket 22
2024/02/19 13:29:00 [debug] 2009#0: *3 epoll add connection: fd:22 ev:80002005
2024/02/19 13:29:00 [debug] 2009#0: *3 connect to 127.0.0.1:11111, fd:22 #5
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream connect: -2
2024/02/19 13:29:00 [debug] 2009#0: *3 posix_memalign: 000055CAED31FBB0:128 @16
2024/02/19 13:29:00 [debug] 2009#0: *3 event timer add: 22: 60000:36115754
2024/02/19 13:29:00 [debug] 2009#0: *3 http finalize request: -4, "/api/manage/account/login?" a:1, c:2
2024/02/19 13:29:00 [debug] 2009#0: *3 http request count:2 blk:0
2024/02/19 13:29:00 [debug] 2009#0: *3 http run request: "/api/manage/account/login?"
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream check client, write event:1, "/api/manage/account/login"
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream request: "/api/manage/account/login?"
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream process header
2024/02/19 13:29:00 [error] 2009#0: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.4, server: localhost, request: "POST /api/manage/account/login HTTP/1.1", upstream: "http://127.0.0.1:11111/manage/account/login", host: "192.168.1.7", referrer: "http://192.168.1.7/user/login?redirect=%2F"
2024/02/19 13:29:00 [debug] 2009#0: *3 http next upstream, 2
2024/02/19 13:29:00 [debug] 2009#0: *3 free rr peer 1 4
2024/02/19 13:29:00 [debug] 2009#0: *3 finalize http upstream request: 502
2024/02/19 13:29:00 [debug] 2009#0: *3 finalize http proxy request
2024/02/19 13:29:00 [debug] 2009#0: *3 close http upstream connection: 22
2024/02/19 13:29:00 [debug] 2009#0: *3 free: 000055CAED31FBB0, unused: 48
2024/02/19 13:29:00 [debug] 2009#0: *3 event timer del: 22: 36115754
2024/02/19 13:29:00 [debug] 2009#0: *3 reusable connection: 0
2024/02/19 13:29:00 [debug] 2009#0: *3 http finalize request: 502, "/api/manage/account/login?" a:1, c:1
2024/02/19 13:29:00 [debug] 2009#0: *3 http special response: 502, "/api/manage/account/login?"
2024/02/19 13:29:00 [debug] 2009#0: *3 xslt filter header
2024/02/19 13:29:00 [debug] 2009#0: *3 HTTP/1.1 502 Bad Gateway
.....

The log shows that the connection to the upstream server failed, returning a 502 error. Next, check port 11111, on which the upstream server configured in nginx should be listening:

[root@localhost ~]# netstat -tulpn | grep 11111

No process is listening on port 11111, which proves the upstream server is not running. The document's steps were followed strictly: the database is confirmed running, and the firewall and SELinux are both disabled.
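This listener check can also be scripted. A minimal sketch using only the Python standard library (port 11111 is the upstream port from the nginx config above; the helper name `port_open` is my own):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused or timeout: nothing is listening
        return False

if __name__ == "__main__":
    # The aops-manager upstream address from the nginx config; adjust as needed.
    if not port_open("127.0.0.1", 11111):
        print("upstream not listening on 11111 -- start aops-manager first")
```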

After asking a mentor, it turned out that I had simply not started the aops-manager service (although this part is not explicitly stated in the document...). Having hit this pitfall, I found that to get the aops web service running, the following commands must be executed in order after all config files have been modified:

aops-basedatabase mysql
aops-basedatabase elasticsearch
setenforce 0
systemctl stop firewalld
systemctl start aops-database
systemctl start aops-manager
systemctl start aops-web
menmazqj commented 9 months ago

Verify the asset management user guide

  1. Add a host group
    [root@localhost ~]# aops group --action add --host_group_name group1 --description "zqj's group" --access_token "username=admin&password=changeme"
    {'code': 200, 'msg': 'operation succeed'}
  2. Query host groups
    [root@localhost ~]# aops group --action query --access_token "username=admin&password=changeme"
    {'code': 200, 'msg': 'operation succeed', 'total_count': 2, 'total_page': 1}
    +-----------------+------------+-----------------+
    |   description   | host_count | host_group_name |
    +-----------------+------------+-----------------+
    | test host group |     0      |     test_zqj    |
    |   zqj's group   |     0      |      group1     |
    +-----------------+------------+-----------------+
  3. Add a host
    [root@localhost ~]# useradd -m aops
    [root@localhost ~]# passwd aops
    Changing password for user aops.
    New password: 
    BAD PASSWORD: The password contains less than 3 character classes
    Retype new password: 
    passwd: all authentication tokens updated successfully.
    [root@localhost ~]# usermod -G wheel aops
    [root@localhost ~]# aops host --action add --host_name 22222 --host_group_name group1 --public_ip 192.168.0.1 --ssh_port 22 --management False --username test --password 123 --sudo_password aaa123 --key Aopsaops. --access_token "username=admin&password=changeme"
    {'code': 200, 'fail_list': [], 'host_list': ['17b350c6cf1d11eeb69e000c29c158bd'], 'msg': 'operation succeed', 'succeed_list': [{'host_group_id': 2, 'host_group_name': 'group1', 'host_id': '17b350c6cf1d11eeb69e000c29c158bd', 'host_name': '22222', 'management': False, 'public_ip': '192.168.0.1', 'ssh_port': 22, 'user': 'admin'}]}
  4. Query hosts
    
    [root@localhost ~]# aops host --action query --host_group_name group1 --access_token "username=admin&password=changeme"

{'code': 200, 'msg': 'operation succeed', 'total_count': 2, 'total_page': 1}
+-----------------+----------------------------------+-----------+------------+-------------+----------+--------+
| host_group_name |             host_id              | host_name | management |  public_ip  | ssh_port | status |
+-----------------+----------------------------------+-----------+------------+-------------+----------+--------+
|      group1     | 17b350c6cf1d11eeb69e000c29c158bd |   22222   |   False    | 192.168.0.1 |    22    |  None  |
|      group1     | bba645d6cf1c11eeb69e000c29c158bd |   11111   |    True    | 192.168.1.7 |    22    |  None  |
+-----------------+----------------------------------+-----------+------------+-------------+----------+--------+

5. Host authentication

[root@localhost ~]# aops certificate --key mi --access_token "username=admin&password=changeme"
{'code': 200, 'msg': 'operation succeed'}

6. Delete a host

[root@localhost ~]# aops host --action delete --host_list 22222 --access_token "username=admin&password=changeme"
{'code': 1103, 'fail_list': ['22222'], 'host_info': {}, 'msg': 'delete data from database fail', 'succeed_list': []}

The command provided in the document fails. My guess at the cause: the database has no parameter corresponding to --host_list, so I tried changing the parameter to --host_name, which still failed; changing it to --host_id reports an unrecognized argument.

[root@localhost ~]# aops host --action delete --host_name 22222 --access_token "username=admin&password=changeme"
No host will be deleted, because of the empty host list. Please check your host list if you want to delete hosts.
[root@localhost ~]# aops host --action delete --host_id 5ac9f6d8cf2e11eeb69e000c29c158bd --access_token "username=admin&password=changeme"
usage: A-Ops [-h] start ...
A-Ops: error: unrecognized arguments: --host_id 5ac9f6d8cf2e11eeb69e000c29c158bd

7. Delete a host group

[root@localhost ~]# aops group --action delete --host_group_list test_zqj --access_token "username=admin&password=changeme"
{'code': 200, 'deleted': ['test_zqj'], 'msg': 'operation succeed'}


### All web pages were verified and currently match the documentation with no problems; since there are many screenshots, only one is included
![image](https://github.com/openEuler-RISCV/oerv-team/assets/39176667/94094786-ee38-4e1c-8ca8-e3b75c8399a4)
menmazqj commented 9 months ago

Verify the deployment management user guide

  1. In the authentication step of section 3.4, the document contains a command typo (screenshot); after correcting it, the command succeeds:
    [root@localhost ~]# aops certificate --key Testtest. --access_token "username=admin&password=changeme"
    {'code': 200, 'msg': 'operation succeed'}
  2. Configure ansible. All the configs the document mentions for zookeeper, kafka, prometheus, node_exporter, mysql, elasticsearch, fluentd, adoctor_check_executor, adoctor_check_scheduler, adoctor_diag_scheduler, adoctor_diag_executor, gala_ragdoll, gala_gopher, and gala_spider live under /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/inventory, with variables under /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/vars. The zookeeper config for the three hosts used in this verification is shown below; kafka, prometheus, and the other applications are configured analogously.
    [root@localhost ~]# cat /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/inventory/zookeeper
    zookeeper_hosts:
      hosts:
        192.168.1.7:
          ansible_host: 192.168.1.7
          ansible_python_interpreter: /usr/bin/python3
          myid: 1
        192.168.1.9:
          ansible_host: 192.168.1.9
          ansible_python_interpreter: /usr/bin/python3
          myid: 2
        192.168.1.10:
          ansible_host: 192.168.1.10
          ansible_python_interpreter: /usr/bin/python3
          myid: 3
  3. Execute the deployment task. The command line runs normally, and the log shows successful execution:

    
    [root@localhost ~]# aops task --action execute --task_list 95c3e692ff3811ebbcd3a89d3a259eef --access_token "username=admin&password=changeme"
    
    The default task for installing: zookeeper, kafka, prometheus, node_exporter, mysql, elasticsearch, fluentd, gala-spider, gala-gopher, gala-ragdoll.
    
    These tasks may change your previous configuration.

The following host will be involved: ['192.168.1.7', '192.168.1.9', '192.168.1.10']
Please check if you want to continue y/n: y
{'code': 200, 'msg': 'operation succeed'}

Done.

[root@localhost ~]# less /var/log/aops/uwsgi/manager.log
[pid: 6728|app: 0|req: 204/204] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:23:59 2024] POST /manage/host/group/get => generated 163 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 205/205] 127.0.0.1 () {46 vars in 915 bytes} [Wed Feb 21 14:24:01 2024] POST /manage/host/group/get => generated 163 bytes in 8 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 206/206] 127.0.0.1 () {46 vars in 893 bytes} [Wed Feb 21 14:24:05 2024] POST /manage/task/get => generated 474 bytes in 14 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 207/207] 127.0.0.1 () {46 vars in 901 bytes} [Wed Feb 21 14:24:05 2024] POST /manage/template/get => generated 90 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 208/208] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:25:03 2024] DELETE /manage/task/delete => generated 39 bytes in 47 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 209/209] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:25:04 2024] POST /manage/task/get => generated 86 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 210/210] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:25:22 2024] POST /manage/task/get => generated 474 bytes in 15 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 211/211] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:25:22 2024] POST /manage/template/get => generated 90 bytes in 10 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 212/212] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:29:35 2024] POST /manage/host/get => generated 415 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 213/213] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:29:35 2024] POST /manage/host/group/get => generated 163 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 214/214] 127.0.0.1 () {46 vars in 906 bytes} [Wed Feb 21 14:29:41 2024] DELETE /manage/host/delete => generated 106 bytes in 25 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 215/215] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:29:41 2024] POST /manage/host/get => generated 251 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 216/216] 127.0.0.1 () {46 vars in 906 bytes} [Wed Feb 21 14:29:43 2024] DELETE /manage/host/delete => generated 106 bytes in 17 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 217/217] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:29:43 2024] POST /manage/host/get => generated 86 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 218/218] 127.0.0.1 () {46 vars in 921 bytes} [Wed Feb 21 14:29:44 2024] POST /manage/host/group/get => generated 163 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 219/219] 127.0.0.1 () {46 vars in 910 bytes} [Wed Feb 21 14:30:24 2024] POST /manage/host/add => generated 301 bytes in 29 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 220/220] 127.0.0.1 () {46 vars in 897 bytes} [Wed Feb 21 14:30:24 2024] POST /manage/host/get => generated 247 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 221/221] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:30:24 2024] POST /manage/host/group/get => generated 163 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 222/222] 127.0.0.1 () {46 vars in 921 bytes} [Wed Feb 21 14:30:25 2024] POST /manage/host/group/get => generated 163 bytes in 8 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 223/223] 127.0.0.1 () {46 vars in 911 bytes} [Wed Feb 21 14:31:00 2024] POST /manage/host/add => generated 301 bytes in 22 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 224/224] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:31:00 2024] POST /manage/host/get => generated 409 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 225/225] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:31:00 2024] POST /manage/host/group/get => generated 163 bytes in 5 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 226/226] 127.0.0.1 () {46 vars in 921 bytes} [Wed Feb 21 14:31:02 2024] POST /manage/host/group/get => generated 163 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 227/227] 127.0.0.1 () {46 vars in 911 bytes} [Wed Feb 21 14:31:33 2024] POST /manage/host/add => generated 302 bytes in 21 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 228/228] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:31:33 2024] POST /manage/host/get => generated 572 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 229/229] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:31:33 2024] POST /manage/host/group/get => generated 163 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 230/230] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:31:37 2024] POST /manage/task/get => generated 474 bytes in 14 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 231/231] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:31:37 2024] POST /manage/template/get => generated 90 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
2024-02-21 14:32:18,728 INFO view/post/162: Start run task ['95c3e692ff3811ebbcd3a89d3a259eef']
2024-02-21 14:32:18,743 INFO view/post/178: Move inventory files from :/opt/aops/host_vars, host name is: 192.168.1.7
2024-02-21 14:32:18,743 INFO view/post/178: Move inventory files from :/opt/aops/host_vars, host name is: 192.168.1.9
2024-02-21 14:32:18,744 INFO view/post/178: Move inventory files from :/opt/aops/host_vars, host name is: 192.168.1.10
[pid: 6728|app: 0|req: 232/232] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:32:18 2024] POST /manage/task/execute => generated 39 bytes in 17 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[WARNING]: file /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/roles/mysql/tasks/config_mysql.yml is empty and had no tasks to include
2024-02-21 14:32:20,179 INFO view/task_with_remove/202: Task 95c3e692ff3811ebbcd3a89d3a259eef execution succeeded.
[pid: 6728|app: 0|req: 233/233] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:32:20 2024] POST /manage/task/get => generated 474 bytes in 13 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)

In addition, this part also has a documentation wording problem: the document says "execution progress and detailed results can be viewed in /var/log/aops/manager.log", but the path on this machine is `/var/log/aops/uwsgi/manager.log`.

4. Check the deployment result on hosts 192.168.1.9 and 192.168.1.10

[root@localhost ~]# systemctl status zookeeper
Unit zookeeper.service could not be found.


It seems the deployment did not succeed: zookeeper and the other components were not installed, and no related services were started.
menmazqj commented 9 months ago

Verify the anomaly detection service user guide

  1. After installing adoctor-check-scheduler and adoctor-check-executor, adding a detection rule reports a missing command:

    [root@localhost ~]# adoctor checkrule --action add --conf check_rule.json --access_token "1111"
    -bash: adoctor: command not found

    adoctor-cli needs to be installed first:

    [root@localhost ~]# dnf install adoctor-cli
  2. Try importing the rule from the web UI. Following the sample, the rule file used this time is:

    [root@localhost ~]# cat check_rule.json 
    {
        "check_items": [
            {
                "check_item": "check_item2",
                "data_list": [
                    {
                        "name": "data1",
                        "type": "kpi",
                        "label": {
                            "cpu": "1",
                            "mode": "irq"
                        }
                    }
                ],
                "condition": "$0>1",
                "plugin": "",
                "description": "data 1"
            }
        ]
    }

    The services need to be started first:

    [root@localhost ~]# /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
    [root@localhost ~]# systemctl start adoctor-check-scheduler
    [root@localhost ~]# systemctl start adoctor-check-executor

    (screenshot) Adding and deleting rules works, but every click on the Save or Delete button hangs until the page is manually refreshed.

After running fault diagnosis, the frontend reports success, but the result is empty when viewed (screenshot)
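Before uploading, a rule file like the one above can be sanity-checked locally. A minimal sketch; the required-field list is an assumption inferred from this sample, not an official schema:

```python
import json

# Fields inferred from the sample rule file above; not an official schema.
REQUIRED = ("check_item", "data_list", "condition", "description")

def validate_rules(text: str) -> list:
    """Parse a check-rule JSON document and report missing fields per item."""
    doc = json.loads(text)
    problems = []
    for i, item in enumerate(doc.get("check_items", [])):
        for field in REQUIRED:
            if field not in item:
                problems.append(f"check_items[{i}] missing '{field}'")
    return problems

sample = '''{"check_items": [{"check_item": "check_item2",
  "data_list": [{"name": "data1", "type": "kpi"}],
  "condition": "$0>1", "plugin": "", "description": "data 1"}]}'''
print(validate_rules(sample))  # an empty list means the sample passes
```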

menmazqj commented 9 months ago

Verify the fault diagnosis service user guide

The fault tree used in this verification is as follows:

{
        "node name":"重启类故障树",
        "value":null,
        "condition":"硬件问题 || 软件问题 || 内核问题",
        "description":"",
        "advice":"",
        "children": [
          {
            "node name":"硬件问题",
            "value":null,
            "condition":"硬件问题1 && 硬件问题2",
            "description":"出现硬件问题",
            "advice":"ccc ddd",
            "children": []
          }
        ]
}
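To illustrate how a fault tree like this combines its nodes, here is a hypothetical evaluator. It is only a sketch of the idea (the node names are invented, and the real adoctor semantics may differ): each node's "condition" references child nodes or raw observations, and `||`/`&&` are mapped onto Python's `or`/`and`.

```python
# Hypothetical evaluator for a fault tree shaped like the JSON above.
# Not adoctor's actual logic; an illustrative sketch only.

def eval_node(node: dict, facts: dict) -> bool:
    children = {c["node name"]: eval_node(c, facts) for c in node.get("children", [])}
    expr = node["condition"]
    # Substitute longer names first so one name being a prefix of another
    # does not corrupt the expression.
    names = sorted({**facts, **children}, key=len, reverse=True)
    for name in names:
        value = children.get(name, facts.get(name))
        expr = expr.replace(name, str(bool(value)))
    expr = expr.replace("||", " or ").replace("&&", " and ")
    return eval(expr)  # acceptable here only because inputs are our own literals

tree = {
    "node name": "reboot fault tree",
    "condition": "hw_fault || sw_fault",
    "children": [
        {"node name": "hw_fault", "condition": "hw1 && hw2", "children": []},
    ],
}
print(eval_node(tree, {"hw1": True, "hw2": True, "sw_fault": False}))  # True
```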

Before using this feature from the frontend, the following configuration is needed and the services must be started. Configure the kafka address that adoctor-diag-executor connects to, replacing the ip with 127.0.0.1:

[root@localhost ~]# cat /etc/aops/diag_executor.ini 
[consumer]
kafka_server_list=127.0.0.1:9092
group_id=DiagGroup
enable_auto_commit=False
auto_offset_reset=earliest
timeout_ms=5
max_records=3

[topic]
name=DIAGNOSE_EXECUTE_REQ
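A config like this can be checked programmatically before starting the services. A small sketch using Python's standard configparser; the section and key names are taken from the file above, while the helper name `kafka_endpoint` is my own:

```python
import configparser

# Inline copy of the config shown above; in real use this would be read
# from /etc/aops/diag_executor.ini instead.
DIAG_EXECUTOR_INI = """
[consumer]
kafka_server_list=127.0.0.1:9092
group_id=DiagGroup
enable_auto_commit=False
auto_offset_reset=earliest
timeout_ms=5
max_records=3

[topic]
name=DIAGNOSE_EXECUTE_REQ
"""

def kafka_endpoint(text: str) -> tuple:
    """Return (host, port) of the kafka server configured for the consumer."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    host, port = cfg["consumer"]["kafka_server_list"].split(":")
    return host, int(port)

print(kafka_endpoint(DIAG_EXECUTOR_INI))  # ('127.0.0.1', 9092)
```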

Start kafka:

[root@localhost ~]# /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties

Start adoctor-diag-scheduler and adoctor-diag-executor:

[root@localhost ~]# systemctl start adoctor-diag-scheduler
[root@localhost ~]# systemctl start adoctor-diag-executor

The results are as follows (screenshot); when viewing the report, its content is entirely empty.

menmazqj commented 9 months ago

Verify the configuration tracing service user guide

Modify the config file to:

[root@localhost ~]# cat /etc/ragdoll/gala-ragdoll.conf 
[git]
git_dir = "/home/confTraceTest"
user_name = "menmazqj"
user_email = "qijia.oerv@isrc.iscas.ac.cn"

[collect]
collect_address = "http://192.168.1.7:11111"
collect_api = "/manage/config/collect"

[ragdoll]
port = 11114

gala-ragdoll needs to be started first:

[root@localhost ~]# systemctl start gala-ragdoll

The services must be started first; after that, everything verified normally (screenshot). All the frontend content shown in the document verified successfully.

In addition, I tried some features not mentioned in the document, such as host synchronization, but no concrete functionality is shown, as in the screenshot below. (screenshot)

menmazqj commented 9 months ago

Verify the architecture awareness service user guide

This section provides installation steps, and installation succeeded, but no usage documentation is provided. Trying to start the services:

[root@localhost ~]# systemctl start gala-gopher
[root@localhost ~]# systemctl start gala-spider
[root@localhost ~]# systemctl status gala-spider
● gala-spider.service - a-ops gala spider service
     Loaded: loaded (/usr/lib/systemd/system/gala-spider.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2024-02-20 22:36:14 CST; 21h ago
   Main PID: 9924 (spider)
      Tasks: 2 (limit: 21420)
     Memory: 29.9M
     CGroup: /system.slice/gala-spider.service
             └─9924 /usr/bin/python3 /usr/bin/spider

Feb 21 20:29:09 localhost.localdomain spider[9924]:   File "/usr/lib/python3.9/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
Feb 21 20:29:09 localhost.localdomain spider[9924]:     response = function(request)
Feb 21 20:29:09 localhost.localdomain spider[9924]:   File "/usr/lib/python3.9/site-packages/connexion/decorators/validation.py", line 396, in wrapper
Feb 21 20:29:09 localhost.localdomain spider[9924]:     return function(request)
Feb 21 20:29:09 localhost.localdomain spider[9924]:   File "/usr/lib/python3.9/site-packages/connexion/decorators/parameter.py", line 115, in wrapper
Feb 21 20:29:09 localhost.localdomain spider[9924]:     return function(**kwargs)
Feb 21 20:29:09 localhost.localdomain spider[9924]:   File "/usr/lib/python3.9/site-packages/spider/controllers/gala_spider.py", line 27, in get_observed_entity_list
Feb 21 20:29:09 localhost.localdomain spider[9924]:     edges_table, edges_infos, nodes_table, lb_tables, vm_tables = node_entity_process()
Feb 21 20:29:09 localhost.localdomain spider[9924]: ValueError: not enough values to unpack (expected 5, got 4)
Feb 21 20:29:09 localhost.localdomain spider[9924]: 127.0.0.1 - - [21/Feb/2024 20:29:09] "GET /gala-spider/api/v1/get_entities HTTP/1.0" 500 -

gala-spider reports an error.
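The traceback shows `node_entity_process()` returning four values where the caller in `gala_spider.py` unpacks five. The snippet below reproduces that failure mode and shows one defensive caller-side pattern; it is an illustration of the bug shape, not the upstream fix:

```python
# Reproduce the failure mode seen in the gala-spider log: the callee returns
# fewer values than the caller unpacks.
def node_entity_process_v1():
    return ("edges", {}, "nodes", [])  # 4 values, caller expects 5

try:
    edges_table, edges_infos, nodes_table, lb_tables, vm_tables = node_entity_process_v1()
except ValueError as err:
    print(err)  # not enough values to unpack (expected 5, got 4)

# A defensive caller-side pattern: pad missing trailing values with None so a
# version skew between components degrades gracefully instead of crashing.
def unpack_padded(values, n, fill=None):
    values = tuple(values)
    return values + (fill,) * (n - len(values)) if len(values) < n else values[:n]

edges_table, edges_infos, nodes_table, lb_tables, vm_tables = unpack_padded(
    node_entity_process_v1(), 5)
print(vm_tables)  # None
```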

menmazqj commented 9 months ago

Summary of problems from this verification

  1. Web UI and CLI are inconsistent
     1.1 When setting the key while adding a host, the web UI requires uppercase letters, special characters, a minimum length of 8, and so on, while the CLI (the --key parameter) imposes no restrictions.
     1.2 Deleting a host from the CLI never succeeds; both the method in the document and the one shown in the CLI usage were tried without success, while the web UI deletes hosts normally.

  2. Documentation problems
     2.1 Section 3.2 of the deployment management guide explains how to modify the per-task step config files but gives no file path; the correct path is "/usr/lib/python3.9/site-packages/aops_manager/deploy_manager/tasks/<task name>.yml".
     2.2 The command in section 3.4 of the deployment management guide, "ps certificate --key xxxx --access_token xxxx", should read "aops certificate --key xxxx --access_token xxxx".
     2.3 In the anomaly detection and fault diagnosis documents, none of the adoctor commands exist.

  3. A-Ops functional problems
     3.1 The intelligent detection feature provides no default detection rules.
     3.2 The web UI "intelligent location / anomaly detection" page cannot add rules.
     3.3 The fault diagnosis feature provides no default fault tree.
     3.4 Many operations fail in the web UI without returning any specific error message.

jiewu9823 commented 9 months ago
  1. When deleting a host from the CLI, --host_list takes the value returned in the add-host response, i.e. 'host_list': ['17b350c6cf1d11eeb69e000c29c158bd'], which is the host_id shown by a query. Replace 111 with 17b350c6cf1d11eeb69e000c29c158bd and try again.
  2. For the deployment management guide: the deployment did not succeed, and we need to find out why. Did the default components zookeeper, kafka, prometheus, node_exporter, mysql, elasticsearch, fluentd, gala-spider, gala-gopher, gala-ragdoll fail to install? Why did they fail to install: is it a configuration problem?
  3. Anomaly detection service:
     1) Running the anomaly detection service requires a successful deployment; without one it cannot run, and according to the component dependency graph, check-executor and check-scheduler depend on kafka, so deploy successfully first and then verify this feature again.
     2) When you say a new anomaly detection rule cannot be created, how exactly does it fail: is it that after clicking the file-selection button there is no confirmation button, so after a refresh the newly uploaded rule does not appear on the rule management page, or that selecting a file should import it directly but does not?
     3) The "create anomaly detection rule" page provides a rule sample.
  4. The fault diagnosis module depends on the aops framework and the anomaly detection service, so a successful deployment is also needed before re-verifying fault diagnosis.
  5. Likewise, verify the configuration tracing service only after a successful deployment.
menmazqj commented 9 months ago

> When deleting a host from the CLI, --host_list takes the value returned in the add-host response, i.e. 'host_list': ['17b350c6cf1d11eeb69e000c29c158bd'], which is the host_id shown by a query. Replace 111 with 17b350c6cf1d11eeb69e000c29c158bd and try again.

After switching to the host_list value from the response, the command succeeds:

[root@localhost ~]# aops host --action add --host_name 22222 --host_group_name group1 --public_ip 192.168.0.1 --ssh_port 22 --management False --username test --password 123 --sudo_password aaa123 --key Aopsaops. --access_token "username=admin&password=changeme"
{'code': 200, 'fail_list': [], 'host_list': ['98d7bf16cff411eeb69e000c29c158bd'], 'msg': 'operation succeed', 'succeed_list': [{'host_group_id': 2, 'host_group_name': 'group1', 'host_id': '98d7bf16cff411eeb69e000c29c158bd', 'host_name': '22222', 'management': False, 'public_ip': '192.168.0.1', 'ssh_port': 22, 'user': 'admin'}]}
[root@localhost ~]# aops host --action delete --host_list 98d7bf16cff411eeb69e000c29c158bd --access_token "username=admin&password=changeme"
{'code': 200, 'fail_list': [], 'msg': 'operation succeed', 'succeed_list': ['98d7bf16cff411eeb69e000c29c158bd']}

> For the deployment management guide: the deployment did not succeed, and we need to find out why. Did the default components zookeeper, kafka, prometheus, node_exporter, mysql, elasticsearch, fluentd, gala-spider, gala-gopher, gala-ragdoll fail to install? Why did they fail to install: is it a configuration problem?

Confirmed that the components were not installed; the configuration followed the official documentation step by step, see https://github.com/openEuler-RISCV/oerv-team/issues/104#issuecomment-1953582739

> 1) Running the anomaly detection service requires a successful deployment; without one it cannot run, and according to the component dependency graph, check-executor and check-scheduler depend on kafka, so deploy successfully first and then verify this feature again.

Verification finished; see the updated comments above.

> 2) When you say a new anomaly detection rule cannot be created, how exactly does it fail: is it that after clicking the file-selection button there is no confirmation button, so after a refresh the newly uploaded rule does not appear on the rule management page, or that selecting a file should import it directly but does not?

Verified. The web import failure was because I had missed the Save button (it sits in the bottom-right corner, is very small, and is plain white... I did not notice it); the CLI import failure is because the adoctor command is missing and has to be installed with dnf (the document does not make this clear).

> Likewise, verify the configuration tracing service only after a successful deployment.

Verified; updated in the comments above.

menmazqj commented 9 months ago

Summary of problems from this verification

Web UI and CLI are inconsistent
1.1 When setting the key while adding a host, the web UI requires uppercase letters, special characters, a minimum length of 8, and so on, while the CLI (the --key parameter) imposes no restrictions.

Documentation problems
2.1 Section 3.2 of the deployment management guide explains how to modify the per-task step config files but gives no file path; the correct path is "/usr/lib/python3.9/site-packages/aops_manager/deploy_manager/tasks/<task name>.yml".
2.2 The command in section 3.4 of the deployment management guide, "ps certificate --key xxxx --access_token xxxx", should read "aops certificate --key xxxx --access_token xxxx".
2.3 In the anomaly detection and fault diagnosis documents, none of the adoctor commands exist; dnf install adoctor-cli is required.

A-Ops functional problems
3.1 The gala-spider service fails to start (all configuration mentioned in the document was checked; every other service starts normally except gala-spider).
3.2 The frontend architecture awareness feature fails, which is related to 3.1.
3.3 Anomaly detection and fault diagnosis execute normally and the frontend reports success, but both return empty results.
3.4 The frontend deployment management feature reports success but appears to perform no meaningful action, similar to 3.3.

Jingwiw commented 9 months ago

> (quote of the first verification comment above: the host information screenshot and the `ip addr` output showing the x86 VM at 192.168.1.7)
Why is the verification host x86?

Jingwiw commented 9 months ago

By default, all oerv task environments are based on riscv.

menmazqj commented 9 months ago

> By default, all oerv task environments are based on riscv.

(screenshot)

Changed. The verification below targets A-Ops on 23.09; the earlier verification was on 22.03.

menmazqj commented 9 months ago

Record of the gala-gopher component verification process

Build process

  1. The build/build.sh script has some small errors (shell syntax problems, wrong file paths, minor flaws in conditional logic, etc.); they can be fixed one by one from the error messages:

[root@openeuler gala-gopher]# build/build.sh --release 6.4.0-1.0.1.4.oe2309.riscv64
find: ‘/root/openeuler/src/probes’: No such file or directory
find: ‘/root/openeuler/src/probes/extends’: No such file or directory
build/build.sh: line 61: cd: /root/openeuler/src/probes: No such file or directory
PROBES_C_LIST:

PROBES_META_LIST:

/root/openeuler
build/build.sh: line 100: cd: /root/openeuler/src/common: No such file or directory
make: *** No targets specified and no makefile found.  Stop.
  2. Install dependencies:

    [root@openeuler gala-gopher]# dnf install systemd cmake gcc-c++ elfutils-devel clang llvm libconfig-devel librdkafka-devel libmicrohttpd-devel libbpf-devel uthash-devel log4cplus-devel cjson-devel libcurl-devel

    Some of these packages are not provided in openEuler 23.09 RISC-V, although they have all been introduced in obs; to use them on this machine, the missing packages were installed locally.

  3. ebpf lacks a vmlinux for openEuler 23.09 RISC-V (the generation script shipped with gala-gopher is tried below).

  4. The vmlinux generation script (vmlinux_build.sh) has a few small problems and works after these fixes: change `if [ ${PAHOLE_VERSION} != "v1.20" ]; then` to `if [ ${PAHOLE_VERSION} == "v1.20" ]; then`, and `if [ ! d ${DWARVES_DIR} ]; then` to `if [ ! -d ${DWARVES_DIR} ]; then`.

  5. The install method for the pahole source package (dwarves-dfsg) shipped with gala-gopher is problematic. dwarves-dfsg comes from the Debian and Ubuntu distributions and is provided in the gala-gopher source, but building it on openEuler hit many problems, e.g. the paths CMake uses to locate libbpf, libdw, and other libraries do not match where the corresponding openEuler packages are installed. After fixing the search paths, mismatches appeared between function implementations in the linked libraries and their declarations in the headers. The openEuler 23.09 repositories provide dwarves, which includes pahole, so gala-gopher's vmlinux script can simply be changed to install it via dnf.

  6. Compile the kernel

jiewu9823 commented 8 months ago

The conclusion is that oerv currently does not support A-Ops because of missing packages.