Closed jiewu9823 closed 8 months ago
Verification follows https://docs.openeuler.org/zh/docs/22.03_LTS/docs/A-Ops/overview.html
The host used for this verification is as follows.
The VM network is set to bridged mode; details below.
[root@localhost ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:c1:58:bd brd ff:ff:ff:ff:ff:ff
altname enp2s1
inet 192.168.1.7/24 brd 192.168.1.255 scope global dynamic noprefixroute ens33
valid_lft 57342sec preferred_lft 57342sec
inet6 fe80::f444:1db:3bb3:caaa/64 scope link noprefixroute
valid_lft forever preferred_lft forever
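Because the documentation involves many commands and configuration-file changes, the details of each step are not listed here. Sections 4 (installation and download) through 7 (starting and stopping services) of the documentation were all verified successfully.
Section 8 (starting the web service): verification result
The web service itself works.
After entering the default username (admin) and password (changeme), an error is displayed. The problem is still being investigated.
Troubleshooting the error above: check the nginx log file.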
[root@localhost ~]# less /var/log/nginx/debug.log
......
Referer: http://192.168.1.7/user/login?redirect=%2F
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
"
2024/02/19 13:29:00 [debug] 2009#0: *3 http cleanup add: 000055CAED351858
2024/02/19 13:29:00 [debug] 2009#0: *3 get rr peer, try: 1
2024/02/19 13:29:00 [debug] 2009#0: *3 stream socket 22
2024/02/19 13:29:00 [debug] 2009#0: *3 epoll add connection: fd:22 ev:80002005
2024/02/19 13:29:00 [debug] 2009#0: *3 connect to 127.0.0.1:11111, fd:22 #5
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream connect: -2
2024/02/19 13:29:00 [debug] 2009#0: *3 posix_memalign: 000055CAED31FBB0:128 @16
2024/02/19 13:29:00 [debug] 2009#0: *3 event timer add: 22: 60000:36115754
2024/02/19 13:29:00 [debug] 2009#0: *3 http finalize request: -4, "/api/manage/account/login?" a:1, c:2
2024/02/19 13:29:00 [debug] 2009#0: *3 http request count:2 blk:0
2024/02/19 13:29:00 [debug] 2009#0: *3 http run request: "/api/manage/account/login?"
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream check client, write event:1, "/api/manage/account/login"
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream request: "/api/manage/account/login?"
2024/02/19 13:29:00 [debug] 2009#0: *3 http upstream process header
2024/02/19 13:29:00 [error] 2009#0: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.4, server: localhost, request: "POST /api/manage/account/login HTTP/1.1", upstream: "http://127.0.0.1:11111/manage/account/login", host: "192.168.1.7", referrer: "http://192.168.1.7/user/login?redirect=%2F"
2024/02/19 13:29:00 [debug] 2009#0: *3 http next upstream, 2
2024/02/19 13:29:00 [debug] 2009#0: *3 free rr peer 1 4
2024/02/19 13:29:00 [debug] 2009#0: *3 finalize http upstream request: 502
2024/02/19 13:29:00 [debug] 2009#0: *3 finalize http proxy request
2024/02/19 13:29:00 [debug] 2009#0: *3 close http upstream connection: 22
2024/02/19 13:29:00 [debug] 2009#0: *3 free: 000055CAED31FBB0, unused: 48
2024/02/19 13:29:00 [debug] 2009#0: *3 event timer del: 22: 36115754
2024/02/19 13:29:00 [debug] 2009#0: *3 reusable connection: 0
2024/02/19 13:29:00 [debug] 2009#0: *3 http finalize request: 502, "/api/manage/account/login?" a:1, c:1
2024/02/19 13:29:00 [debug] 2009#0: *3 http special response: 502, "/api/manage/account/login?"
2024/02/19 13:29:00 [debug] 2009#0: *3 xslt filter header
2024/02/19 13:29:00 [debug] 2009#0: *3 HTTP/1.1 502 Bad Gateway
.....
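The log shows that the connection to the upstream server failed and a 502 error was returned. Next, check port 11111, which the upstream server configured in nginx should be listening on.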
[root@localhost ~]# netstat -tulpn | grep 11111
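No process is listening on port 11111, which shows that the upstream server was not started. The documentation was followed strictly: the database is confirmed to be running, and the firewall and SELinux are both disabled.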
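The same check can be scripted. The following is a minimal sketch, not part of A-Ops: it tries a TCP connection to the upstream address seen in the nginx log (127.0.0.1:11111). The function name `is_port_open` is made up for illustration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value such as
        # 111 (ECONNREFUSED) on failure instead of raising.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 11111 is the upstream port taken from the nginx log above.
    print(is_port_open("127.0.0.1", 11111))
```

A result of `False` corresponds to the "Connection refused" that nginx logged.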
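After asking my mentor, it turned out that I had not started the aops-manager service (although the documentation does not state this explicitly...). Lesson learned: for the aops web service to run successfully, the following commands must be executed in order after all the configuration files have been edited.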
aops-basedatabase mysql
aops-basedatabase elasticsearch
setenforce 0
systemctl stop firewalld
systemctl start aops-database
systemctl start aops-manager
systemctl start aops-web
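Since the order matters, the sequence can be wrapped in a small runner. This is only a sketch: stopping at the first failing command is my assumption, not something the documentation specifies, and `run_in_order` and `run_command` are hypothetical helper names.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands copied from the list above, in the order they were run.
STARTUP_COMMANDS = [
    ["aops-basedatabase", "mysql"],
    ["aops-basedatabase", "elasticsearch"],
    ["setenforce", "0"],
    ["systemctl", "stop", "firewalld"],
    ["systemctl", "start", "aops-database"],
    ["systemctl", "start", "aops-manager"],
    ["systemctl", "start", "aops-web"],
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def run_in_order(
    commands: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[Sequence[str]]:
    """Run commands one by one and stop at the first non-zero exit status.

    Returns the commands that completed successfully.
    """
    done = []
    for argv in commands:
        if runner(argv) != 0:
            break
        done.append(argv)
    return done
```

Calling `run_in_order(STARTUP_COMMANDS)` as root would run the real commands; passing a custom `runner` allows a dry run.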
[root@localhost ~]# aops group --action add --host_group_name group1 --description "zqj's group" --access_token "username=admin&password=changeme"
{'code': 200, 'msg': 'operation succeed'}
[root@localhost ~]# aops group --action query --access_token "username=admin&password=changeme"
{'code': 200, 'msg': 'operation succeed', 'total_count': 2, 'total_page': 1}
+-----------------+------------+-----------------+
| description | host_count | host_group_name |
+-----------------+------------+-----------------+
| test host group | 0 | test_zqj |
| zqj's group | 0 | group1 |
+-----------------+------------+-----------------+
[root@localhost ~]# useradd -m aops
[root@localhost ~]# passwd aops
Changing password for user aops.
New password:
BAD PASSWORD: The password contains less than 3 character classes
Retype new password:
passwd: all authentication tokens updated successfully.
[root@localhost ~]# usermod -G wheel aops
[root@localhost ~]# aops host --action add --host_name 22222 --host_group_name group1 --public_ip 192.168.0.1 --ssh_port 22 --management False --username test --password 123 --sudo_password aaa123 --key Aopsaops. --access_token "username=admin&password=changeme"
{'code': 200, 'fail_list': [], 'host_list': ['17b350c6cf1d11eeb69e000c29c158bd'], 'msg': 'operation succeed', 'succeed_list': [{'host_group_id': 2, 'host_group_name': 'group1', 'host_id': '17b350c6cf1d11eeb69e000c29c158bd', 'host_name': '22222', 'management': False, 'public_ip': '192.168.0.1', 'ssh_port': 22, 'user': 'admin'}]}
[root@localhost ~]# aops host --action query --host_group_name group1 --access_token "username=admin&password=changeme"
{'code': 200, 'msg': 'operation succeed', 'total_count': 2, 'total_page': 1}
+-----------------+----------------------------------+-----------+------------+-------------+----------+--------+
| host_group_name | host_id                          | host_name | management | public_ip   | ssh_port | status |
+-----------------+----------------------------------+-----------+------------+-------------+----------+--------+
| group1          | 17b350c6cf1d11eeb69e000c29c158bd | 22222     | False      | 192.168.0.1 | 22       | None   |
| group1          | bba645d6cf1c11eeb69e000c29c158bd | 11111     | True       | 192.168.1.7 | 22       | None   |
+-----------------+----------------------------------+-----------+------------+-------------+----------+--------+
5. Host authentication
[root@localhost ~]# aops certificate --key mi --access_token "username=admin&password=changeme"
{'code': 200, 'msg': 'operation succeed'}
6. Delete a host
[root@localhost ~]# aops host --action delete --host_list 22222 --access_token "username=admin&password=changeme"
{'code': 1103, 'fail_list': ['22222'], 'host_info': {}, 'msg': 'delete data from database fail', 'succeed_list': []}
The command given in the documentation fails. My guess was that the database has nothing matching the --host_list parameter, so I tried --host_name instead, which still failed. With --host_id, the parameter is not recognized at all.
[root@localhost ~]# aops host --action delete --host_name 22222 --access_token "username=admin&password=changeme"
No host will be deleted, because of the empty host list.
Please check your host list if you want to delete hosts.
[root@localhost ~]# aops host --action delete --host_id 5ac9f6d8cf2e11eeb69e000c29c158bd --access_token "username=admin&password=changeme"
usage: A-Ops [-h] start ...
A-Ops: error: unrecognized arguments: --host_id 5ac9f6d8cf2e11eeb69e000c29c158bd
7. Delete a host group
[root@localhost ~]# aops group --action delete --host_group_list test_zqj --access_token "username=admin&password=changeme"
{'code': 200, 'deleted': ['test_zqj'], 'msg': 'operation succeed'}
### All web pages were verified and are consistent with the documentation so far; no problems were found. Because there are too many screenshots, only one is included.
![image](https://github.com/openEuler-RISCV/oerv-team/assets/39176667/94094786-ee38-4e1c-8ca8-e3b75c8399a4)
[root@localhost ~]# aops certificate --key Testtest. --access_token "username=admin&password=changeme"
{'code': 200, 'msg': 'operation succeed'}
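Host inventories are located in the /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/inventory directory, and variables are configured in the /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/vars directory.
The zookeeper configuration for the three hosts used in this verification is shown below; kafka, prometheus, and the other applications are configured the same way.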
[root@localhost ~]# cat /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/inventory/zookeeper
zookeeper_hosts:
  hosts:
    192.168.1.7:
      ansible_host: 192.168.1.7
      ansible_python_interpreter: /usr/bin/python3
      myid: 1
    192.168.1.9:
      ansible_host: 192.168.1.9
      ansible_python_interpreter: /usr/bin/python3
      myid: 2
    192.168.1.10:
      ansible_host: 192.168.1.10
      ansible_python_interpreter: /usr/bin/python3
      myid: 3
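Because the per-host entries differ only in the IP address and `myid`, the inventory can be generated rather than typed by hand. The sketch below is not an A-Ops tool: `build_zookeeper_inventory` and `to_yaml` are hypothetical helpers, the sequential `myid` numbering simply mirrors the file above, and the tiny YAML renderer only handles nested mappings of plain scalars.

```python
def build_zookeeper_inventory(ips, interpreter="/usr/bin/python3"):
    """Build the inventory mapping, numbering myid from 1 in list order."""
    hosts = {}
    for myid, ip in enumerate(ips, start=1):
        hosts[ip] = {
            "ansible_host": ip,
            "ansible_python_interpreter": interpreter,
            "myid": myid,
        }
    return {"zookeeper_hosts": {"hosts": hosts}}


def to_yaml(node, indent=0):
    """Render nested dicts of scalars as block-style YAML lines."""
    lines = []
    pad = "  " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


if __name__ == "__main__":
    inventory = build_zookeeper_inventory(
        ["192.168.1.7", "192.168.1.9", "192.168.1.10"]
    )
    print("\n".join(to_yaml(inventory)))
```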
Execute the deployment task: the command line runs normally and the log reports success.
[root@localhost ~]# aops task --action execute --task_list 95c3e692ff3811ebbcd3a89d3a259eef --access_token "username=admin&password=changeme"
The default task for installing: zookeeper, kafka, prometheus, node_exporter, mysql, elasticsearch, fluentd, gala-spider, gala-gopher, gala-ragdoll.
These tasks may change your previous configuration.
The following host will be involved: ['192.168.1.7', '192.168.1.9', '192.168.1.10']
Please check if you want to continue y/n: y
{'code': 200, 'msg': 'operation succeed'}
Done.
[root@localhost ~]# less /var/log/aops/uwsgi/manager.log
[pid: 6728|app: 0|req: 204/204] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:23:59 2024] POST /manage/host/group/get => generated 163 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 205/205] 127.0.0.1 () {46 vars in 915 bytes} [Wed Feb 21 14:24:01 2024] POST /manage/host/group/get => generated 163 bytes in 8 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 206/206] 127.0.0.1 () {46 vars in 893 bytes} [Wed Feb 21 14:24:05 2024] POST /manage/task/get => generated 474 bytes in 14 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 207/207] 127.0.0.1 () {46 vars in 901 bytes} [Wed Feb 21 14:24:05 2024] POST /manage/template/get => generated 90 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 208/208] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:25:03 2024] DELETE /manage/task/delete => generated 39 bytes in 47 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 209/209] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:25:04 2024] POST /manage/task/get => generated 86 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 210/210] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:25:22 2024] POST /manage/task/get => generated 474 bytes in 15 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 211/211] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:25:22 2024] POST /manage/template/get => generated 90 bytes in 10 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 212/212] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:29:35 2024] POST /manage/host/get => generated 415 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 213/213] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:29:35 2024] POST /manage/host/group/get => generated 163 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 214/214] 127.0.0.1 () {46 vars in 906 bytes} [Wed Feb 21 14:29:41 2024] DELETE /manage/host/delete => generated 106 bytes in 25 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 215/215] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:29:41 2024] POST /manage/host/get => generated 251 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 216/216] 127.0.0.1 () {46 vars in 906 bytes} [Wed Feb 21 14:29:43 2024] DELETE /manage/host/delete => generated 106 bytes in 17 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 217/217] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:29:43 2024] POST /manage/host/get => generated 86 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 218/218] 127.0.0.1 () {46 vars in 921 bytes} [Wed Feb 21 14:29:44 2024] POST /manage/host/group/get => generated 163 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 219/219] 127.0.0.1 () {46 vars in 910 bytes} [Wed Feb 21 14:30:24 2024] POST /manage/host/add => generated 301 bytes in 29 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 220/220] 127.0.0.1 () {46 vars in 897 bytes} [Wed Feb 21 14:30:24 2024] POST /manage/host/get => generated 247 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 221/221] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:30:24 2024] POST /manage/host/group/get => generated 163 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 222/222] 127.0.0.1 () {46 vars in 921 bytes} [Wed Feb 21 14:30:25 2024] POST /manage/host/group/get => generated 163 bytes in 8 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 223/223] 127.0.0.1 () {46 vars in 911 bytes} [Wed Feb 21 14:31:00 2024] POST /manage/host/add => generated 301 bytes in 22 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 224/224] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:31:00 2024] POST /manage/host/get => generated 409 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 225/225] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:31:00 2024] POST /manage/host/group/get => generated 163 bytes in 5 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 226/226] 127.0.0.1 () {46 vars in 921 bytes} [Wed Feb 21 14:31:02 2024] POST /manage/host/group/get => generated 163 bytes in 6 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 227/227] 127.0.0.1 () {46 vars in 911 bytes} [Wed Feb 21 14:31:33 2024] POST /manage/host/add => generated 302 bytes in 21 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 228/228] 127.0.0.1 () {46 vars in 898 bytes} [Wed Feb 21 14:31:33 2024] POST /manage/host/get => generated 572 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 229/229] 127.0.0.1 () {46 vars in 909 bytes} [Wed Feb 21 14:31:33 2024] POST /manage/host/group/get => generated 163 bytes in 7 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 230/230] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:31:37 2024] POST /manage/task/get => generated 474 bytes in 14 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 6728|app: 0|req: 231/231] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:31:37 2024] POST /manage/template/get => generated 90 bytes in 9 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
2024-02-21 14:32:18,728 INFO view/post/162: Start run task ['95c3e692ff3811ebbcd3a89d3a259eef']
2024-02-21 14:32:18,743 INFO view/post/178: Move inventory files from :/opt/aops/host_vars, host name is: 192.168.1.7
2024-02-21 14:32:18,743 INFO view/post/178: Move inventory files from :/opt/aops/host_vars, host name is: 192.168.1.9
2024-02-21 14:32:18,744 INFO view/post/178: Move inventory files from :/opt/aops/host_vars, host name is: 192.168.1.10
[pid: 6728|app: 0|req: 232/232] 127.0.0.1 () {46 vars in 902 bytes} [Wed Feb 21 14:32:18 2024] POST /manage/task/execute => generated 39 bytes in 17 msecs (HTTP/1.0 200) 2 headers in 71 bytes (1 switches on core 0)
[WARNING]: file /usr/lib/python3.9/site-packages/aops_manager/deploy_manager/ansible_handler/roles/mysql/tasks/config_mysql.yml is empty and had no tasks to include
2024-02-21 14:32:20,179 INFO view/task_with_remove/202: Task 95c3e692ff3811ebbcd3a89d3a259eef execution succeeded.
[pid: 6728|app: 0|req: 233/233] 127.0.0.1 () {46 vars in 894 bytes} [Wed Feb 21 14:32:20 2024] POST /manage/task/get => generated 474 bytes in 13 msecs (HTTP/1.0 200) 2 headers in 72 bytes (1 switches on core 0)
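This part of the documentation also has a wording problem: it says "execution progress and detailed results can be viewed in /var/log/aops/manager.log", but on the test machine the path is `/var/log/aops/uwsgi/manager.log`.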
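The log is noisy, so it can help to pull out only the task results. The sketch below relies solely on the "Task ... execution succeeded" line format seen in this log; I did not observe what a failure line looks like, so it is not handled. `succeeded_tasks` is a made-up name.

```python
import re

# Matches lines such as:
# 2024-02-21 14:32:20,179 INFO view/task_with_remove/202: Task <id> execution succeeded.
SUCCESS_RE = re.compile(r"Task (?P<task_id>[0-9a-f]{32}) execution succeeded")


def succeeded_tasks(log_text: str) -> list:
    """Return the IDs of all tasks the manager log reports as succeeded."""
    return [match.group("task_id") for match in SUCCESS_RE.finditer(log_text)]


if __name__ == "__main__":
    # Path as found on the test machine, not the one given in the documentation.
    with open("/var/log/aops/uwsgi/manager.log", encoding="utf-8") as handle:
        print(succeeded_tasks(handle.read()))
```

Note that this only reports what the manager claims; it says nothing about whether the components were actually installed on the target hosts.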
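4. Check the deployment on hosts 192.168.1.9 and 192.168.1.10
[root@localhost ~]# systemctl status zookeeper
Unit zookeeper.service could not be found.
The deployment appears to have failed: zookeeper and the other components were not installed, and the related services were not started.
After installing adoctor-check-scheduler and adoctor-check-executor, adding a check rule fails because the command is not found.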
[root@localhost ~]# adoctor checkrule --action add --conf check_rule.json --access_token "1111"
-bash: adoctor: command not found
adoctor-cli must be installed first.
[root@localhost ~]# dnf install adoctor-cli
Next, try importing rules through the web UI. Based on the sample, the rule file used for this verification is:
[root@localhost ~]# cat check_rule.json
{
"check_items": [
{
"check_item": "check_item2",
"data_list": [{
"name": "data1",
"type": "kpi",
"label": {
"cpu": "1",
"mode": "irq"
}
}],
"condition": "$0>1",
"plugin": "",
"description": "data 1"
}]
}
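Before uploading a rule file, a quick structural check can catch typos. This sketch assumes that the keys present in the sample above (`check_item`, `data_list`, `condition`, `plugin`, `description`) are all required; the real adoctor schema may differ. `missing_keys` is a hypothetical helper.

```python
import json

# Assumption: these are the keys used in the sample rule file above.
REQUIRED_KEYS = {"check_item", "data_list", "condition", "plugin", "description"}


def missing_keys(rule_text: str) -> dict:
    """Map each check item name to the set of expected keys it lacks."""
    document = json.loads(rule_text)
    problems = {}
    for index, item in enumerate(document.get("check_items", [])):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems[item.get("check_item", f"#{index}")] = missing
    return problems


if __name__ == "__main__":
    with open("check_rule.json", encoding="utf-8") as handle:
        print(missing_keys(handle.read()) or "no missing keys")
```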
The services must be started first.
[root@localhost ~]# /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
[root@localhost ~]# systemctl start adoctor-check-scheduler
[root@localhost ~]# systemctl start adoctor-check-executor
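Adding and deleting rules works, but every click on the save or delete button hangs until the page is refreshed manually.
After running fault diagnosis the front end reports success, but the result is empty.
The fault tree used for this verification is shown below (the node names mean "reboot fault tree", "hardware problem", "software problem", and "kernel problem").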
{
"node name":"重启类故障树",
"value":null,
"condition":"硬件问题 || 软件问题 || 内核问题",
"description":"",
"advice":"",
"children": [
{
"node name":"硬件问题",
"value":null,
"condition":"硬件问题1 && 硬件问题2",
"description":"出现硬件问题",
"advice":"ccc ddd",
"children": []
}
]
}
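One thing that stands out in the tree above is that the root condition names three operands while `children` defines only one node. Whether that is related to the empty report is not confirmed. A small sketch for listing such operands, assuming that names joined by `||` and `&&` are meant to refer to child nodes (operands on leaf nodes may instead name check items, so hits on leaves are expected); parentheses and negation are not handled, and `undefined_operands` is a made-up helper:

```python
import re


def undefined_operands(node):
    """Yield (node name, operand) for condition operands with no matching child."""
    children = node.get("children") or []
    child_names = {child["node name"] for child in children}
    condition = node.get("condition") or ""
    for operand in re.split(r"\|\||&&", condition):
        operand = operand.strip()
        if operand and operand not in child_names:
            yield node["node name"], operand
    # Recurse so that every level of the tree is checked.
    for child in children:
        yield from undefined_operands(child)
```

Usage: load the tree with `json.load` and iterate over `undefined_operands(tree)`.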
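Before this feature can be used from the front end, the following configuration is required and the services must be started.
Configure the kafka address that adoctor-diag-executor connects to, replacing the IP with 127.0.0.1.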
[root@localhost ~]# cat /etc/aops/diag_executor.ini
[consumer]
kafka_server_list=127.0.0.1:9092
group_id=DiagGroup
enable_auto_commit=False
auto_offset_reset=earliest
timeout_ms=5
max_records=3
[topic]
name=DIAGNOSE_EXECUTE_REQ
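To confirm which kafka address the executor will actually use, the file can be read with the standard `configparser` module. This is a sketch: treating `kafka_server_list` as a comma-separated list is my assumption based on the option name, and `kafka_servers` is a hypothetical helper.

```python
import configparser


def kafka_servers(ini_text: str) -> list:
    """Return the entries of [consumer] kafka_server_list as a list."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get("consumer", "kafka_server_list")
    # Assumption: multiple servers would be separated by commas.
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


if __name__ == "__main__":
    with open("/etc/aops/diag_executor.ini", encoding="utf-8") as handle:
        print(kafka_servers(handle.read()))
```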
Start kafka.
[root@localhost ~]# /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
Start adoctor-diag-scheduler and adoctor-diag-executor.
[root@localhost ~]# systemctl start adoctor-diag-scheduler
[root@localhost ~]# systemctl start adoctor-diag-executor
The result is as follows: when the report is opened, its contents are all empty.
Change the configuration file to:
[root@localhost ~]# cat /etc/ragdoll/gala-ragdoll.conf
[git]
git_dir = "/home/confTraceTest"
user_name = "menmazqj"
user_email = "qijia.oerv@isrc.iscas.ac.cn"
[collect]
collect_address = "http://192.168.1.7:11111"
collect_api = "/manage/config/collect"
[ragdoll]
port = 11114
gala-ragdoll must be started first.
[root@localhost ~]# systemctl start gala-ragdoll
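Once the service is started, everything that follows verifies normally. All front-end content shown in the documentation was verified successfully.
In addition, I tried some features not mentioned in the documentation, such as host synchronization, but the UI shows that the feature is not implemented (see screenshot).
This section provides installation steps, and installation works, but there is no usage documentation. Trying to start the services: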
[root@localhost ~]# systemctl start gala-gopher
[root@localhost ~]# systemctl start gala-spider
[root@localhost ~]# systemctl status gala-spider
● gala-spider.service - a-ops gala spider service
Loaded: loaded (/usr/lib/systemd/system/gala-spider.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2024-02-20 22:36:14 CST; 21h ago
Main PID: 9924 (spider)
Tasks: 2 (limit: 21420)
Memory: 29.9M
CGroup: /system.slice/gala-spider.service
└─9924 /usr/bin/python3 /usr/bin/spider
Feb 21 20:29:09 localhost.localdomain spider[9924]: File "/usr/lib/python3.9/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
Feb 21 20:29:09 localhost.localdomain spider[9924]: response = function(request)
Feb 21 20:29:09 localhost.localdomain spider[9924]: File "/usr/lib/python3.9/site-packages/connexion/decorators/validation.py", line 396, in wrapper
Feb 21 20:29:09 localhost.localdomain spider[9924]: return function(request)
Feb 21 20:29:09 localhost.localdomain spider[9924]: File "/usr/lib/python3.9/site-packages/connexion/decorators/parameter.py", line 115, in wrapper
Feb 21 20:29:09 localhost.localdomain spider[9924]: return function(**kwargs)
Feb 21 20:29:09 localhost.localdomain spider[9924]: File "/usr/lib/python3.9/site-packages/spider/controllers/gala_spider.py", line 27, in get_observed_entity_list
Feb 21 20:29:09 localhost.localdomain spider[9924]: edges_table, edges_infos, nodes_table, lb_tables, vm_tables = node_entity_process()
Feb 21 20:29:09 localhost.localdomain spider[9924]: ValueError: not enough values to unpack (expected 5, got 4)
Feb 21 20:29:09 localhost.localdomain spider[9924]: 127.0.0.1 - - [21/Feb/2024 20:29:09] "GET /gala-spider/api/v1/get_entities HTTP/1.0" 500 -
gala-spider reports an error.
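The traceback says that `node_entity_process()` returned four values while the caller in `get_observed_entity_list` unpacks five. The snippet below only reproduces that class of error in isolation; the function body is a hypothetical stand-in, not the gala-spider source.

```python
def node_entity_process():
    # Hypothetical stand-in that returns four values, like the failing call.
    return "edges_table", "edges_infos", "nodes_table", "lb_tables"


try:
    # The caller expects five values, as in gala_spider.py line 27.
    edges_table, edges_infos, nodes_table, lb_tables, vm_tables = node_entity_process()
except ValueError as exc:
    message = str(exc)
    print(message)
```

This suggests a mismatch between the caller and the callee within the installed gala-spider package rather than a configuration problem, although I have not confirmed that against the source.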
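Inconsistencies between the web UI and the command line
1.1 When a key is set while adding a host, the web UI requires it to contain an uppercase letter and a special character and to be at least 8 characters long, among other rules, whereas the command line (--key parameter) imposes no restrictions.
1.2 Deleting a host from the command line fails. Both the method in the documentation and the one in the command-line usage text were tried, and neither works. The web UI deletes hosts normally.
Documentation wording problems
2.1 Section 3.2 of the deployment management user manual explains how to modify the task component step configuration file but does not give the path of the file to modify. The correct path is "/usr/lib/python3.9/site-packages/aops_manager/deploy_manager/tasks/<task name>.yml".
2.2 In section 3.4 of the deployment management user manual, the command "ps certificate --key xxxx --access_token xxxx" should be "aops certificate --key xxxx --access_token xxxx".
2.3 In the anomaly detection service and fault diagnosis documentation, none of the adoctor commands exist.
Aops functional problems
3.1 The Aops intelligent detection feature provides no default check rules.
3.2 Rules cannot be added in the web UI under intelligent location, anomaly detection.
3.3 The Aops fault diagnosis feature provides no default fault diagnosis tree.
3.4 Many operations in the web UI fail without returning a specific error message.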
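To illustrate the mismatch in item 1.1, here is a sketch of the three web UI rules named there. The web UI may enforce more rules than these, and defining "special character" as `string.punctuation` is my assumption; `key_problems` is a made-up helper, not A-Ops code.

```python
import string


def key_problems(key: str, min_length: int = 8) -> list:
    """Return the list of web UI rules (as described above) that a key violates."""
    problems = []
    if len(key) < min_length:
        problems.append(f"shorter than {min_length} characters")
    if not any(ch.isupper() for ch in key):
        problems.append("no uppercase letter")
    # Assumption: "special character" means ASCII punctuation.
    if not any(ch in string.punctuation for ch in key):
        problems.append("no special character")
    return problems


if __name__ == "__main__":
    for candidate in ("mi", "Aopsaops."):
        print(candidate, key_problems(candidate))
```

Under these rules the key `mi`, which the command line accepted earlier, violates all three, while `Aopsaops.` passes.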
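When deleting a host from the command line, --host_list takes the value of 'host_list': ['17b350c6cf1d11eeb69e000c29c158bd'] in the response returned when the host was added, which is the host_id shown by query. Replace 111 with 17b350c6cf1d11eeb69e000c29c158bd and try again.
After switching to the host_list value from the response, the command succeeds.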
[root@localhost ~]# aops host --action add --host_name 22222 --host_group_name group1 --public_ip 192.168.0.1 --ssh_port 22 --management False --username test --password 123 --sudo_password aaa123 --key Aopsaops. --access_token "username=admin&password=changeme"
{'code': 200, 'fail_list': [], 'host_list': ['98d7bf16cff411eeb69e000c29c158bd'], 'msg': 'operation succeed', 'succeed_list': [{'host_group_id': 2, 'host_group_name': 'group1', 'host_id': '98d7bf16cff411eeb69e000c29c158bd', 'host_name': '22222', 'management': False, 'public_ip': '192.168.0.1', 'ssh_port': 22, 'user': 'admin'}]}
[root@localhost ~]# aops host --action delete --host_list 98d7bf16cff411eeb69e000c29c158bd --access_token "username=admin&password=changeme"
{'code': 200, 'fail_list': [], 'msg': 'operation succeed', 'succeed_list': ['98d7bf16cff411eeb69e000c29c158bd']}
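Since the ID has to be copied from the add response, that step can be scripted. The sketch below parses the printed response (a Python dict literal) and builds one delete command per ID, using only the single-ID form of `--host_list` shown above; how several IDs would be passed in one call is not covered here. `delete_commands` is a hypothetical helper.

```python
import ast
import shlex


def delete_commands(add_response_text: str, access_token: str) -> list:
    """Build one 'aops host --action delete' command per ID in host_list."""
    # The CLI prints a Python dict literal, so literal_eval can parse it.
    response = ast.literal_eval(add_response_text)
    commands = []
    for host_id in response["host_list"]:
        argv = [
            "aops", "host", "--action", "delete",
            "--host_list", host_id,
            "--access_token", access_token,
        ]
        commands.append(shlex.join(argv))
    return commands
```

`shlex.join` quotes the access token so that the `&` in it is not interpreted by the shell.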
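When verifying the deployment management user manual, the deployment did not succeed. You need to find out and confirm why. Did the default components (zookeeper, kafka, prometheus, node_exporter, mysql, elasticsearch, fluentd, gala-spider, gala-gopher, gala-ragdoll) fail to install? If so, why? Is it a configuration problem?
Confirmed that they were not installed. The configuration followed the official documentation step by step; see https://github.com/openEuler-RISCV/oerv-team/issues/104#issuecomment-1953582739 for details.
1) The anomaly detection service requires a successful deployment; without one, the service cannot run properly. According to the component dependency diagram, check-executor and check-scheduler depend on kafka, so this feature should be verified again after the deployment succeeds.
Verification complete; see the updated comment above.
2) Regarding being unable to create anomaly detection rules: how exactly does it fail? After importing with the choose-file button, is there no confirm button, so that the newly uploaded rule is missing from the rule management page after a refresh? Or should choosing a file import it directly, yet nothing is imported?
Verified. The front-end import failed because I overlooked the save button (it sits in the bottom-right corner, is very small, and is plain white, so I missed it). The command-line import failed because the adoctor command was missing and has to be installed with dnf (the documentation does not make this clear).
Likewise, verify the configuration tracing service only after the deployment succeeds.
Verified; the comment above has been updated.
Inconsistencies between the web UI and the command line
1.1 When a key is set while adding a host, the web UI requires it to contain an uppercase letter and a special character and to be at least 8 characters long, among other rules, whereas the command line (--key parameter) imposes no restrictions.
Documentation wording problems
2.1 Section 3.2 of the deployment management user manual explains how to modify the task component step configuration file but does not give the path of the file to modify. The correct path is "/usr/lib/python3.9/site-packages/aops_manager/deploy_manager/tasks/<task name>.yml".
2.2 In section 3.4 of the deployment management user manual, the command "ps certificate --key xxxx --access_token xxxx" should be "aops certificate --key xxxx --access_token xxxx".
2.3 In the anomaly detection service and fault diagnosis documentation, none of the adoctor commands exist; dnf install adoctor-cli is required.
Aops functional problems
3.1 The gala-spider service reports errors after being started (all configuration mentioned in the documentation was checked, and every other service starts normally; only gala-spider fails).
3.2 The front-end architecture awareness feature fails, which is related to the problem in 3.1.
3.3 Anomaly detection and fault diagnosis run normally and the front end reports success, but both return empty results.
3.4 The deployment management feature in the front end reports success but does not appear to perform any meaningful action, similar to 3.3.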
Why is the verification host x86?
By default, all oerv task environments are based on riscv.
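Changed. The verification below covers A-Ops on 23.09; the earlier verification was on 22.03.
The errors can be fixed one by one based on the messages.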
[root@openeuler gala-gopher]# build/build.sh --release 6.4.0-1.0.1.4.oe2309.riscv64
find: ‘/root/openeuler/src/probes’: No such file or directory
find: ‘/root/openeuler/src/probes/extends’: No such file or directory
build/build.sh: line 61: cd: /root/openeuler/src/probes: No such file or directory
PROBES_C_LIST:
PROBES_META_LIST:
/root/openeuler
build/build.sh: line 100: cd: /root/openeuler/src/common: No such file or directory
make: *** No targets specified and no makefile found. Stop.
Install dependencies
[root@openeuler gala-gopher]# dnf install systemd cmake gcc-c++ elfutils-devel clang llvm libconfig-devel librdkafka-devel libmicrohttpd-devel libbpf-devel uthash-devel log4cplus-devel cjson-devel libcurl-devel
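Some of these packages are not provided in openEuler 23.09 RISC-V, although they have all been imported in OBS. To use them on this machine, the missing packages were installed locally.
eBPF lacks a vmlinux for the openEuler 23.09 RISCV version (below, the generation script provided by gala-gopher is tried).
The vmlinux generation script (`vmlinux_build.sh`) has a few small problems; after the following fixes it works.
if [ ${PAHOLE_VERSION} != "v1.20" ]; then
changed to if [ ${PAHOLE_VERSION} == "v1.20" ]; then
if [ ! d ${DWARVES_DIR} ]; then
changed to if [ ! -d ${DWARVES_DIR} ]; then
The installation method for the pahole source package (dwarves-dfsg) shipped with gala-gopher is problematic. dwarves-dfsg originates from the Debian and Ubuntu distributions and is included in the gala-gopher source tree, but building it on openEuler runs into many problems. For example, the paths where CMake looks for libbpf, libdw, and other libraries do not match the install paths of the corresponding openEuler packages. After the search paths were fixed, further problems appeared, such as function implementations in the linked libraries not matching the declarations in the headers. The openEuler 23.09 repositories provide dwarves, which includes pahole, so the gala-gopher vmlinux script can be changed to install it directly with dnf.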
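Building the kernel with `rpmbuild -ba ../SPEC/kernel.spec` requires the prerequisite tool `rpm-build`, but the `vmlinux_build.sh` script contains no command to install it.
Conclusion: because the required packages are missing, oerv does not currently support A-Ops.
Verify against https://docs.openeuler.org/zh/docs/22.03_LTS/docs/A-Ops/overview.html
1. Verify every feature mentioned on the official openEuler website.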