sjtug / mirror-requests

新镜像请求 & BUG 汇报
https://mirrors.sjtug.sjtu.edu.cn
48 stars 2 forks source link

[事故总结] 2018/05/26服务不可用 #17

Open htfy96 opened 6 years ago

htfy96 commented 6 years ago

类别:Bug调查

根据可用性检测显示,从2018/05/25 21:29:32到2018/05/26 11:01:55秒,软件源服务的所有/lug/v1/API与反向代理(校内、校外)均无法正常工作,同步功能均失败。

没有及时发现与响应

没有被及时注意到的原因主要有两点:

以上任何一点若能正常工作都可以更早地注意到问题。

解决方案:

问题本身与调查过程

在05/26 10时左右接到群里有人反馈镜像源不可用之后,首先在本地确定了故障范围:/lug/v1 API、反向代理与同步功能,这三项都依赖代理的正常工作,初步猜测是v2ray代理出现了问题。ssh上了服务器检查v2ray的log,发现最后一条是2018/05/25 21:28的记录,发现已无法正常工作。查看进程发现v2ray容器内无进程,初步猜测是v2ray进程被kill了,利用dmesg查看kill记录,发现原因是OOM kill:

[Fri May 25 20:48:02 2018] v2ray invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
[Fri May 25 20:48:02 2018] v2ray cpuset=f0cfdc69eae8aa9fbf02cf73ab8b6951fceaa4b0d2cc5124d7990efdf8d48c27 mems_allowed=0-1
[Fri May 25 20:48:02 2018] CPU: 11 PID: 7408 Comm: v2ray Not tainted 4.15.4-300.fc27.x86_64 #1
[Fri May 25 20:48:02 2018] Hardware name: Lenovo Lenovo WQ R520 G7/Tyan Tank GT20-B7002LNV, BIOS 'V1.08A ' 05/19/2011
[Fri May 25 20:48:02 2018] Call Trace:
[Fri May 25 20:48:02 2018]  dump_stack+0x5c/0x85
[Fri May 25 20:48:02 2018]  dump_header+0x6e/0x289
[Fri May 25 20:48:02 2018]  oom_kill_process+0x218/0x420
[Fri May 25 20:48:02 2018]  out_of_memory+0x2ea/0x4f0
[Fri May 25 20:48:02 2018]  mem_cgroup_out_of_memory+0x49/0x80
[Fri May 25 20:48:02 2018]  mem_cgroup_oom_synchronize+0x2ef/0x340
[Fri May 25 20:48:02 2018]  ? mem_cgroup_css_online+0x30/0x30
[Fri May 25 20:48:02 2018]  pagefault_out_of_memory+0x32/0x77
[Fri May 25 20:48:02 2018]  __do_page_fault+0x4a7/0x4e0
[Fri May 25 20:48:02 2018]  do_page_fault+0x32/0x110
[Fri May 25 20:48:02 2018]  ? page_fault+0x36/0x60
[Fri May 25 20:48:02 2018]  page_fault+0x4c/0x60
[Fri May 25 20:48:02 2018] RIP: 0033:0x8ef0b3
[Fri May 25 20:48:02 2018] RSP: 002b:000000c420721c80 EFLAGS: 00010246
[Fri May 25 20:48:02 2018] Task in /docker/f0cfdc69eae8aa9fbf02cf73ab8b6951fceaa4b0d2cc5124d7990efdf8d48c27 killed as a result of limit of /docker/f0cfdc69eae8aa9fbf02cf73ab8b6951fceaa4b0d2cc5124d7990efdf8d48c27
[Fri May 25 20:48:02 2018] memory: usage 2097152kB, limit 2097152kB, failcnt 0
[Fri May 25 20:48:02 2018] memory+swap: usage 2097152kB, limit 2097152kB, failcnt 76
[Fri May 25 20:48:02 2018] kmem: usage 6984kB, limit 9007199254740988kB, failcnt 0
[Fri May 25 20:48:02 2018] Memory cgroup stats for /docker/f0cfdc69eae8aa9fbf02cf73ab8b6951fceaa4b0d2cc5124d7990efdf8d48c27: cache:0KB rss:2090168KB rss_huge:577536KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:92928KB active_anon:1997100KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri May 25 20:48:02 2018] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri May 25 20:48:02 2018] [ 7173]     0  7173   396111   393956  3219456        0             0 v2ray
[Fri May 25 20:48:02 2018] Memory cgroup out of memory: Kill process 7173 (v2ray) score 752 or sacrifice child
[Fri May 25 20:48:02 2018] Killed process 7173 (v2ray) total-vm:1584444kB, anon-rss:1566472kB, file-rss:9524kB, shmem-rss:0kB
[Fri May 25 20:48:02 2018] oom_reaper: reaped process 7173 (v2ray), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Fri May 25 20:48:02 2018] br-c93d894a4bad: port 3(veth8a163f5) entered disabled state
[Fri May 25 20:48:02 2018] vethf5da6ea: renamed from eth0
[Fri May 25 20:48:02 2018] br-c93d894a4bad: port 3(veth8a163f5) entered disabled state
[Fri May 25 20:48:02 2018] device veth8a163f5 left promiscuous mode

初步猜测是因为v2ray 死锁、资源泄漏等问题导致的OOM kill。登录monitor.mirrors.sjtug.org:3000查看故障前后的内存占用: screenshot from 2018-05-26 11-33-05

可以看出在OOM kill(48分)前后,并没有急剧的内存占用上升或CPU利用率上升,这说明内存占用过高并非由于一时的bug或死循环在短时间内耗尽内存,而是来自逐渐增长的泄漏。

v2ray最后的log:

2018/05/25 12:48:31 [Info] [2979591631] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:31 172.18.0.2:60084 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:31 [Info] [2979591631] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:31 [Info] [2979591631] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:31 [Info] [2979591631] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:31 [Info] [2411693991] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:31 172.18.0.2:60088 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:31 [Info] [2411693991] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:31 [Info] [2411693991] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:31 [Info] [2411693991] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:31 [Info] [2481794292] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:31 172.18.0.2:60092 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:31 [Info] [2481794292] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:31 [Info] [2481794292] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:31 [Info] [2481794292] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:32 [Info] [112547227] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:32 172.18.0.2:60096 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:32 [Info] [112547227] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:32 [Info] [112547227] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:32 [Info] [112547227] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:33 172.18.0.2:60100 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:33 [Info] [3533665064] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:33 [Info] [3533665064] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:33 [Info] [3533665064] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:33 [Info] [3533665064] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:34 [Info] [3089256634] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:34 172.18.0.2:60104 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:34 [Info] [3089256634] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:34 [Info] [3089256634] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:34 [Info] [3089256634] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:39 [Info] [3117052702] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:39 172.18.0.2:60112 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:39 [Info] [3117052702] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:39 [Info] [3117052702] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:39 [Info] [3117052702] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:39 172.18.0.2:60116 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:39 [Info] [3671255154] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:39 [Info] [3671255154] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:39 [Info] [3671255154] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:39 [Info] [3671255154] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:39 [Info] [2193308538] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:39 172.18.0.2:60120 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso  
2018/05/25 12:48:39 [Info] [2193308538] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:39 [Info] [2193308538] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:39 [Info] [2193308538] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:41 [Info] [465477382] Proxy|HTTP: request to Method [GET] Host [ftp.sjtu.edu.cn] with URL [http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso]
2018/05/25 12:48:41 172.18.0.2:60124 accepted http://ftp.sjtu.edu.cn/archlinux/iso/2018.05.01/archlinux-2018.05.01-x86_64.iso
2018/05/25 12:48:41 [Info] [465477382] App|Dispatcher: taking detour [direct] for [tcp:ftp.sjtu.edu.cn:80]
2018/05/25 12:48:41 [Info] [465477382] Proxy|Freedom: opening connection to tcp:ftp.sjtu.edu.cn:80
2018/05/25 12:48:41 [Info] [465477382] Transport|Internet|TCP: dialing TCP to tcp:ftp.sjtu.edu.cn:80
// OOM killed

暂时解决方案:

但根本问题没有解决

htfy96 commented 6 years ago

https://github.com/v2ray/v2ray-core/issues/1120