r-universe-org / help

Support and bug tracker for R-universe
https://docs.r-universe.dev/

Intermittent deploy failures (server restarts due to OOM) #110

Closed jeroen closed 2 years ago

jeroen commented 2 years ago

E.g. https://github.com/r-universe/certe-medical-epidemiology/actions/runs/1538751408/attempts/1

The actual error in this case is:

cranlike    | 2021-12-04T11:40:10.641561194Z [Debug] HTTP 400: MongoError: E11000 duplicate key error collection: cranlike.files.chunks index: files_id_1_n_1 dup key: { files_id: "b1977e63fb66d9595a1b6257459bcc3f", n: 0 }
cranlike    | 2021-12-04T11:40:10.650705697Z PUT /certe-medical-epidemiology/packages/AMR/1.7.1.9056/win/b1977e63fb66d9595a1b6257459bcc3f 400 314.601 ms - 1586

But it is probably caused by a crash earlier on. About five seconds before, nginx starts logging a lot of [error] recv() failed messages:

nginx       | 2021-12-04T11:40:05.917944041Z 141.101.105.135 - - [04/Dec/2021:11:40:05 +0000] "GET /avatars/openvolley.png HTTP/1.1" 302 118 "https://r-universe.dev/organizations/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
nginx       | 2021-12-04T11:40:05.984069774Z 2021/12/04 11:40:05 [error] 22#22: *757773 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 52.250.12.83, server: ~^(?<subdomain>.+)\.r-universe\.dev$, request: "PUT /packages/markedTMB/0.3.0/mac/b0c678f0973a0a7f5b75585aa0af8038 HTTP/1.1", upstream: "http://172.18.0.5:3000/dsjohnson/packages/markedTMB/0.3.0/mac/b0c678f0973a0a7f5b75585aa0af8038", host: "dsjohnson.r-universe.dev"
nginx       | 2021-12-04T11:40:05.985526312Z 52.250.12.83 - ropensci [04/Dec/2021:11:40:05 +0000] "PUT /packages/markedTMB/0.3.0/mac/b0c678f0973a0a7f5b75585aa0af8038 HTTP/1.1" 502 157 "-" "curl/7.68.0"
nginx       | 2021-12-04T11:40:05.987697507Z 2021/12/04 11:40:05 [error] 22#22: *757779 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 52.183.41.194, server: ~^(?<subdomain>.+)\.r-universe\.dev$, request: "PUT /packages/AMR/1.7.1.9056/win/b1977e63fb66d9595a1b6257459bcc3f HTTP/1.1", upstream: "http://172.18.0.5:3000/certe-medical-epidemiology/packages/AMR/1.7.1.9056/win/b1977e63fb66d9595a1b6257459bcc3f", host: "certe-medical-epidemiology.r-universe.dev"
nginx       | 2021-12-04T11:40:05.990447137Z 2021/12/04 11:40:05 [error] 22#22: *757799 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 13.105.49.52, server: ~^(?<subdomain>.+)\.r-universe\.dev$, request: "GET /src/contrib/PACKAGES.gz HTTP/1.1", upstream: "http://172.18.0.5:3000/steffilazerte/src/contrib/PACKAGES.gz", host: "steffilazerte.r-universe.dev"
nginx       | 2021-12-04T11:40:05.990675840Z 13.105.49.52 - - [04/Dec/2021:11:40:05 +0000] "GET /src/contrib/PACKAGES.gz HTTP/1.1" 502 157 "-" "libcurl/7.64.1"
nginx       | 2021-12-04T11:40:05.991082008Z 2021/12/04 11:40:05 [error] 22#22: *757800 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 40.88.40.156, server: ~^(?<subdomain>.+)\.r-universe\.dev$, request: "PUT /packages/sorvi/0.8.17/mac/4bca41597eb9712814f8dcf27403c22f HTTP/1.1", upstream: "http://172.18.0.5:3000/ropengov/packages/sorvi/0.8.17/mac/4bca41597eb9712814f8dcf27403c22f", host: "ropengov.r-universe.dev"
nginx       | 2021-12-04T11:40:05.991171344Z 40.88.40.156 - ropensci [04/Dec/2021:11:40:05 +0000] "PUT /packages/sorvi/0.8.17/mac/4bca41597eb9712814f8dcf27403c22f HTTP/1.1" 502 157 "-" "curl/7.68.0"
nginx       | 2021-12-04T11:40:05.991246702Z 2021/12/04 11:40:05 [error] 22#22: *757802 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 13.105.49.53, server: ~^(?<subdomain>.+)\.r-universe\.dev$, request: "GET /src/contrib/PACKAGES.rds HTTP/1.1", upstream: "http://172.18.0.5:3000/gavinsimpson/src/contrib/PACKAGES.rds", host: "gavinsimpson.r-universe.dev"
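
To locate the moment of the crash in a longer log, one option is to count these connection resets per second. The following is only a sketch: it assumes the log lines carry the same "service | timestamp" prefix as the excerpt above, and count_resets_per_second is a hypothetical helper name, not part of nginx or the R-universe tooling.

```shell
#!/usr/bin/env bash
# Count "recv() failed" errors per second from prefixed container logs on stdin.
# Assumes each line looks like: "nginx | <ISO timestamp> <nginx message...>",
# so the timestamp is the third whitespace-separated field.
count_resets_per_second() {
  awk '
    index($0, "recv() failed (104: Connection reset by peer)") > 0 {
      n[substr($3, 1, 19)]++          # truncate timestamp to whole seconds
    }
    END { for (t in n) print t, n[t] }
  ' | sort
}

# Typical use (the log file name is a placeholder):
#   count_resets_per_second < nginx.log
```

A sudden jump in the per-second count marks the point where the upstream (here the node process on port 3000) stopped answering.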

The server runs out of memory, but it is unclear to me what is using so much of it.

The kern.log shows that mongo takes 700 MB and node 500 MB, which is more than I would expect, but the server has 4 GB.

Part of the problem may also be that these servers don't have any swap; adding some is probably one way to mitigate this:

jeroen@packages:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3936        2141         187           2        1607        1566
Swap:             0           0           0
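
The same check can be scripted so that it is easy to run from cron or a monitoring hook. This is a minimal sketch, assuming a Linux host that exposes /proc/meminfo; mem_summary is a hypothetical helper name and the output format is made up for illustration.

```shell
#!/usr/bin/env bash
# Summarise memory headroom from /proc/meminfo-style input on stdin.
# Values in /proc/meminfo are reported in kB; they are converted to MB here.
mem_summary() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { swap  = $2 }
    END {
      printf "total_mb=%d available_mb=%d swap_mb=%d\n",
             total / 1024, avail / 1024, swap / 1024
      if (swap == 0) print "warning: no swap configured"
    }
  '
}

# Typical use:
#   mem_summary < /proc/meminfo
```

On a host like the one above, this would report the available memory and flag the missing swap in a single line that is easy to grep or alert on.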

jeroen commented 2 years ago

Maybe invest in some better monitoring as well, similar to @colinfay https://twitter.com/_ColinFay/status/1357681042472247297

jeroen commented 2 years ago

Added 4 GB of swap with swappiness=10, as in: https://www.digitalocean.com/community/tutorials/how-to-add-swap-space-on-ubuntu-20-04
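
For reference, the usual sequence for adding a swap file looks roughly like the sketch below. It is not a record of the exact commands run on this server: the /swapfile path, the variable names and the dry-run wrapper are assumptions added for illustration. By default it only prints the commands; applying them requires root.

```shell
#!/usr/bin/env bash
# Sketch: create a swap file and lower swappiness.
# SWAPFILE, SWAP_SIZE, SWAPPINESS and DRY_RUN are illustrative names;
# /swapfile is an assumed location.
SWAPFILE="${SWAPFILE:-/swapfile}"
SWAP_SIZE="${SWAP_SIZE:-4G}"
SWAPPINESS="${SWAPPINESS:-10}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_swap() {
  run fallocate -l "$SWAP_SIZE" "$SWAPFILE"   # reserve space for the swap file
  run chmod 600 "$SWAPFILE"                   # restrict access to root
  run mkswap "$SWAPFILE"                      # format the file as swap
  run swapon "$SWAPFILE"                      # enable it immediately
  run sysctl "vm.swappiness=$SWAPPINESS"      # prefer RAM, swap only under pressure
}

# setup_swap    # prints the plan; rerun as root with DRY_RUN=0 to apply
```

These settings do not survive a reboot on their own. To persist them, the swap file also needs an entry in /etc/fstab and the swappiness value a line in /etc/sysctl.conf, as the linked tutorial describes.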