joshuaavalon / docker-cloudflare

Cloudflare DDNS minimal docker.
https://hub.docker.com/r/joshava/cloudflare-ddns/
MIT License
241 stars 39 forks

100% CPU usage and spawns process for host user #79

Open · nflaig opened this issue 1 year ago

nflaig commented 1 year ago

Is there an existing issue for this?

Version

3.0.1

Describe The Bug

Causes 100% CPU utilization 24/7 and spawns a process for the host user.

CPU usage

ps -p 3607138 -o %cpu,%mem,cmd
%CPU %MEM CMD
99.9  0.0 node index.mjs

User processes

ps -u <user>
    PID TTY          TIME CMD
3607110 ?        00:00:00 npm run start:p
3607137 ?        00:00:00 sh
3607138 ?        3-04:40:07 node
3607139 ?        00:00:00 node
3892167 ?        00:00:00 npm run start:p
3892187 ?        00:00:00 sh
3892188 ?        20:42:45 node
3892189 ?        00:00:00 node
3988157 ?        00:00:00 systemd
3988158 ?        00:00:00 (sd-pam)
3988302 ?        00:00:00 sshd
3988304 pts/0    00:00:00 bash
3992888 pts/0    00:00:00 ps

Steps To Reproduce

Run cloudflare-ddns via Docker; see https://github.com/eth-educators/eth-docker/blob/63b651d56e83d7b513d9350cd00eca3383eecfc0/traefik-cf.yml#L52

Expected Behavior

It should not spawn a process for the host user or cause 100% CPU utilization 24/7.

ENV Configuration

CF_DNS__CRON=*/5 * * * *
ZONE=
CF_DNS__LOG_LEVEL=info
USER=root
npm_config_user_agent=npm/9.5.1 node/v18.16.0 linux x64 workspaces/false
CF_DNS__DOMAINS_0__ZONE_NAME=<masked>
NODE_VERSION=18.16.0
YARN_VERSION=1.22.19
HOSTNAME=469acd70073b
npm_node_execpath=/usr/local/bin/node
SHLVL=4
npm_config_noproxy=
HOME=/root
OLDPWD=/root
CWD=/app
npm_package_json=/app/package.json
API=
npm_config_userconfig=/root/.npmrc
npm_config_local_prefix=/app
CF_DNS__DOMAINS_0__ZONE_ID=
CF_DNS__DOMAINS_0__CREATE=true
COLOR=0
npm_config_metrics_registry=https://registry.npmjs.org/
CF_DNS__DOMAINS_0__TYPE=A
LOGNAME=root
RUNONCE=
IPV6=
npm_config_cache=/home/node/.npm
npm_config_node_gyp=/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js
PATH=/app/node_modules/.bin:/node_modules/.bin:/usr/local/lib/node_modules/npm/node_modules/@npmcli/run-script/lib/node-gyp-bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PROXY=
EMAIL=
CF_DNS__DOMAINS_0__NAME=<masked>
NODE=/usr/local/bin/node
npm_package_name=cloudflare-ddns
CF_DNS__AUTH__SCOPED_TOKEN=<masked>
npm_lifecycle_script=node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso
HOST=
SHELL=/bin/ash
npm_package_version=3.0.1
npm_lifecycle_event=start:pretty
TTL=
NPM_CONFIG_UPDATE_NOTIFIER=false
FORCE_CREATE=
npm_config_globalconfig=/app/.npm/etc/npmrc
npm_config_init_module=/root/.npm-init.js
PWD=/app
npm_execpath=/usr/local/lib/node_modules/npm/bin/npm-cli.js
npm_config_global_prefix=/app/.npm
npm_command=run-script
NODE_ENV=production
CF_DNS__DOMAINS_0__PROXIED=true
CF_DNS_CONFIG=/app/config.yaml
NPM_CONFIG_PREFIX=/app/.npm
INIT_CWD=/app
EDITOR=vi

JS Configuration

No response

Relevant Output Log

No response

joshuaavalon commented 1 year ago

I cannot reproduce the issue. Can you provide the log?

Also, you do not need two containers for v4 and v6. For the second config, just replace 0 with 1.

- CF_DNS__DOMAINS_0__ZONE_NAME=${DOMAIN}
- CF_DNS__DOMAINS_0__ZONE_ID=${CF_ZONE_ID:-}
- CF_DNS__DOMAINS_0__NAME=${DDNS_SUBDOMAIN}.${DOMAIN}
- CF_DNS__DOMAINS_0__PROXIED=${DDNS_PROXY}
- CF_DNS__DOMAINS_0__CREATE=true
- CF_DNS__DOMAINS_0__TYPE=A
- CF_DNS__DOMAINS_1__ZONE_NAME=${DOMAIN}
- CF_DNS__DOMAINS_1__ZONE_ID=${CF_ZONE_ID:-}
- CF_DNS__DOMAINS_1__NAME=${DDNS_SUBDOMAIN}.${DOMAIN}
- CF_DNS__DOMAINS_1__PROXIED=${DDNS_PROXY}
- CF_DNS__DOMAINS_1__CREATE=true
- CF_DNS__DOMAINS_1__TYPE=AAAA
nflaig commented 1 year ago

I cannot reproduce the issue. Can you provide the log?

I will dig into this later; I just dropped the issue yesterday because I thought maybe it was something obvious in the code.

Also, you do not need two containers for v4 and v6. For the second config, just replace 0 with 1.

Thanks, this is good to know. Forwarded that information to eth-docker.

nflaig commented 1 year ago

I removed the container yesterday and restarted it today; so far I am not seeing any increase in system CPU usage. Will keep monitoring this.

Regarding the process UID, this is likely due to s6-setuidgid, but the strange thing is that I am not seeing this anymore either.

https://github.com/joshuaavalon/docker-cloudflare/blob/fcebb8a1dff3930b1f968a18e116b38725affc46/docker/root/app/cloudflare.sh#L8

Maybe it was some strange behavior when updating from 3.0.0 to 3.0.1.

nflaig commented 1 year ago

OK, I am starting to see it again. My system CPU spiked around 7pm; here are the logs (the socket hang up looks suspicious):

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20P18:45:01+0000] INFO (4158): Cloudflare DDNS start
[2023-06-20P18:45:25+0000] ERROR (4158): Failed to update <masked>. (Client network socket disconnected before secure TLS connection was established)
[2023-06-20P18:45:25+0000] INFO (4158): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20P18:50:01+0000] INFO (4192): Cloudflare DDNS start
[2023-06-20P18:50:23+0000] ERROR (4192): Failed to update <masked>. (socket hang up)
[2023-06-20P18:50:23+0000] INFO (4192): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20P18:55:01+0000] INFO (4226): Cloudflare DDNS start
[2023-06-20P18:55:17+0000] INFO (4226): Skipped updating.
[2023-06-20P18:55:17+0000] INFO (4226): Updated <masked> with <masked>
[2023-06-20P18:55:17+0000] INFO (4226): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20P19:00:01+0000] INFO (4262): Cloudflare DDNS start
[2023-06-20P19:00:10+0000] INFO (4262): Skipped updating.
[2023-06-20P19:00:10+0000] INFO (4262): Updated <masked> with <masked>
[2023-06-20P19:00:10+0000] INFO (4262): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20P19:05:01+0000] INFO (4296): Cloudflare DDNS start
[2023-06-20P19:05:07+0000] INFO (4296): Skipped updating.
[2023-06-20P19:05:07+0000] INFO (4296): Updated <masked> with <masked>
[2023-06-20P19:05:07+0000] INFO (4296): Cloudflare DDNS end

This is the process that was spawned for my host user and causes the CPU usage:

> ps -p 4086735 -o pid,user,comm,lstart,etime,args

PID USER     COMMAND                          STARTED     ELAPSED COMMAND
4086735 nico     npm run start:p Tue Jun 20 16:54:59 2023    02:41:37 npm run start:pretty

---
> ps -p 4086747 -o pid,user,comm,lstart,etime,args

PID USER     COMMAND                          STARTED     ELAPSED COMMAND
4086747 nico     node            Tue Jun 20 16:55:00 2023    02:45:03 node index.mjs

---
> ps -p 4086747 -o %cpu,%mem,cmd

%CPU %MEM CMD
99.8  0.1 node index.mjs
joshuaavalon commented 1 year ago

Since I still cannot reproduce the issue, I would like to ask for your help.

  1. Use ghcr.io/joshuaavalon/cloudflare-ddns:feat-detail-log. I added debug logs to the different function calls and a process ID each time it starts.
  2. Use CF_DNS__LOG_LEVEL=debug

You should be able to see something like the following log:

 [2023-06-21A02:51:01+0000] DEBUG (269): Start
     processId: "c2578ffe-5712-4b18-8ab6-328547696597"
    method: "requestWebhook"

Or, if you want to grep by processId, you can use CF_DNS__LOG_TYPE=json and it will print JSON logs instead.

{"level":50,"time":1687315899968,"pid":235,"hostname":"5c04695bba47","processId":"78ef9502-75f6-4416-8f7f-df4014ad4f10","method":"updateDnsRecords","msg":"Failed to update"}

I would like to ask you to look for the log during the CPU spike and check the following things:

  1. Use docker compose exec cf-ddns-v4 bash to enter the container
    1. Does it have one or multiple processes running?
    2. What happens if you kill the process?
  2. Does the problem persist after you restart the container?
  3. Does the spike happen during a fixed time period?
  4. What is the last logged message for the process? Is it always the same? (assuming it stops at some point)
  5. Does it still happen if you use CF_DNS__LOG_TYPE=json?
nflaig commented 1 year ago

Since I still cannot reproduce the issue

So either this only happens because I am running 2 containers (which would still be a bug), or it is related to the socket hang up, which makes it hard to reproduce.

I would like to ask for your help

Glad to help debug this, thanks for providing the image. I have deployed it now with the configs you mentioned.

check the following things

Let's wait and see; I will report back here once the issue happens again.

nflaig commented 1 year ago

The issue happened again; I don't see anything suspicious in the logs though.

Regarding your questions:

Does it have one or multiple processes running?

These are the processes inside the container

PID   USER     TIME  COMMAND
    1 root      0:00 s6-svscan -t0 /var/run/s6/services
   35 root      0:00 s6-supervise s6-fdholderd
  207 root      0:00 s6-supervise cloudflare-ddns
  210 root      0:00 sh ./run
  247 root      0:00 crond -f
 1975 node      0:00 npm run start:json
 1986 node      5h25 node index.mjs
 3432 root      0:00 bash
 3439 root      0:00 ps

What if you kill the process?

Killing the process (1986) fixes the issue

Does the problem persist after you restart the container?

No, I tested this multiple times, it is fixed for a while but then happens again.

Does the spike happen during fixed time period?

It happens 24/7 until the process is killed.

What is the last logged message for the process? Is it always the same? (Assume if it stops at some places)

No suspicious log found

Does it still happen if you use CF_DNS__LOG_TYPE=json?

yes


Those answers are probably not that helpful; I will try to find out what the process is doing using some other tools.

joshuaavalon commented 1 year ago

Does it happen with IPv4 or IPv6? Maybe you can swap out the default IP service?

The defaults for IPv4 are:

IPv6

nflaig commented 1 year ago

Does it happen to IPv4 or IPv6?

It happens on both; when I initially reported it, there were 2 processes.

Since I killed the process it hasn't happened again; it has been 24h.

yorickdowne commented 1 year ago

I have reproduced this with two containers. Restarting with one; let's see how it does.

It took one entire core, but no more.

"Two services" may not be related. On another box I've been up for 8 days with two services and it doesn't take a full core.

nflaig commented 1 year ago

Collected a bit more information on what the process is doing using perf + inferno (see https://github.com/ChainSafe/lodestar/issues/5604#issuecomment-1581091711 for full instructions), but the results are not that helpful either. I would have to add the --perf-basic-prof flag to node to get more details on the [unknown] frames.

[flame graph image]

A few guesses on what could be the issue:

joshuaavalon commented 1 year ago

@nflaig That function should not run if you did not define any webhook

https://github.com/joshuaavalon/docker-cloudflare/blob/5b4fd915c9fa2017cac2633186b1b9c29de299d9/packages/app/src/index.ts#L28-L50

If you are using ghcr.io/joshuaavalon/cloudflare-ddns:feat-detail-log, you should be able to see the URL that it is requesting in the debug log.
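For illustration, the kind of guard being described looks roughly like this (a hedged sketch, not the repo's actual implementation):

// Only issue the webhook request when a webhook URL is actually configured.
const requestWebhook = async (url?: string): Promise<void> => {
  if (!url) {
    return; // no webhook defined: skip the network call entirely
  }
  await fetch(url);
};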

nflaig commented 1 year ago

That function should not run if you did not define any webhook

You might be right; I haven't looked further into whether this is really the case.

eth-docker now just uses one container for both IPv4 and IPv6 (traefik-cf.yml#L51), but I am still seeing this issue. I have disabled the container for now on my end.

I might take another look at it over the weekend if I find the time.

Fr33loader commented 1 year ago

I'm having the same issue: sudden and sustained CPU usage of 300%. I restarted the container and it went back to normal, until it went to 100% again the next day (again for a sustained period) until I restarted.

No anomalies in the logs that I could find, just an endless repetition of:

> cloudflare-ddns@3.0.0 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-07-04A07:45:01+0000] INFO (273): Cloudflare DDNS start
[2023-07-04A07:45:10+0000] INFO (273): Skipped updating.
[2023-07-04A07:45:10+0000] INFO (273): Updated [mydomain] with [my IP]
[2023-07-04A07:45:10+0000] INFO (273): Cloudflare DDNS end
joshuaavalon commented 1 year ago

A new version, v3.1.0, has been released. I suspect it may be related to s6-overlay. Please see if there are still any issues.

nflaig commented 1 year ago

Please see if there are still any issues

Deployed. s6-overlay seems like a good candidate since the process was also spawned for my host user. Not sure when I can confirm that the issue is fixed; I guess I have to wait at least a week.

Fr33loader commented 1 year ago

So far so good on my end.

nflaig commented 1 year ago

It happened again with v3.1.0. The difference now seems to be that the process is executed by root instead of my host user.

> ps -p 1032708 -o pid,user,comm,lstart,etime,args
PID USER     COMMAND                          STARTED     ELAPSED COMMAND
1032708 root     node            Thu Jul  6 23:00:00 2023    09:10:48 node index.mjs

> ps -p 1032708 -o %cpu,%mem,cmd
%CPU %MEM CMD
99.9  0.1 node index.mjs
joshuaavalon commented 1 year ago

How about docker stats? Does it also report 100%?

nflaig commented 1 year ago

How about docker stats? Does it also report 100%?

Yes, same result there

CONTAINER ID   NAME                                  CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
cca82c9f3372   mainnet-cf-ddns-1                     100.28%   120.9MiB / 31.06GiB   0.38%     4.23MB / 1.9MB    124MB / 6.03MB    37
Fr33loader commented 1 year ago

Idem on my end: once again 100 to 200% CPU usage since yesterday.

CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O       PIDS
e1d1235076b7   cloudflare-ddns   200.49%   145MiB / 31.08GiB   0.46%     6.54MB / 2.21MB   281MB / 113MB   72
joshuaavalon commented 1 year ago

I have published a Debian-based image. The tags have a -debian suffix. Can you try the Debian image?

nflaig commented 1 year ago

Can you try the Debian image?

Deployed

nflaig commented 1 year ago

Deployed

No issues so far, but I should probably test this for a few more days, and maybe someone else can test as well.

Is there solid reasoning for why the Debian image should fix the issue?

Edit: The issue happens with the Debian image as well.

joshuaavalon commented 1 year ago

I have a test image feat-axios (feat-axios-debian) that removes axios and uses native fetch only. Also, I have added a timeout for fetch.
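Roughly, a native fetch call with a timeout looks like this (a hedged sketch, not the image's actual code; the URL and the 10-second timeout are illustrative):

// Fetch the public IP with native fetch (Node 18+) and abort after 10 seconds.
const fetchPublicIp = async (url: string): Promise<string> => {
  const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
  if (!res.ok) {
    throw new Error(`IP lookup failed with status ${res.status}`);
  }
  return (await res.text()).trim();
};

// Example usage with an illustrative IP echo service:
// const ip = await fetchPublicIp("https://ipv4.icanhazip.com");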

Can you try it?

nflaig commented 1 year ago

Can you try it?

Deployed feat-axios

nflaig commented 1 year ago

Deployed feat-axios

So far so good on my end. Anyone else running this image?

nflaig commented 1 year ago

It has been over a week now and still no problems. If you do another release with the fixes from feat-axios, I think we can consider this issue resolved.

Fr33loader commented 1 year ago

All seems to be well on my side as well with the new release.

nflaig commented 1 year ago

Deployed version 3.2.3 yesterday and I am seeing the issue again. feat-axios did not have it for a week, but I guess that image itself was not working properly.

joshuaavalon commented 1 year ago

Can you check if it is similar to https://github.com/pi-hole/docker-pi-hole/issues/1042#issuecomment-1086728157?

See if it is related to the libseccomp2 version.

yorickdowne commented 1 year ago

Doesn't use libseccomp from what I can see

/app # apk list -I | grep libsec
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
/app # 

It's an Alpine image.

If it were to use it, it'd be a recent version:

/app # apk add libseccomp
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/1) Installing libseccomp (2.5.4-r2)
Executing busybox-1.36.1-r0.trigger
OK: 12 MiB in 22 packages
joshuaavalon commented 1 year ago

@yorickdowne You should look at the link in the comment. https://docs.linuxserver.io/faq

My host is incompatible with images based on Ubuntu Focal and Alpine 3.13 and later. This only affects 32-bit installs of distros based on Debian Buster.

This is due to a bug in the libseccomp2 library (dependency of Docker itself), which is fixed. However, it's not pushed to all the repositories.

It is talking about the host. I have checked my host, which uses 2.5.1, and I did not have the problem.

yorickdowne commented 1 year ago

Oh, wow. Let me see!

I'll say "probably not". I have seen this behavior and I am on either Ubuntu 22.04 (libseccomp2 2.5.3) or Debian 12 (libseccomp2 2.5.4).

yorickdowne commented 1 year ago

This issue is somewhat rare. I have three machines right now where I'm seeing it; one of those is on Debian 12, libseccomp2 2.5.4, docker 24.0.5

I'll leave it in this state. What data can I get you?

joshuaavalon commented 1 year ago

From the issue above, it should only happen with old versions of libseccomp.

yorickdowne commented 1 year ago

Which means this is not the same issue. Whatever is going on here is not related to the issue that Pi-hole encountered.

joshuaavalon commented 1 year ago

Can you check whether your system is using a 32- or 64-bit OS?

yorickdowne commented 1 year ago

Always 64-bit. These are all pretty simple systems: either bare metal (OVH) or an AWS/GCP instance, usually Debian 12 with some Ubuntu 22.04.

Sample Debian uname

debian@avalanche-a:~$ uname -a
Linux avalanche-a 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-1 (2023-07-14) x86_64 GNU/Linux

Sample Ubuntu uname

ubuntu@fantom-c:~$ uname -a
Linux fantom-c 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
yorickdowne commented 1 year ago

These overwhelmingly use docker-ce, though a few are on docker.io. docker.io is 20.10.x, and docker-ce is currently 24.0.x.

The number of systems that show the issue is a simple matter of time. 3 days ago it was 3, and I restarted 2 of them; now it is 8.

joshuaavalon commented 1 year ago

Can you test the feat-docker branch image? It is updated to the latest Node.js, which has updates to fetch.

nflaig commented 1 year ago

@joshuaavalon I would suggest getting another perf profile (https://github.com/joshuaavalon/docker-cloudflare/issues/79#issuecomment-1606128041); could you publish an image with the --perf-basic-prof flag?

joshuaavalon commented 1 year ago

@nflaig You should be able to add --perf-basic-prof to NODE_OPTIONS in the environment variables.

nflaig commented 1 year ago

@joshuaavalon Yeah, you're right, that's a bit simpler than another build.

With NODE_OPTIONS=--perf-basic-prof set

[flame graph attachment: out]

This is sadly not much more helpful than the previous flame graph.

yorickdowne commented 1 year ago

feat-docker may have done something good, or I haven't given it enough time. I've deployed it fairly widely and haven't seen the issue since.

leomeinel commented 1 year ago

I can confirm this behaviour on the current ghcr.io/joshuaavalon/cloudflare-ddns:latest Docker image. I didn't investigate further, but 100% CPU usage on one core can be seen after 6 days of uptime. There doesn't seem to be any obvious cause.

When running htop on the host system, I can see 100% CPU usage for this process:

# parent process
sh -c node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso
# child process with 100% cpu usage
node index.mjs
yorickdowne commented 1 year ago

Can you try the feat-docker tag and see how it behaves, @leomeinel? That one updates Node.

leomeinel commented 1 year ago

Can you try the feat-docker tag and see how it behaves, @leomeinel? That one updates Node.

I'll test that tag in the coming days, thanks for the suggestion. Are there plans to merge this into latest/main if it resolves the issue?

nflaig commented 1 year ago

I deployed feat-docker on my end; thanks @yorickdowne for adding the envs in eth-docker :100:

joshuaavalon commented 1 year ago

I can merge if you don't have any problems.