nflaig opened this issue 1 year ago
I cannot reproduce the issue. Can you provide the log?

Also, you do not need two containers for v4 and v6. In the second config, just replace `0` with `1`:
```yaml
- CF_DNS__DOMAINS_0__ZONE_NAME=${DOMAIN}
- CF_DNS__DOMAINS_0__ZONE_ID=${CF_ZONE_ID:-}
- CF_DNS__DOMAINS_0__NAME=${DDNS_SUBDOMAIN}.${DOMAIN}
- CF_DNS__DOMAINS_0__PROXIED=${DDNS_PROXY}
- CF_DNS__DOMAINS_0__CREATE=true
- CF_DNS__DOMAINS_0__TYPE=A
- CF_DNS__DOMAINS_1__ZONE_NAME=${DOMAIN}
- CF_DNS__DOMAINS_1__ZONE_ID=${CF_ZONE_ID:-}
- CF_DNS__DOMAINS_1__NAME=${DDNS_SUBDOMAIN}.${DOMAIN}
- CF_DNS__DOMAINS_1__PROXIED=${DDNS_PROXY}
- CF_DNS__DOMAINS_1__CREATE=true
- CF_DNS__DOMAINS_1__TYPE=AAAA
```
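As background on why only the index needs to change: this double-underscore env naming conventionally maps onto a nested config object, with a trailing `_<n>` on a segment selecting an array index. A minimal sketch of such a parser — hypothetical, not the actual code used by cloudflare-ddns; `home.example.com` is a made-up value:

```javascript
// Hypothetical illustration (NOT the actual cloudflare-ddns parser): how
// CF_DNS__DOMAINS_0__TYPE-style variables typically map onto a nested config.
const parseEnv = (env, prefix = "CF_DNS__") => {
  const root = {};
  for (const [key, value] of Object.entries(env)) {
    if (!key.startsWith(prefix)) continue;
    const parts = key.slice(prefix.length).toLowerCase().split("__");
    let node = root;
    parts.forEach((part, i) => {
      const m = /^(.+)_(\d+)$/.exec(part); // e.g. "domains_0" -> ["domains", 0]
      const name = m ? m[1] : part;
      const leaf = i === parts.length - 1;
      if (m) {
        node[name] ??= [];
        if (leaf) node[name][Number(m[2])] = value;
        else node = node[name][Number(m[2])] ??= {};
      } else if (leaf) {
        node[name] = value;
      } else {
        node = node[name] ??= {};
      }
    });
  }
  return root;
};

// Demo with made-up values: index 0 becomes the A record, index 1 the AAAA.
const config = parseEnv({
  CF_DNS__DOMAINS_0__NAME: "home.example.com",
  CF_DNS__DOMAINS_0__TYPE: "A",
  CF_DNS__DOMAINS_1__TYPE: "AAAA",
});
```

Both records then live in one `domains` array inside a single container's config.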
> I cannot reproduce the issue. Can you provide the log?
I will dig into this later; I just dropped the issue yesterday because I thought maybe it was something obvious in the code.
> Also, you do not need two containers for v4 and v6. In the second config, just replace 0 with 1.
Thanks, this is good to know. I forwarded that information to eth-docker.
I removed the container yesterday and restarted it today; so far I am not seeing any increase in system CPU usage. I will keep monitoring this.
Regarding the process UID, this is likely due to `s6-setuidgid`, but the strange thing is I am not seeing this anymore either. Maybe it was some strange behavior when updating from 3.0.0 to 3.0.1.
OK, I am starting to see it again. My system CPU spiked around 7pm; here are the logs (the socket hang up looks suspicious):
```
> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20T18:45:01+0000] INFO (4158): Cloudflare DDNS start
[2023-06-20T18:45:25+0000] ERROR (4158): Failed to update <masked>. (Client network socket disconnected before secure TLS connection was established)
[2023-06-20T18:45:25+0000] INFO (4158): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20T18:50:01+0000] INFO (4192): Cloudflare DDNS start
[2023-06-20T18:50:23+0000] ERROR (4192): Failed to update <masked>. (socket hang up)
[2023-06-20T18:50:23+0000] INFO (4192): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20T18:55:01+0000] INFO (4226): Cloudflare DDNS start
[2023-06-20T18:55:17+0000] INFO (4226): Skipped updating.
[2023-06-20T18:55:17+0000] INFO (4226): Updated <masked> with <masked>
[2023-06-20T18:55:17+0000] INFO (4226): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20T19:00:01+0000] INFO (4262): Cloudflare DDNS start
[2023-06-20T19:00:10+0000] INFO (4262): Skipped updating.
[2023-06-20T19:00:10+0000] INFO (4262): Updated <masked> with <masked>
[2023-06-20T19:00:10+0000] INFO (4262): Cloudflare DDNS end

> cloudflare-ddns@3.0.1 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-06-20T19:05:01+0000] INFO (4296): Cloudflare DDNS start
[2023-06-20T19:05:07+0000] INFO (4296): Skipped updating.
[2023-06-20T19:05:07+0000] INFO (4296): Updated <masked> with <masked>
[2023-06-20T19:05:07+0000] INFO (4296): Cloudflare DDNS end
```
This is the process spawned for my host user that causes the CPU usage:
```
> ps -p 4086735 -o pid,user,comm,lstart,etime,args
PID     USER  COMMAND          STARTED                   ELAPSED   COMMAND
4086735 nico  npm run start:p  Tue Jun 20 16:54:59 2023  02:41:37  npm run start:pretty

> ps -p 4086747 -o pid,user,comm,lstart,etime,args
PID     USER  COMMAND  STARTED                   ELAPSED   COMMAND
4086747 nico  node     Tue Jun 20 16:55:00 2023  02:45:03  node index.mjs

> ps -p 4086747 -o %cpu,%mem,cmd
%CPU  %MEM  CMD
99.8  0.1   node index.mjs
```
Since I still cannot reproduce the issue, I would like to ask for your help. Can you try `ghcr.io/joshuaavalon/cloudflare-ddns:feat-detail-log`? I added many debug logs to different function calls and a process id each time it starts. Set `CF_DNS__LOG_LEVEL=debug` and you should be able to see something like the following log:
```
[2023-06-21T02:51:01+0000] DEBUG (269): Start
    processId: "c2578ffe-5712-4b18-8ab6-328547696597"
    method: "requestWebhook"
```
Or, if you want to grep by `processId`, you can set `CF_DNS__LOG_TYPE=json` and it prints JSON logs instead:
```json
{"level":50,"time":1687315899968,"pid":235,"hostname":"5c04695bba47","processId":"78ef9502-75f6-4416-8f7f-df4014ad4f10","method":"updateDnsRecords","msg":"Failed to update"}
```
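If you go the JSON route, filtering one run's entries can look like this (a sketch with made-up log lines and a sample `processId`; the real log path depends on how you run the container):

```shell
# Two interleaved runs, shaped like the CF_DNS__LOG_TYPE=json output (sample data)
cat > /tmp/cf-ddns.log <<'EOF'
{"level":30,"pid":235,"processId":"78ef9502-75f6-4416-8f7f-df4014ad4f10","method":"updateDnsRecords","msg":"Start"}
{"level":30,"pid":236,"processId":"11111111-2222-3333-4444-555555555555","method":"updateDnsRecords","msg":"Start"}
{"level":50,"pid":235,"processId":"78ef9502-75f6-4416-8f7f-df4014ad4f10","method":"updateDnsRecords","msg":"Failed to update"}
EOF

# Keep only the lines belonging to one run of the script
grep '"processId":"78ef9502-75f6-4416-8f7f-df4014ad4f10"' /tmp/cf-ddns.log
```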
I would like to ask you to look for the log during the CPU spike and check the following things (you can use `docker compose exec cf-ddns-v4 bash` to enter the container):

- Does it have one or multiple processes running?
- What if you kill the process?
- Does the problem persist after you restart the container?
- Does the spike happen during a fixed time period?
- What is the last logged message for the process? Is it always the same?
- Does it still happen if you use `CF_DNS__LOG_TYPE=json`?
> Since I still cannot reproduce the issue

So either this only happens because I am running 2 containers (which would still be a bug) or this is related to the socket hang up, which makes it hard to reproduce.

> I would like to ask for your help
Glad to help debug this, thanks for providing the image. I deployed it now with the configs you mentioned.
> check the following things
Let's wait and see; I will report back here once the issue happens again.
The issue happened again; I don't see anything suspicious in the logs though.
Regarding your questions:
> Does it have one or multiple processes running?
These are the processes inside the container:
```
PID   USER  TIME  COMMAND
   1  root  0:00  s6-svscan -t0 /var/run/s6/services
  35  root  0:00  s6-supervise s6-fdholderd
 207  root  0:00  s6-supervise cloudflare-ddns
 210  root  0:00  sh ./run
 247  root  0:00  crond -f
1975  node  0:00  npm run start:json
1986  node  5h25  node index.mjs
3432  root  0:00  bash
3439  root  0:00  ps
```
> What if you kill the process?
Killing the process (1986) fixes the issue.
> Does the problem persist after you restart the container?
No, I tested this multiple times; it is fixed for a while but then happens again.
> Does the spike happen during a fixed time period?
It happens 24/7 until the process is killed.
> What is the last logged message for the process? Is it always the same? (Assuming it stops at some place)
No suspicious log found.
> Does it still happen if you use CF_DNS__LOG_TYPE=json?
Yes.
Those answers are probably not that helpful; I will try to find out what the process does using some other tools.
Does it happen to IPv4 or IPv6? Maybe you can change out the default IP service (there are separate defaults for IPv4 and IPv6)?
> Does it happen to IPv4 or IPv6?
It happens on both; when I initially reported it, there were 2 processes. Since I killed the process it hasn't happened again; it has been 24h.
I have reproduced this with two containers. Restarting with one; let's see how it does.
It took one entire core, but no more.
"Two services" may not be related. On another box I've been up for 8 days with two services and it doesn't take a full core.
Collected a bit more information on what the process is doing using perf + inferno (see https://github.com/ChainSafe/lodestar/issues/5604#issuecomment-1581091711 for full instructions), but the results are not that helpful either. I would have to add the `--perf-basic-prof` flag to node to get more details on the `[unknown]` frames. I have a few guesses on what could be the issue.

@nflaig That function should not run if you did not define any webhook.
If you are using `ghcr.io/joshuaavalon/cloudflare-ddns:feat-detail-log`, you should be able to see the URL that it is requesting in the debug log.
> That function should not run if you did not define any webhook
You might be right; I haven't looked further into whether this is really the case.
eth-docker now uses just 1 container for both IPv4 and IPv6 (traefik-cf.yml#L51), but I am still seeing this issue. I disabled the container for now on my end. I might take another look at it over the weekend if I find the time.
I'm having the same issue: sudden and sustained CPU usage of 300%. I restarted the container and it went back to normal, until it went to 100% again the next day (again for a sustained period) until I restarted. No anomalies in the logs that I could find, just an endless repetition of:
```
> cloudflare-ddns@3.0.0 start:pretty
> node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

[2023-07-04T07:45:01+0000] INFO (273): Cloudflare DDNS start
[2023-07-04T07:45:10+0000] INFO (273): Skipped updating.
[2023-07-04T07:45:10+0000] INFO (273): Updated [mydomain] with [my IP]
[2023-07-04T07:45:10+0000] INFO (273): Cloudflare DDNS end
```
A new version, v3.1.0, has been released. I suspect it may be related to s6-overlay. Please see if there are still any issues.
> Please see if there are still any issues
Deployed. s6-overlay seems like a good candidate since the process was also spawned for my host user. Not sure when I can confirm that the issue is fixed; I guess I have to wait for at least a week.
So far so good on my end.
It happened again with v3.1.0. The difference now seems to be that the process is executed by root instead of my host user.
```
> ps -p 1032708 -o pid,user,comm,lstart,etime,args
PID     USER  COMMAND  STARTED                  ELAPSED   COMMAND
1032708 root  node     Thu Jul 6 23:00:00 2023  09:10:48  node index.mjs

> ps -p 1032708 -o %cpu,%mem,cmd
%CPU  %MEM  CMD
99.9  0.1   node index.mjs
```
How about `docker stats`? Does it also report 100%?
> How about docker stats? Does it also report 100%?
Yes, same result there:
```
CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT     MEM %   NET I/O          BLOCK I/O        PIDS
cca82c9f3372   mainnet-cf-ddns-1   100.28%   120.9MiB / 31.06GiB   0.38%   4.23MB / 1.9MB   124MB / 6.03MB   37
```
Same on my end: once again 100 to 200% CPU usage since yesterday.
```
CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT   MEM %   NET I/O           BLOCK I/O      PIDS
e1d1235076b7   cloudflare-ddns   200.49%   145MiB / 31.08GiB   0.46%   6.54MB / 2.21MB   281MB / 113MB  72
```
I have published a Debian-based image. The tag has a `-debian` suffix. Can you try the Debian image?
> Can you try the Debian image?
Deployed.
No issues so far, but I should probably test this for a few more days, and maybe someone else can test as well. Is there solid reasoning behind why the Debian image should fix the issue?

Edit: The issue happens with the Debian image as well.
I have a test image `feat-axios` (`feat-axios-debian`) that removes axios and uses native fetch only. Also, I have added a timeout for fetch.
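For reference, a fetch timeout in Node 18+ can be as small as an abort signal. This is only a sketch of the general technique, not the actual `feat-axios` code; `fetchWithTimeout` is a hypothetical helper name:

```javascript
// Hypothetical helper: abort the request if no response arrives within
// timeoutMs. Without some timeout, a half-open connection (the "socket hang
// up" seen earlier in the logs) can leave a request pending indefinitely.
const fetchWithTimeout = (url, timeoutMs = 30_000) =>
  fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
```

When the timer fires, the pending `fetch` promise rejects with a `TimeoutError`, so the caller can log and retry on the next cron run instead of hanging.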
Can you try it?
> Can you try it?
Deployed `feat-axios`.
So far so good on my end. Anyone else running this image?
It has been over a week now and still no problems. If you do another release with the fixes from `feat-axios`, I think we can consider this issue resolved.
All seems to be well on my side as well with the new release.
Deployed version 3.2.3 yesterday and I am seeing the issue again. `feat-axios` did not have it for a week, but I guess the image itself was not working properly.
Can you check if it is similar to https://github.com/pi-hole/docker-pi-hole/issues/1042#issuecomment-1086728157? See if it is related to the version of `libseccomp2`.
It doesn't use libseccomp from what I can see:
```
/app # apk list -I | grep libsec
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
/app #
```
It's an Alpine image. If it were to use it, it'd be a recent version:
```
/app # apk add libseccomp
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/1) Installing libseccomp (2.5.4-r2)
Executing busybox-1.36.1-r0.trigger
OK: 12 MiB in 22 packages
```
@yorickdowne You should look at the link in the comment: https://docs.linuxserver.io/faq
> My host is incompatible with images based on Ubuntu Focal and Alpine 3.13 and later. This only affects 32-bit installs of distros based on Debian Buster.
> This is due to a bug in the libseccomp2 library (a dependency of Docker itself), which is fixed. However, it's not pushed to all the repositories.
It is talking about the host. I have checked my host, which uses 2.5.1, and I did not have the problem.
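For anyone else comparing, reading the host's libseccomp2 version on a dpkg-based system (Debian/Ubuntu) can look like the sketch below; other distros need their own package query, and the fallback message is just for non-dpkg hosts:

```shell
# Print the installed libseccomp2 version on a dpkg-based host,
# or a fallback message where dpkg is unavailable or the package is absent.
dpkg-query -W -f='${Version}\n' libseccomp2 2>/dev/null || echo "libseccomp2 version unavailable"
```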
Oh, wow. Let me see!
I'll say "probably not". I have seen this behavior and I am on either Ubuntu 22.04 (libseccomp2 2.5.3) or Debian 12 (libseccomp2 2.5.4).
This issue is somewhat rare. I have three machines right now where I'm seeing it; one of those is on Debian 12, libseccomp2 2.5.4, docker 24.0.5
I'll leave it in this state. What data can I get you?
From the issue above, it should only happen with old versions of libseccomp.

Which means this is not the same issue. Whatever is going on here is not related to the issue that Pi-hole encountered.

Can you check whether your system is using a 32- or 64-bit OS?
Always 64-bit. These are all pretty simple systems: either baremetal (OVH) or an AWS/GCP instance, usually with Debian 12 and some Ubuntu 22.04.
Sample Debian uname:

```
debian@avalanche-a:~$ uname -a
Linux avalanche-a 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-1 (2023-07-14) x86_64 GNU/Linux
```

Sample Ubuntu uname:

```
ubuntu@fantom-c:~$ uname -a
Linux fantom-c 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```
These overwhelmingly use docker-ce, though a few are on docker.io. docker.io is 20.10.x, and docker-ce is currently 24.0.x.
The number of systems that show the issue is a simple matter of time: 3 days ago it was 3, and I restarted 2 of them; now it is 8.
Can you test the `feat-docker` branch image? It updates to the latest Node.js, which has updates to fetch.
@joshuaavalon I would suggest getting another perf profile (https://github.com/joshuaavalon/docker-cloudflare/issues/79#issuecomment-1606128041); could you publish an image with the `--perf-basic-prof` flag?
@nflaig You should be able to add `--perf-basic-prof` to `NODE_OPTIONS` in the environment variables.
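In a compose file, that could look like the fragment below (the service name here is hypothetical; adjust to your setup):

```yaml
services:
  cf-ddns:
    image: ghcr.io/joshuaavalon/cloudflare-ddns:latest
    environment:
      - NODE_OPTIONS=--perf-basic-prof
```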
@joshuaavalon Yeah, right, that's a bit simpler than another build.
With `NODE_OPTIONS=--perf-basic-prof` set, this is sadly not much more helpful than the previous flame graph.
`feat-docker` may have done something good, or I haven't given it enough time. I've deployed it fairly widely and haven't seen the issue since.
I can confirm this behaviour on the current `ghcr.io/joshuaavalon/cloudflare-ddns:latest` docker image. I didn't investigate further, but 100% CPU usage on one core can be seen after 6 days of uptime. There doesn't seem to be any real cause.
When running htop on the host system, I can see 100% CPU usage for this process:
```
# parent process
sh -c node index.mjs | pino-pretty --translateTime yyyy-mm-dd'T'HH:MM:sso

# child process with 100% cpu usage
node index.mjs
```
Can you try the `feat-docker` tag and see how it behaves, @leomeinel? That one updates node.
> Can you try the feat-docker tag and see how it behaves? That one updates node
I'll test that tag in the coming days, thanks for the suggestion. Are there plans to merge this into latest/main if it resolves this issue?
I deployed `feat-docker` on my end. Thanks @yorickdowne for adding the envs in eth-docker :100:
I can merge it if you don't have any problems.
Is there an existing issue for this?
Version
3.0.1
Describe The Bug
Causes 100% CPU utilization 24/7 and spawns a process for the host user.
(Screenshots: CPU usage, user processes)
Steps To Reproduce
Run cloudflare-ddns via docker, see https://github.com/eth-educators/eth-docker/blob/63b651d56e83d7b513d9350cd00eca3383eecfc0/traefik-cf.yml#L52
Expected Behavior
Should not spawn a process for the host user or cause 100% CPU utilization 24/7.
ENV Configuration
JS Configuration
No response
Relevant Output Log
No response