pi-hole / FTL

The Pi-hole FTL engine
https://pi-hole.net

FTL running at 98% of CPU #239

Closed PromoFaux closed 6 years ago

PromoFaux commented 6 years ago

@nitschkecm commented on Mon Feb 26 2018

In raising this issue, I confirm the following: {please fill the checkboxes, e.g: [X]}

How familiar are you with the source code relevant to this issue?: 1

Expected behaviour: FTL should be running at 2-4% of the CPU.

Actual behaviour: FTL is running at 98% of the CPU and makes the web interface unusable. It also slows down the DNS responses.

Debug token provided by uploading pihole -d log: wl2hdbs4da

DL6ER commented 6 years ago

Why is it running at such a high CPU level? What do the most recent lines in your /var/log/pihole.log and /var/log/pihole-FTL.log look like?

Are you running on low-power hardware or beefier hardware? Roughly how many queries are incoming per second?

nitschkecm commented 6 years ago

Hi

/var/log/pihole.log
Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 query[A] www--crtm--es.accesible.inclusite.com from 192.168.1.35
Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 forwarded www--crtm--es.accesible.inclusite.com to 80.58.61.254
Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 forwarded www--crtm--es.accesible.inclusite.com to 80.58.61.250
Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 reply www--crtm--es.accesible.inclusite.com is
Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 reply accesible.inclusite.com is
Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 reply accesible-es.inclusite.com is 109.234.82.110

/var/log/pihole-FTL.log
[2018-03-02 12:30:50.407] Notice: Increasing queries struct size from 100000 to 110000 (4.94 MB)
[2018-03-02 12:44:46.691] Notice: Increasing queries struct size from 110000 to 120000 (5.38 MB)
[2018-03-02 15:42:01.177] Notice: Increasing domains struct size from 2000 to 3000 (5.40 MB)
[2018-03-02 16:58:02.711] Notice: Increasing queries struct size from 120000 to 130000 (5.85 MB)
[2018-03-02 17:15:09.189] Notice: Increasing queries struct size from 130000 to 140000 (6.29 MB)

The hardware is a Raspberry Pi 2B+. It was running great until I upgraded to 3.3 FTL.
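The "Increasing queries struct size" notices in the FTL log can give a rough answer to the queries-per-second question. The sketch below assumes (this is not confirmed anywhere in the thread) that FTL grows the struct at the moment the old size fills up, so the time between two consecutive notices corresponds to the difference in the old sizes. The function name and the sample list are made up for illustration; the log lines themselves are the ones posted above.

```python
import re
from datetime import datetime

# Matches only the "queries struct" notices; the "domains struct" line is skipped.
RESIZE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"Notice: Increasing queries struct size from (?P<old>\d+) to (?P<new>\d+)"
)


def estimate_query_rates(log_lines):
    """Return (start, end, queries_per_second) for consecutive resize notices.

    Assumption: a resize is logged when the number of stored queries
    reaches the old struct size.
    """
    events = []
    for line in log_lines:
        match = RESIZE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, int(match.group("old"))))

    rates = []
    for (t0, n0), (t1, n1) in zip(events, events[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds > 0:
            rates.append((t0, t1, (n1 - n0) / seconds))
    return rates


# Lines copied from the pihole-FTL.log excerpt above.
SAMPLE_LOG = [
    "[2018-03-02 12:30:50.407] Notice: Increasing queries struct size from 100000 to 110000 (4.94 MB)",
    "[2018-03-02 12:44:46.691] Notice: Increasing queries struct size from 110000 to 120000 (5.38 MB)",
    "[2018-03-02 15:42:01.177] Notice: Increasing domains struct size from 2000 to 3000 (5.40 MB)",
    "[2018-03-02 16:58:02.711] Notice: Increasing queries struct size from 120000 to 130000 (5.85 MB)",
    "[2018-03-02 17:15:09.189] Notice: Increasing queries struct size from 130000 to 140000 (6.29 MB)",
]

if __name__ == "__main__":
    for start, end, qps in estimate_query_rates(SAMPLE_LOG):
        print(f"{start:%H:%M:%S} to {end:%H:%M:%S}: about {qps:.1f} queries/s")
```

Under that assumption, the excerpt points to bursts of roughly 10 to 12 queries per second with a much quieter stretch in between. These are averages over several minutes, so short spikes would not show up.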

AzureMarker commented 6 years ago

@nitschkecm Do you still have FTL at a high CPU usage? Make a new debug token.

nitschkecm commented 6 years ago

@Mcat12 I have swapped the hardware for a Raspberry Pi 3, and for the moment it seemed to run fine. I have just checked and we are back at 100%. I am running pihole -d now.

Update: here is the token: v00q0j2o9k

Another update: when FTL crashed, it had 40,000 DNS requests... all from one IP.

AzureMarker commented 6 years ago

If you run pihole -t, are there lots of requests coming in? What device is sending so many requests? (sorry, was too late to get to the token before it expired)
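To find out which device is sending the most requests without watching the live tail, the query lines in /var/log/pihole.log can be counted per client. The following is a minimal sketch that assumes the log format shown in the excerpt earlier in this thread (`query[TYPE] domain from client`). The function name `top_clients` is hypothetical, and only the first two sample lines come from the thread; the rest are invented for the example.

```python
import re
from collections import Counter

# Matches dnsmasq query lines such as
#   "... query[A] www--crtm--es.accesible.inclusite.com from 192.168.1.35"
QUERY_RE = re.compile(r"query\[(?P<type>[^\]]+)\] (?P<domain>\S+) from (?P<client>\S+)")


def top_clients(log_lines, n=5):
    """Count query lines per client and return the n busiest as (client, count)."""
    counts = Counter()
    for line in log_lines:
        match = QUERY_RE.search(line)
        if match:
            counts[match.group("client")] += 1
    return counts.most_common(n)


SAMPLE_LOG = [
    # The first two lines are from the log excerpt in this thread.
    "Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 query[A] www--crtm--es.accesible.inclusite.com from 192.168.1.35",
    "Mar 2 17:20:19 dnsmasq[670]: 316133 192.168.1.35/61829 forwarded www--crtm--es.accesible.inclusite.com to 80.58.61.254",
    # Hypothetical lines, added only to have more than one client in the sample.
    "Mar 2 17:20:20 dnsmasq[670]: 316134 192.168.1.35/61830 query[A] example.com from 192.168.1.35",
    "Mar 2 17:20:20 dnsmasq[670]: 316135 192.168.1.20/50000 query[AAAA] example.org from 192.168.1.20",
    "Mar 2 17:20:21 dnsmasq[670]: 316136 192.168.1.35/61831 query[A] example.net from 192.168.1.35",
]

if __name__ == "__main__":
    for client, count in top_clients(SAMPLE_LOG):
        print(f"{client}: {count} queries")
```

On a real installation the same function can be fed the open log file directly, for example `with open("/var/log/pihole.log") as f: print(top_clients(f))`. Reading the log usually requires elevated permissions.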

nitschkecm commented 6 years ago

I have taken a deeper look: it is a device on my network that caused the huge number of DNS requests. I have updated that device and am watching it for the moment. However, it seems that once FTL gets too many requests it can crash.

rwwest77 commented 6 years ago

Was going to start a new issue but looks like this is the same.

We have a large network, thousands of devices. Millions of requests per day. Our FTL crashes about an hour into the day and the GUI is unusable until the end of the day when things settle down again.

So FTL does crash on many requests. We therefore loaded it on beefy hardware: a Dell PowerEdge with 24 cores and 144 GB RAM. Performance is no better, though.

Using htop, the pihole-FTL service will only use one of the 24 cores. It will max that one core out to 100%, but the other 23 sit idle.

How do we make it use multiple cores?

DL6ER commented 6 years ago

Using htop you can show thread names (see screenshot 2018-03-20 19-36-19).

With this we can see which thread is causing the issues. I think it will be logparser, and if that is the case then there isn't much we can do about it right now, as your queries are apparently coming in at too high a rate and the upstream servers are responding in too unstructured a manner. The situation should improve with FTL v4.0, which is in the beta testing phase, because we decouple from the log file and integrate into the DNS resolver directly.

Are you using FTL v3.0 (Pi-hole v3.3)? Did the issue also exist before v3.3?

We tested it on a 2-core, 16 GB node and saw it being able to handle several hundred million queries. However, all of them were generated from a single testing client and all of them were just dummy domains, so we didn't fully cover your use case (which is quite extreme, but should, of course, still work).

rwwest77 commented 6 years ago

We are using FTL 3.0 and Pi-hole 3.3. This is the only version we've ever used; we've only been using it for about a week.

It works great and does its job under heavy duress; just the GUI is unresponsive. All command line stuff still works.

I attached a screenshot of my htop.

Is it possible to make it use more than one core, at the moment?

[screenshot: 2018-03-20 at 2:53:24 pm]

DL6ER commented 6 years ago

> I attached a screenshot of my htop.

Press F5 to have the tree view (to see the individual threads and their names).

> Is it possible to make it use more than one core, at the moment?

No, this is not possible. Unfortunately, the log parsing is a non-parallelizable task.
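For anyone who wants the per-thread view without htop, the same information is available on Linux under /proc/<pid>/task/<tid>/stat. The sketch below is an illustration, not part of FTL: the function names are made up, and the sample stat line uses invented numbers with the thread name logparser mentioned above. The field positions follow the proc(5) manual page (utime is field 14, stime is field 15).

```python
import os


def parse_task_stat(stat_line):
    """Parse one /proc/<pid>/task/<tid>/stat line into (tid, name, cpu_ticks)."""
    # The thread name is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    tid = int(stat_line[:open_paren])
    name = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 1:].split()
    # fields[0] is the state (field 3 in proc(5)), so utime and stime
    # (fields 14 and 15) are at indexes 11 and 12.
    utime, stime = int(fields[11]), int(fields[12])
    return tid, name, utime + stime


def thread_cpu_ticks(pid):
    """Return [(tid, name, cpu_ticks)] for all threads of pid, busiest first."""
    task_dir = f"/proc/{pid}/task"
    result = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                result.append(parse_task_stat(f.read()))
        except FileNotFoundError:
            continue  # the thread exited while we were iterating
    return sorted(result, key=lambda entry: entry[2], reverse=True)


# Hypothetical stat line (invented values) for a thread named "logparser".
SAMPLE_STAT = "1234 (logparser) R 1 1234 1234 0 -1 4194304 100 0 0 0 5000 250 0 0 20 0 4 0 100 0 0"

if __name__ == "__main__":
    print(parse_task_stat(SAMPLE_STAT))
```

The tick counts are cumulative since the thread started, so to see which thread is busy right now, call `thread_cpu_ticks` twice a few seconds apart and compare the differences. `os.sysconf("SC_CLK_TCK")` gives the number of ticks per second for converting to CPU time.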

DL6ER commented 6 years ago

Closing due to age of issue. Feel free to re-open or create a new issue report if the problem persists.

viktak commented 6 years ago

I have the same issue as reported above. Using:
Pi-hole Version vDev (HEAD, v3.3.1-0-gfbee18e)
Web Interface Version vDev (HEAD, v3.3-0-ge48aa295)
FTL Version v3.0

I noticed that after a fresh start, for a few hours (sometimes for only a few minutes), one of the cores is at 100% (pihole-FTL). Then this issue goes away and it works as expected. After several hours, though, it loses connection with the API. As a result, none of the charts etc. work, although the actual blocking of blacklisted domains still works.

Is there any way to fix this? Or a way to go back to a previous version which was working?

AzureMarker commented 6 years ago

Run pihole -d for a debug token.

rwwest77 commented 6 years ago

It's the log parser falling behind due to too much traffic for that hardware.

We ran it in a large environment for a while (7,000 users). We noticed that when things got busy (during peak network hours) the log parser would fall behind. It only uses 1 core no matter how many you have, so that core would run at 100% while it was trying to catch up.

Eventually, if you just let it run, it does catch up once the network slows down. It's good that the DNS still works, just can't view charts and stuff during busy network times. You can also use CLI to do things when it's overloaded.

We eventually just accepted it as a limitation of the software and split the load across multiple servers. Hopefully a future version can multi-thread the log parser so we can run it on one big server instead of managing multiple instances.

AzureMarker commented 6 years ago

If you're on the latest development build, then there is no log parser; FTL runs during DNS resolution and can't fall behind.

The single core is mostly a limitation of dnsmasq. The API calls will use multiple cores, although there is some strict data locking in order to keep consistency (we hope to eventually allow concurrent reads, speeding up API calls).

viktak commented 6 years ago

This still doesn't explain why, after it caught up with the parsing and worked just fine for a few hours, sometimes days, it says on the dashboard that it lost connection with the API. Anyway, I restarted it this morning; it was at 100% for a while, and it's been running OK since. When it loses connection with the API, I'll generate a debug log. Also, I will try to use the latest development build as @Mcat12 recommended.

Thanks to all of you for the quick replies!

viktak commented 6 years ago

I forgot to mention that this is my home office setup, where I have fewer than 10 computers and maybe 10 or so IoT devices. The total number of queries per day is usually under one million.

viktak commented 6 years ago

Overnight at some stage it got disconnected again. This is the debug token: ikzmuu065k

AzureMarker commented 6 years ago

You don't have IPv6 connectivity, but you enabled it in Pi-hole. Run pihole -r to reconfigure without IPv6.

Did you at some point run pihole checkout core v3.3.1 and a similar command for web? You should run pihole checkout master to receive updates and possibly fix some issues you are seeing.

If it still doesn't work, share your full FTL log /var/log/pihole-FTL.log

viktak commented 6 years ago

@Mcat12 I don't recall issuing a similar command; I have been using it with (mostly) default settings. However, when I issued the command pihole checkout master I got a ton of error messages, so I uninstalled the whole thing using pihole uninstall and installed it again with curl -sSL https://install.pi-hole.net | bash. Then I also ran pihole checkout master and switched off IPv6 as suggested. Now everything works fine; I will see in a day or so whether it stays like this.

Thanks for your help!