Closed: david-kracht closed this issue 5 years ago
There is a regex on Stack Overflow by Andrew Domaszek (et al., building on mkyong's earlier answer) that successfully invalidates those non-domains. It also comes with an explanation of how it works.
+1 from me. It seems that my proposed regex is not general enough: it rejects g.co. I would go with the most general expression for filtering.
@stonedbovines Do you want to code that and submit a Pull Request with the regex changes?
I'm not convinced that the performance hit of validating (in my case) 140K domains, just to catch a few handfuls of rogue entries that shouldn't be there anyway, is worth it. Not using the block list responsible for them seems a better solution.
EDIT: [Ignore this. Was due to poor implementation.]
I timed the validation of my 140K domains on an otherwise idle pi3. It took over 35 minutes. Now I think that almost 67 validations per second is pretty damn good, but a 35 minute task is not the answer.
What? I can't believe a grep can take that long. My RPi 3 needs 13s to grep my first guess at a regex against 3.6 million domains. I'd allow for some overhead for additional text generation, but I would assume sub-minute time for a million entries. Don't know.
I was actually using call-outs to a perl script (passing a single domain each time) because, for my original purposes, I was only ever going to test one possible domain at a time.
So, yes, grep it as a whole and it became a sub-second process. Mea culpa!
With the way grep parses, you might even pick up some performance if you set the locale environment to C. A tip I just found on Stack Overflow mentions running in parallel if you have multiple cores. Take a look at https://stackoverflow.com/questions/9066609/fastest-possible-grep for a bunch of optimizations.
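As a rough illustration of both tips combined (the file and chunk names here are invented for the sketch, not anything Pi-hole actually uses):

```shell
# Sketch: validate a host list in parallel chunks with a PCRE-capable grep.
# LC_ALL=C disables multibyte locale handling, which speeds up matching;
# xargs -P runs several greps at once on multi-core machines.
RE='^[a-zA-Z0-9.-]+$'                     # deliberately crude validity check
printf 'a.example\nb.example\nbad_name\nc.example\n' > hosts.txt
split -l 2 hosts.txt chunk_               # two 2-line chunks: chunk_aa, chunk_ab
ls chunk_* | LC_ALL=C xargs -P 2 -I{} grep -P "$RE" {} > valid.txt
sort valid.txt                            # chunk completion order is nondeterministic
```

With real multi-million-line lists, larger chunk sizes (e.g. one chunk per core) keep the process-spawn overhead negligible.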
When manually adding a domain, we already perform domain validation. As for "invalid" domains pulled from a remote list, there's a couple of considerations:
First and foremost, the list maintainer should be making lists available in HOSTS format, which means domain validation as well as punycode conversion. If you notice an issue with a third party list, the maintainer should be informed about it!
If we need to compensate for ignorant list maintainers, the appropriate place to add a fix would be near this block of code, as Pi-hole's query functionality needs to be able to search for the correct domains from individual *.domains lists, as opposed to query only finding the "invalid" domains. By having the parsing workaround near this block of code, Pi-hole will only need to parse the one or two lists that have errors, instead of the entire gravity.list.
> the maintainer should be informed about it!
I totally agree, but the common user should have the option/information to see that there is something wrong with the list they added. Feedback cannot hurt if it shortens (and improves) the maintenance cycle for the blocklists. How should the "ordinary user" know about this issue? Most of them do not dig into the logs the way advanced users do.
> having the parsing workaround
My understanding of parsing is that it reflects some kind of syntax. And if there is a grammar for the important information (domain, IP, etc.), why not use it, given that the processing cost is negligible?
I will start contributing features like that, but I have only just gotten involved in the Pi-hole code. I will need some time to make my first pull requests.
Here is my gist demonstrating some code that will strip invalid host names, based off the link I provided above. It requires an input file of 'bare' host names, one per line.
It tests whether your system has a '-P'-capable version of grep, which runs 4x faster than the straight perl equivalent. If you don't have a capable version of grep (e.g. some(?) BSD versions), it falls back to the slower perl version (after testing that you have perl installed!). These checks were to make the script as portable as possible, while providing a faster option for systems that can use it.
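The approach described can be sketched roughly as follows (a reconstruction under assumptions, not the actual gist; the regex is a simplified RFC 1035-style check):

```shell
# filter_valid_hosts: print only syntactically valid host names from stdin
# (or from a file given as $1). Prefers grep -P, falls back to perl.
filter_valid_hosts() {
    # Labels: start/end alphanumeric, inner hyphens allowed, max 63 chars;
    # final label must be alphabetic and at least two characters.
    RE='^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$'
    if echo a.bc | grep -P "$RE" >/dev/null 2>&1; then
        # PCRE-capable grep is available; LC_ALL=C avoids locale overhead.
        LC_ALL=C grep -P "$RE" "${1:-/dev/stdin}"
    elif command -v perl >/dev/null 2>&1; then
        # Slower fallback for greps without -P (e.g. some BSD versions).
        perl -ne 'print if /^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$/' "${1:-/dev/stdin}"
    else
        echo "need grep -P or perl" >&2; return 1
    fi
}
```

The `echo a.bc | grep -P ...` line doubles as the capability probe: a grep without PCRE support exits non-zero, routing execution to the perl branch.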
As far as speed goes, I tested it on a pi3 against a 90MB file of 4.2 million host names (30 cat'ed versions of my actual list of 140K) and the grep version took 14.4s, with the perl version taking 1m0.7s, each producing the same output.
I'm not in a position to incorporate it into pihole myself, as I switched to 'raw dnsmasq' over a year ago, so I no longer have a development environment set up for it.
@stonedbovines What do the speeds look like on your Pi3 if you shadow the environment, say by adding LC_ALL=C at line 19 of your gist? I don't have any Pi available at the moment or I'd run it myself.
@dschaper New timings against the same 4.2 million line file:
- grep version: 8.8s
- perl version: 22.5s
Thanks. @WaLLy3K, what do you think about incorporating the grep version with the LC environment set inside the function, so it only masks the environment during the function's subshell? We could then consider setting a hard limit of 15s for the reload instead of the longer time we have now.
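For what it's worth, one way to mask the locale only for the duration of the call is to give the function a subshell body (the function name here is illustrative):

```shell
# Parentheses instead of braces make the function body a subshell,
# so the exported LC_ALL is discarded when the function returns
# and the caller's environment is left untouched.
validate_list() (
    export LC_ALL=C
    grep -P '^[a-zA-Z0-9.-]+$' "$1"
)
```

Alternatively, prefixing a single command (`LC_ALL=C grep ...`) scopes the override to that one invocation.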
And @stonedbovines if you'd like, I can set you up with a dev environment for a week or two with SSH access on a VPS node.
The more I've thought about this, the more I've decided that (IMO!) it is, at best, a partial solution to a non-issue.
If I try to ping any of those invalid hosts, Linux says 'unknown host ...' and Windows says 'could not find host ...', so I'm not going to reach them anyway. Invalid host names don't prevent dnsmasq from running successfully, and if dnsmasq doesn't care, I'm not sure I should.
It's code that we may be able to use in other places, or at least apply the concept to other functions. You have a point in that an invalid name will get an NXDOMAIN response, which is what generates the 'unknown host' messages. Where we could get caught is if a domain that is invalid at the time of list generation is later registered: it would then not be on the list of things to block. And yes, some ad content sites do dynamically register and remove domains. It's why you see 1---2-3-.domain.ad in a lot of things; such names are generated frequently and used to try to get around domain sinkholes. That's part of the reason behind wildcard blocking.
My perspective: it's not about being failsafe; in fact, dnsmasq has the logic to exclude the invalid domains from being kept. But we should, as I mentioned, take care of the blacklists and their quality. We should minimize the number of invalid and redundant domains, and of domains leading to an NXDOMAIN. Personal research shows that even the well-curated lists have 10-20% NXDOMAIN entries, which do not need to be blacklisted since they do not exist. I used massdns to query millions of domains, which helped a lot. But this is another topic...
Back to my concerns: having feedback about invalid (non-existent) domains is a good thing to establish a "curation cycle" and also helps the maintainers of a list assure its quality. So bigger is not better in the context of blacklists.
> Personal research shows that even the well-curated lists have 10-20% NXDOMAIN entries, which do not need to be blacklisted since they do not exist
I see no performance penalty in keeping maybe an extra 20%. If they are blacklisted they are also not forwarded (even if they would be returned as NXDOMAIN anyway). Just think about those countries where the government keeps a close eye on its citizens (e.g. a still-European island) and people would like to hide the fact that certain domains are queried at all. Furthermore, as @dschaper mentioned, a domain that is NXDOMAIN at the time the script is run may become active at any time.
I can confirm this behavior running the default lists.
Jan 14 12:19:27 dnsmasq[872]: bad name at /etc/pihole/gravity.list line 144889
Jan 14 12:19:27 dnsmasq[872]: bad name at /etc/pihole/gravity.list line 144890
ɢoogle.com is the offending domain:
[root@pihole ~]# echo ɢoogle.com | od -An -bc
311 242 157 157 147 154 145 056 143 157 155 012
311 242 o o g l e . c o m \n
[root@pihole ~]# echo google.com | od -An -bc
147 157 157 147 154 145 056 143 157 155 012
g o o g l e . c o m \n
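Entries like this can be spotted by scanning for bytes outside printable ASCII. In the C locale the bracket range below is byte-based, so the UTF-8 bytes of the small-capital-G homoglyph (octal 311 242, as in the od dump above) are flagged:

```shell
# Flag lines containing any byte outside printable ASCII (space..tilde).
# \311\242 are the UTF-8 bytes of U+0262, the homoglyph in the dump above.
printf '\311\242oogle.com\ngoogle.com\n' | LC_ALL=C grep -n '[^ -~]'
```

This catches homoglyph entries but would also flag legitimate internationalized domains, which is why lists should carry punycode rather than raw Unicode.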
Considering this has been open for more than a year and we have had no other posts/mentions of it, I'm of the mindset that this is intended behavior, and adding this would only increase processing and loading time while potentially causing issues where false NXDOMAINs are generated.
In raising this issue, I confirm the following:
How familiar are you with the source code relevant to this issue?: 5
Expected behaviour: Invalid domains should be rejected/ignored.
Actual behaviour: Invalid domains in block lists can cause errors with dnsmasq conf files.
Steps to reproduce: Insert an invalid domain into the blocklists and run gravity.sh. Reassessment of my twisted gravity.list; see the following logs from dnsmasq:
# sed -n '9p;10p;355079p;626527p;667778p;718960p;947709p;960111p;1014685p;1102943p' /etc/pihole/gravity.list
Debug token provided by uploading the pihole -d log: yqltpy1g7c
Troubleshooting undertaken, and/or other relevant information:
A regular expression, e.g. `(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.)+[a-zA-Z]+`, might be used to validate domains.
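Anchored for whole-line matching, the proposed expression behaves like this (the sample names are invented; note that it accepts short domains such as g.co):

```shell
# Apply the proposed validation regex; ^...$ anchors are added here so
# grep -P tests the entire line rather than any substring of it.
RE='^(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.)+[a-zA-Z]+$'
printf 'g.co\nexample.com\n-bad.example\nad..example\n' | grep -P "$RE"
```

The leading-hyphen label and the empty label are rejected, while both valid names pass.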
A more refined regex, i.e. `(\@\@)?(([\|]{0,2})(https?:\/\/)?(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.)+[a-zA-Z]+)($| |\^|\/\|)(?![^#\!]*\$)`, might simplify filtering for valid expressions in host files and adblock lists, too. I am investigating the improvement, but I have not found the final solution yet. Maybe I will open a feature request beyond the bug stated here, later.