Closed: david-kracht closed this issue 5 years ago
There is a regex on Stack Overflow by Andrew Domaszek (et al., building on mkyong's earlier answer) that successfully invalidates those non-domains. It also comes with an explanation of how it works.
+1 from me. It seems that my proposed regex is not general enough: it rejects g.co. I would go with the most general expression for filtering.
@stonedbovines Do you want to code that and submit a Pull Request with the regex changes?
I'm not convinced that the performance hit of validating (in my case) 140K domains, just to catch a few handfuls of rogue entries that shouldn't be there anyway, is worth it. Not using the block list responsible for them seems a better solution.
EDIT: [Ignore this. Was due to poor implementation.]
I timed the validation of my 140K domains on an otherwise idle pi3. It took over 35 minutes. Now I think that almost 67 validations per second is pretty damn good, but a 35 minute task is not the answer.
What? I can't believe a grep can take that long. My RPi 3 needs 13s to grep my first guess at a regex against 3.6 million domains. I'd allow for some overhead for additional text generation, but I would assume sub-minute time for a million entries. Don't know.
I was actually using call-outs to a perl script (passing a single domain each time) because, for my original purposes, I was only ever going to test one possible domain at a time.
So, yes, grep it as a whole and it became a sub-second process. Mea culpa!
With the way grep parses, you might even pick up some performance if you set the locale environment to C. A tip I just found on Stack Overflow mentions running in parallel if you have multiple cores. Take a look at https://stackoverflow.com/questions/9066609/fastest-possible-grep for a bunch of optimizations.
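As a rough illustration of both tips combined (the file and chunk names here are invented for the sketch, not anything Pi-hole actually uses):

```shell
# Sketch: validate a host list in parallel chunks with a PCRE-capable grep.
# LC_ALL=C disables multibyte locale handling, which speeds up matching;
# xargs -P runs several greps at once on multi-core machines.
RE='^[a-zA-Z0-9.-]+$'                     # deliberately crude validity check
printf 'a.example\nb.example\nbad_name\nc.example\n' > hosts.txt
split -l 2 hosts.txt chunk_               # two 2-line chunks: chunk_aa, chunk_ab
ls chunk_* | LC_ALL=C xargs -P 2 -I{} grep -P "$RE" {} > valid.txt
sort valid.txt                            # chunk completion order is nondeterministic
```

With real multi-million-line lists, larger chunk sizes (e.g. one chunk per core) keep the process-spawn overhead negligible.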
When manually adding a domain, we already perform domain validation. As for "invalid" domains pulled from a remote list, there's a couple of considerations:
First and foremost, the list maintainer should be making lists available in HOSTS format, which means domain validation as well as punycode conversion. If you notice an issue with a third party list, the maintainer should be informed about it!
If we need to compensate for ignorant list maintainers, the appropriate place to add a fix would be near this block of code, as Pi-hole's query functionality needs to be able to search for the correct domains from individual *.domains lists, as opposed to query only finding the "invalid" domains. By having the parsing workaround near this block of code, Pi-hole will only need to parse the one or two lists that have errors, instead of the entire gravity.list.
> the maintainer should be informed about it!
I totally agree, but the common user should have the option/information to see that there is something wrong with the list they added. Feedback cannot hurt if it shortens (and improves) the maintenance cycle for the blocklists. How should the "ordinary user" know about this issue? Most of them do not dig into the logs the way advanced users do.
> having the parsing workaround
My understanding of parsing is that it reflects some kind of syntax. And if there is a grammar for the important information (domain, IP, etc.), why not use it, given that the processing cost is negligible?
I will start contributing features like that, but I have only just gotten involved in the Pi-hole code. I will need some time to make my first pull requests.
Here is my gist demonstrating some code that will strip invalid host names, based off the link I provided above. It requires an input file of 'bare' host names, one per line.
It tests whether your system has a '-P'-capable version of grep, which runs 4x faster than the straight perl equivalent. If you don't have a capable version of grep (e.g. some(?) BSD versions), it falls back to the slower perl version (after testing that you have perl installed!). These checks were to make the script as portable as possible, while providing a faster option for systems that can use it.
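The approach described can be sketched roughly as follows (a reconstruction under assumptions, not the actual gist; the regex is a simplified RFC 1035-style check):

```shell
# filter_valid_hosts: print only syntactically valid host names from stdin
# (or from a file given as $1). Prefers grep -P, falls back to perl.
filter_valid_hosts() {
    # Labels: start/end alphanumeric, inner hyphens allowed, max 63 chars;
    # final label must be alphabetic and at least two characters.
    RE='^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$'
    if echo a.bc | grep -P "$RE" >/dev/null 2>&1; then
        # PCRE-capable grep is available; LC_ALL=C avoids locale overhead.
        LC_ALL=C grep -P "$RE" "${1:-/dev/stdin}"
    elif command -v perl >/dev/null 2>&1; then
        # Slower fallback for greps without -P (e.g. some BSD versions).
        perl -ne 'print if /^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$/' "${1:-/dev/stdin}"
    else
        echo "need grep -P or perl" >&2; return 1
    fi
}
```

The `echo a.bc | grep -P ...` line doubles as the capability probe: a grep without PCRE support exits non-zero, routing execution to the perl branch.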
As far as speed goes, I tested it on a pi3 against a 90MB file of 4.2 million host names (30 cat'ed versions of my actual list of 140K) and the grep version took 14.4s, with the perl version taking 1m0.7s, each producing the same output.
I'm not in a position to incorporate it into pihole myself, as I switched to 'raw dnsmasq' over a year ago, so I no longer have a development environment set up for it.
@stonedbovines What do the speeds look like on your Pi3 if you shadow the environment, say by adding LC_ALL=C at line 19 of your gist? I don't have any Pi available at the moment or I'd run it myself.
@dschaper New timings against the same 4.2 million line file:
- grep version: 8.8s
- perl version: 22.5s
Thanks. @WaLLy3K, what do you think about incorporating the grep version with the LC environment set inside the function, so it only masks the environment during the function's subshell? We could then consider setting a hard limit of 15s for the reload instead of the longer time we have now.
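For what it's worth, one way to mask the locale only for the duration of the call is to give the function a subshell body (the function name here is illustrative):

```shell
# Parentheses instead of braces make the function body a subshell,
# so the exported LC_ALL is discarded when the function returns
# and the caller's environment is left untouched.
validate_list() (
    export LC_ALL=C
    grep -P '^[a-zA-Z0-9.-]+$' "$1"
)
```

Alternatively, prefixing a single command (`LC_ALL=C grep ...`) scopes the override to that one invocation.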
And @stonedbovines if you'd like, I can set you up with a dev environment for a week or two with SSH access on a VPS node.
The more I've thought about this, the more I've decided that (IMO!) it is, at best, a partial solution to a non-issue.
If I try to ping any of those invalid hosts, Linux says 'unknown host ...' and Windows says 'could not find host ...', so I'm not going to reach them anyway. Invalid host names don't prevent dnsmasq from running successfully, and if dnsmasq doesn't care, I'm not sure I should.
It's code that we may be able to use in other places, or at least apply the concept to other functions. You have a point in that an invalid name will get an NXDOMAIN response, which is what generates the 'unknown host' messages. Where we could get caught is if a domain that is invalid at the time of list generation is later registered: it would then not be on the list of things to block. And yes, some ad content sites do dynamically register and remove domains. It's why you see 1---2-3-.domain.ad in a lot of things; such names are generated frequently and used to try to get around domain sinkholes. That's part of the reason behind wildcard blocking.
My perspective: it's not about being failsafe; in fact, dnsmasq has the logic to exclude the invalid domains from being kept. But we should, as I mentioned, take care of the blacklists and their quality. We should minimize the number of invalid and redundant domains, and of domains leading to an NXDOMAIN. Personal research shows that even the well-curated lists have 10-20% NXDOMAIN entries, which do not need to be blacklisted since they do not exist. I used massdns to query millions of domains, which helped a lot. But this is another topic...
Back to my concerns: having feedback about invalid (non-existent) domains is a good thing to establish a "curation cycle" and also helps the maintainers of a list assure its quality. So bigger is not better in the context of blacklists.
> Personal research shows that even the well-curated lists have 10-20% NXDOMAIN entries, which do not need to be blacklisted since they do not exist
I see no performance penalty in keeping maybe an extra 20%. If they are blacklisted they are also not forwarded (even if they would be returned as NXDOMAIN anyway). Just think about those countries where the government keeps a close eye on its citizens (e.g. a still-European island) and people would like to hide the fact that certain domains are queried at all. Furthermore, as @dschaper mentioned, a domain that is NXDOMAIN at the time the script is run may become active at any time.
I can confirm this behavior running the default lists.
Jan 14 12:19:27 dnsmasq[872]: bad name at /etc/pihole/gravity.list line 144889
Jan 14 12:19:27 dnsmasq[872]: bad name at /etc/pihole/gravity.list line 144890
ɢoogle.com is the offending domain:
[root@pihole ~]# echo ɢoogle.com | od -An -bc
311 242 157 157 147 154 145 056 143 157 155 012
311 242 o o g l e . c o m \n
[root@pihole ~]# echo google.com | od -An -bc
147 157 157 147 154 145 056 143 157 155 012
g o o g l e . c o m \n
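Entries like this can be spotted by scanning for bytes outside printable ASCII. In the C locale the bracket range below is byte-based, so the UTF-8 bytes of the small-capital-G homoglyph (octal 311 242, as in the od dump above) are flagged:

```shell
# Flag lines containing any byte outside printable ASCII (space..tilde).
# \311\242 are the UTF-8 bytes of U+0262, the homoglyph in the dump above.
printf '\311\242oogle.com\ngoogle.com\n' | LC_ALL=C grep -n '[^ -~]'
```

This catches homoglyph entries but would also flag legitimate internationalized domains, which is why lists should carry punycode rather than raw Unicode.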
Considering this has been open for more than a year and we have had no other posts/mentions of it, I'm of the mindset that this is intended behavior, and adding this would only increase processing and loading time while potentially causing issues where false NXDOMAINs are generated.
In raising this issue, I confirm the following:
How familiar are you with the source code relevant to this issue?: 5
Expected behaviour: Invalid domains should be rejected/ignored.
Actual behaviour: Invalid domains in block lists can cause errors with dnsmasq conf files.
Steps to reproduce: Insert an invalid domain into the blocklists and run gravity.sh. Reassessment of my twisted gravity.list; see the following logs from dnsmasq:
# sed -n '9p;10p;355079p;626527p;667778p;718960p;947709p;960111p;1014685p;1102943p' /etc/pihole/gravity.list
Debug token provided by uploading the pihole -d log: yqltpy1g7c
Troubleshooting undertaken, and/or other relevant information:
A regular expression, e.g. `(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.)+[a-zA-Z]+`, might be used to validate domains.
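Anchored for whole-line matching, the proposed expression behaves like this (the sample names are invented; note that it accepts short domains such as g.co):

```shell
# Apply the proposed validation regex; ^...$ anchors are added here so
# grep -P tests the entire line rather than any substring of it.
RE='^(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.)+[a-zA-Z]+$'
printf 'g.co\nexample.com\n-bad.example\nad..example\n' | grep -P "$RE"
```

The leading-hyphen label and the empty label are rejected, while both valid names pass.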
A more refined regex, i.e. `(\@\@)?(([\|]{0,2})(https?:\/\/)?(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.)+[a-zA-Z]+)($| |\^|\/\|)(?![^#\!]*\$)`, might simplify filtering for valid expressions in host files and adblock lists, too. I am investigating the improvement, but I have not found the final solution yet. Maybe I will open a feature request beyond the bug stated here, later.