lixingcong / dnsmasq-regex

dnsmasq with regex match module(libpcre v8.45, the older version)

regex can't check URI schemes, default nameserver (#) was broken #8

Closed leodexe closed 1 year ago

leodexe commented 1 year ago

I have set up a list of allowed domains, which previously worked on official dnsmasq, like this:

[dnsmasq.conf]
# AdGuard IPv4/IPv6 nameservers
server=94.140.14.15
server=94.140.15.16
server=2a10:50c0::bad1:ff
server=2a10:50c0::bad2:ff

[blocklist.conf]
address=/*/#
server=/*.edu/#
server=/*.io/#
server=/*.org/#

The first address line blocks everything, so every domain that does not end in .edu, .io or .org has to be unblocked individually, which is how I have configured my dnsmasq blocklist, as you will see below. Normally an entry looks like this: server=/*abc.com/#

Non-regex entries like the three server lines after the address line work fine. However, when I use the regex syntax, the hash symbol (#) that normally redirects to the nameservers I previously configured stops working, along with dnsmasq's built-in subdomain wildcard (*) support, which would accurately unblock both abc.com and its subdomains (*.abc.com) while keeping look-alike names such as aabc.com and abcabc.org blocked.

According to the Perl syntax described here, the [^...] construct should match any character that is NOT within the brackets. I need this to make the pattern more specific and thus avoid matching the look-alike names above; see this Squid ERE regex syntax for example: ^.*\.?[^a-zA-Z0-9](keyword1|keyword2)\.[a-zA-Z]{2,}.*$

This Squid url_regex pattern blocks every keyword that's included in the pattern. The dnsmasq-regex equivalent, which uses PCRE syntax, should look something like this: server=/:.*[.]?[^\w]abc[.]:/#

However, since the hash symbol (#) stops working properly, as previously stated, I have to manually specify the nameservers I already configured. That is quite cumbersome, as I have many domains that redirect to the default nameserver: server=/:.*[.]?[^\w]abc[.]:/1.1.1.3

This does not work as expected. Removing the [^\w] from the pattern gives the closest thing to a functional pattern, but it also produces unintended matches like aaaaaaaaaabc.com, which is why [^\w] must be there to match only the specified domain.
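A quick way to see why the `[^\w]` version misbehaves is to run it against bare host names. The sketch below uses Python's `re` module as a stand-in for libpcre (an assumption: the two engines agree on this simple syntax, but they are not the same library), and it assumes dnsmasq-regex searches the queried name without implicit anchors. The `matches` helper is hypothetical and exists only for this illustration.

```python
import re

# Pattern from the server= line above, without the surrounding ":" delimiters.
PATTERN = re.compile(r".*[.]?[^\w]abc[.]")


def matches(name: str) -> bool:
    """Return True if the pattern is found anywhere in the host name."""
    return PATTERN.search(name) is not None


for name in ["abc.com", "www.abc.com", "aabc.com"]:
    print(name, matches(name))
# abc.com False
# www.abc.com True
# aabc.com False
```

`[^\w]` has to consume one character, so the bare name abc.com gives it nothing to consume and never matches. Only names with a non-word character (in practice a dot) directly before abc are matched.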

I have compiled dnsmasq-regex with the options that are enabled by default in the official dnsmasq package, which are:

#define HAVE_DBUS
#define HAVE_CONNTRACK
#define HAVE_IDN
#define HAVE_LIBIDN2
#define HAVE_NFTSET
#define HAVE_DNSSEC

Without HAVE_DBUS enabled, the newly compiled dnsmasq-regex completely fails to launch. The others are there because they already come with standard dnsmasq, so I added them back as a precaution.

Last but not least, check this

lixingcong commented 1 year ago

Please provide the git commit and your full config file, and run dnsmasq with -q -d to see what happens.

For example:

Commit https://github.com/lixingcong/dnsmasq-regex/commit/c40a35d921bcb7ba027fdd3e182156bf160ca09a

# config file is /tmp/dnsmasq-example.conf
port=30000
server=1.1.1.1
server=/\.edu/8.8.8.8

Run command

./dnsmasq/src/dnsmasq -C /tmp/dnsmasq-example.conf -q -d

Run dig from another terminal

dig @localhost -p 30000 1.edu

Post your dnsmasq log

dnsmasq: started, version 2.87 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt no-DBus no-UBus no-i18n regex no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset no-nftset auth no-cryptohash no-DNSSEC loop-detect inotify dumpfile
dnsmasq: using nameserver 1.1.1.1#53
dnsmasq: query[A] 1.edu from 127.0.0.1
dnsmasq: forwarded 1.edu to 8.8.8.8
dnsmasq: reply 1.edu is NXDOMAIN

leodexe commented 1 year ago

Commit dab0490

/etc/dnsmasq.conf

clear-on-reload

# Never forward plain names (without a dot or domain part)
domain-needed

# Never forward addresses in the non-routed address spaces.
bogus-priv

# Don't read /etc/resolv.conf
no-resolv

# Don't poll /etc/resolv.conf
no-poll

# Nameservers
# AdGuard
server=94.140.14.15 #IPv4
server=94.140.15.16 #IPv4
server=2a10:50c0::bad1:ff #IPv6
server=2a10:50c0::bad2:ff #IPv6

# If you don't want dnsmasq to read /etc/hosts, uncomment this
no-hosts

# Set the cachesize here.
cache-size=1000

# Include another lot of configuration options.
conf-dir=/etc/dnsmasq.d

# Include all the files in a directory except those ending in .bak
conf-dir=/etc/dnsmasq.d,.bak

# Include all files in a directory which end in .conf
conf-dir=/etc/dnsmasq.d/,*.conf

/etc/dnsmasq.d/blocklist.conf

#Blacklist ALL (Redirects to 0.0.0.0 if not whitelisted)
address=/*/#

# Whitelist any domain that ends in .edu, or .io, or .org
server=/*.edu/#
server=/*.io/#
server=/*.org/#

# Non-regex (Regular) domains works properly with (#)
server=/*github.com/#
server=/*google.com/#
server=/*pkgbuild.com/#
server=/*youtube.com/#
# server=/... .../# (the list goes on)

# Regex domains, but closing with a (#) forwards an invalid address.
# as if they weren't actually whitelisted, 0.0.0.0 is the default
# return address for domains that aren't explicitly whitelisted
# so I have to manually put nameserver, 1.1.1.3 in this example.
server=/:.*[.]?abc[.]:/1.1.1.3
server=/:.*[.]?discord[.]:/#
server=/:.*[.]?msn[.]:/1.1.1.3
server=/:.*[.]?yahoo[.]:/#
# server=/... .../1.1.1.3 (the list goes on)

Here is my dnsmasq log:

[root@nitrolulz ~]# ./d2/dnsmasq/src/dnsmasq -C /etc/dnsmasq-regex.conf -q -d
dnsmasq: started, version 2.87 cachesize 1000
dnsmasq: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n regex(+ipset) IDN2 DHCP DHCPv6 no-Lua TFTP conntrack ipset nftset auth cryptohash DNSSEC loop-detect inotify dumpfile
dnsmasq: using nameserver 94.140.14.15#53
dnsmasq: using nameserver 94.140.15.16#53
dnsmasq: using nameserver 2a10:50c0::bad1:ff#53
dnsmasq: using nameserver 2a10:50c0::bad2:ff#53
dnsmasq: using nameserver 1.1.1.3#53 for regex domain .*[.]?abc[.] 
dnsmasq: using nameserver 1.1.1.3#53 for regex domain .*[.]?msn[.] 
dnsmasq: using standard nameservers for .*[.]?yahoo[.]
dnsmasq: using standard nameservers for .*[.]?discord[.]
dnsmasq: using standard nameservers for youtube.com
dnsmasq: using standard nameservers for pkgbuild.com
dnsmasq: using standard nameservers for google.com
dnsmasq: using standard nameservers for github.com
dnsmasq: using standard nameservers for .org
dnsmasq: using standard nameservers for .io
dnsmasq: using standard nameservers for .edu
dnsmasq: cleared cache
dnsmasq: query[AAAA] www.youtube.com from ::1
dnsmasq: forwarded www.youtube.com to 94.140.14.15
dnsmasq: forwarded www.youtube.com to 94.140.15.16
dnsmasq: forwarded www.youtube.com to 2a10:50c0::bad1:ff
dnsmasq: forwarded www.youtube.com to 2a10:50c0::bad2:ff
dnsmasq: query[A] abc.com from ::1
dnsmasq: forwarded abc.com to 1.1.1.3
dnsmasq: query[AAAA] abc.com from ::1
dnsmasq: forwarded abc.com to 1.1.1.3
dnsmasq: reply abc.com is NODATA-IPv6
dnsmasq: reply abc.com is 18.155.1.93
dnsmasq: reply abc.com is 18.155.1.91
dnsmasq: reply abc.com is 18.155.1.106
dnsmasq: reply abc.com is 18.155.1.2
dnsmasq: query[AAAA] discord.com from ::1
dnsmasq: config discord.com is ::
dnsmasq: query[A] discord.com from ::1
dnsmasq: config discord.com is 0.0.0.0

From what I'm seeing, when the hash symbol (#) is present in a regex domain, it breaks the syntax, so the entry is read as a regular domain instead. Because this now-incomplete domain doesn't actually match anything, the query returns 0.0.0.0, which is the expected address for non-whitelisted domains.

Standard domains support (#), but regex domains do not, as you can see in the log.

Also, is the [^...] pattern actually supported? I have tried [^\w] but it doesn't seem to work, and without it the pattern returns unintended matches unrelated to the specified keyword.

For example: server=/:.*[.]?yahoo[.]:/1.1.1.1 will match ayahoo. , nzyahoo. and so on, as long as there is a "yahoo" somewhere in the name.

How would you solve this so it only matches yahoo.* specifically?
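One possible approach, sketched here rather than tested against dnsmasq-regex itself, is to require that the keyword sits at the start of the name or directly after a dot, using the alternation `(^|[.])`. The example uses Python's `re` module as a stand-in for libpcre and assumes the pattern is searched against the bare host name. The names `LOOSE`, `STRICT` and `hits` are hypothetical.

```python
import re

# The pattern from the question: matches any name containing "yahoo.".
LOOSE = re.compile(r".*[.]?yahoo[.]")
# Candidate replacement: "yahoo" must be a whole label.
STRICT = re.compile(r"(^|[.])yahoo[.]")


def hits(pattern: re.Pattern, name: str) -> bool:
    """Return True if the pattern is found anywhere in the host name."""
    return pattern.search(name) is not None


for name in ["yahoo.com", "mail.yahoo.com", "ayahoo.com", "nzyahoo.net"]:
    print(name, hits(LOOSE, name), hits(STRICT, name))
# yahoo.com True True
# mail.yahoo.com True True
# ayahoo.com True False
# nzyahoo.net True False
```

Unlike `[^\w]`, the `^` branch consumes no character, so the bare name yahoo.com still matches. The sketch does not anchor what comes after `yahoo.`, so the TLD part would need its own constraint if that matters.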

lixingcong commented 1 year ago

This may be fixed by https://github.com/lixingcong/dnsmasq-regex/commit/6068f8373ac1994bfca4f38e02a363b40d7dcef3, please test again.

leodexe commented 1 year ago

Yeah, https://github.com/lixingcong/dnsmasq-regex/commit/6068f8373ac1994bfca4f38e02a363b40d7dcef3 fixes regex domain functionality with #

As for the other issue, I've tested many patterns and eventually figured out that the regex doesn't evaluate the URI as a whole. Instead, it starts the pattern at the domain, completely omitting the URI scheme (http[s]://, ftp://, magnet://, etc.). As a result, domains that lack a subdomain (such as www., mobile., media., static., touch., etc.) will trigger a match, whereas the same domain with a subdomain will not, despite both containing the same keyword.

For example:

address=/:^[\w]*[.]?abc[.][a-z]{2,}$:/#

where "abc" is the keyword for the domain name.

Will not match:
1. https://www.aabc.com/
2. https://www.edabc.com/

But will match:
3. http://aabc.org/
4. https://edabc.org/

The first and second domains don't match the keyword "abc", so they don't get blocked. But the third and fourth domains don't match the keyword "abc" either, yet they get blocked anyway.

When there is no subdomain, the behavior changes and the pattern triggers matches that normally would not happen, because it is currently unable to check anything before the subdomain part.

Could you address this somehow? The pattern should start checking from the very beginning, which is the URI scheme itself. That would let the user evaluate the URI scheme before the subdomain, making patterns more specific and hopefully fixing the unintended matches.
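The four cases above can be reproduced outside dnsmasq. The sketch below is an approximation under two assumptions: Python's `re` module stands in for libpcre, and dnsmasq-regex is given only the host name of each URL. The helpers `host_of` and `is_match` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Pattern from the address= line above, without the ":" delimiters.
PATTERN = re.compile(r"^[\w]*[.]?abc[.][a-z]{2,}$")

URLS = [
    "https://www.aabc.com/",
    "https://www.edabc.com/",
    "http://aabc.org/",
    "https://edabc.org/",
]


def host_of(url: str) -> str:
    """Extract the host name, which is all a DNS lookup is made for."""
    return urlsplit(url).hostname


def is_match(url: str) -> bool:
    """Apply the pattern to the host name only."""
    return PATTERN.search(host_of(url)) is not None


for url in URLS:
    print(host_of(url), is_match(url))
# www.aabc.com False
# www.edabc.com False
# aabc.org True
# edabc.org True
```

In this sketch the difference comes from `[\w]*` rather than from the missing scheme. With no dot in front, `[\w]*` absorbs the extra leading letters (the "a" in aabc, the "ed" in edabc) and "abc" still matches. With "www." in front, `[\w]*` cannot cross the dot, so "abc" would have to start right after it.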

lixingcong commented 1 year ago

Go to https://regexr.com/ and test your regex pattern with many test cases.

The problem you described may be caused by the 3rd-party lib. I can't do anything about debugging libpcre3.

leodexe commented 1 year ago

> Go to https://regexr.com/ and test your regex pattern with many test cases.
>
> The problem you described may be caused by the 3rd-party lib. I can't do anything about debugging libpcre3.

According to regexr: "PHP 8.1.0 and PCRE 10.37 2021-05-26 are used to execute your pattern on our server."

This expression: ^[\w]+[\W]+[\w]*[.]?[^\w]abc[.][a-z]{2,}$

matches the following:

http://abc.com
https://abc.net
http://www.abc.es
https://www.abc.def

and doesn't match everything else, which is exactly what I want.

That works on regexr, but it doesn't work on dnsmasq-regex for some reason.

It doesn't seem to evaluate the [^...] construct properly, because the pattern triggers a match on regexr but not on dnsmasq-regex. It is also unable to check the URI scheme, so the pattern ends up being like this: ^[\w]*[.]?abc[.][a-z]{2,}$

Would you say this is a bug?
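The gap between regexr and dnsmasq-regex can be explored by running the same expression against the full URL and against the host name alone. This is only a sketch: Python's `re` module stands in for both PCRE versions, and it assumes dnsmasq-regex receives the host name without the scheme. The helpers `match_url` and `match_host` are hypothetical.

```python
import re
from urllib.parse import urlsplit

PATTERN = re.compile(r"^[\w]+[\W]+[\w]*[.]?[^\w]abc[.][a-z]{2,}$")

URLS = [
    "http://abc.com",
    "https://abc.net",
    "http://www.abc.es",
    "https://www.abc.def",
]


def match_url(url: str) -> bool:
    """Test the pattern against the whole URL, as on regexr."""
    return PATTERN.search(url) is not None


def match_host(url: str) -> bool:
    """Test the pattern against the host name only."""
    return PATTERN.search(urlsplit(url).hostname) is not None


for url in URLS:
    print(url, match_url(url), match_host(url))
# http://abc.com True False
# https://abc.net True False
# http://www.abc.es True False
# https://www.abc.def True False
```

The leading `[\w]+[\W]+` is satisfied by the scheme and "://", and `[^\w]` by the "/" or "." before abc. Once the scheme is gone there are not enough non-word characters left in the name, so under these assumptions the expression cannot match any of the four host names, whichever PCRE version is used.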

lixingcong commented 1 year ago

This project is built with libpcre3-dev (the older version of PCRE, possibly 8.39; refer to the Debian apt repo).

According to https://www.pcre.org:

There are two major versions of the PCRE library. The current version, PCRE2, now at version 10.39. The older, but still widely deployed PCRE library, originally released in 1997, is at version 8.45.

I have no idea what the differences between those versions are. I just apply the dnsmasq patches from the mailing list.

If you prefer adapting to the latest PCRE, implement your own modified version. Any pull requests are welcome.

leodexe commented 1 year ago

It's been a while. I've been investigating more, and it seems that dnsmasq, both regex and non-regex, is unable to read the full URL because HTTPS is encrypted: when it tries to read the URL, it can only retrieve the domain portion, such as www.example.com, which possibly causes some regex patterns to not work as expected. Fully reading HTTPS URLs requires decryption via SSL bumping, a feature dnsmasq lacks but Squid Proxy has.

Meanwhile, I fixed the older expression I used on my first attempts:

> This expression: ^[\w]+[\W]+[\w]*[.]?[^\w]abc[.][a-z]{2,}$ ... It doesn't seem to evaluate properly the [^...] metachar, because it triggers a match on regexr, but not on dnsmasq-regex, and is also unable to check the URI scheme, so it ends up being like this: ^[\w]*[.]?abc[.][a-z]{2,}$

Previously, that didn't work well, because there was still one more case that wasn't covered.

But now the following pattern:

^([\w]*\.?[\W]|)example(\.[a-zA-Z]{2,})+$
# Will match both example.anything and subdomain.example.anything
# As long as the TLD is at least two alphabetic characters long
# Matches: sub.example.any, other.example.any.thing

# Note: dnsmasq understands named classes, so this expression is nearly equivalent to the previous one
# (\w also matches the underscore, which [[:alnum:]] does not):
^([[:alnum:]]*\.?[^[:alnum:]]|)example(\.[[:alpha:]]{2,})+$

Now names with repeated characters like eeeeexample.com are successfully blocked when there is no subdomain. With this expression, all cases are properly covered, which is the behavior I wanted but couldn't replicate in my earlier attempts. It works as expected: it filters only the matched keyword and nothing more.
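The behavior described here can be checked with a small table of names. As before, this is a sketch that uses Python's `re` module in place of libpcre and assumes the pattern is applied to the bare host name. It covers only the `\w` form, since Python's `re` does not support the `[[:alnum:]]` named classes. `CASES` and `matches` are hypothetical names.

```python
import re

PATTERN = re.compile(r"^([\w]*\.?[\W]|)example(\.[a-zA-Z]{2,})+$")

# Host name -> whether the pattern is expected to match it.
CASES = {
    "example.com": True,
    "sub.example.any": True,
    "other.example.any.thing": True,
    "eeeeexample.com": False,
    "notexample.org": False,
    "example.c": False,       # TLD shorter than two letters
    "a.b.example.com": False,  # more than one label before the keyword
}


def matches(name: str) -> bool:
    """Return True if the whole host name satisfies the pattern."""
    return PATTERN.search(name) is not None


for name, expected in CASES.items():
    print(name, matches(name), "ok" if matches(name) == expected else "unexpected")
```

The last case shows one limit of this sketch: `[\w]*` cannot cross a dot, so only a single label in front of the keyword is accepted. Names with deeper nesting would need the prefix part to repeat.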

However, dnsmasq still SHOULD be able to fully read unencrypted URLs (http without the s). Squid Proxy is able to fully read unencrypted HTTP, but dnsmasq cannot.

This could possibly be intended by design, because the Arch Linux Wiki guide on dnsmasq doesn't mention the protocol part in its Domain blocklisting section. As a lightweight caching server, it makes sense that dnsmasq doesn't have all the features that Squid Proxy has for blocking.
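On the design question, it may help to note that a DNS query carries only a host name. The scheme, path and query string of a URL are never sent to the resolver, for http and https alike, because the lookup happens before any connection to the web server is made. The sketch below illustrates which part of a URL reaches a DNS server; `dns_name` is a hypothetical helper and the URLs are made up for the example.

```python
from urllib.parse import urlsplit


def dns_name(url: str) -> str:
    """Return the host name a client would look up for this URL."""
    return urlsplit(url).hostname


for url in [
    "http://www.example.com/page?id=1",
    "https://www.example.com/page?id=1",
]:
    print(url, "->", dns_name(url))
# http://www.example.com/page?id=1 -> www.example.com
# https://www.example.com/page?id=1 -> www.example.com
```

Both URLs produce the same lookup, which is consistent with patterns in dnsmasq only ever seeing names like www.example.com. Filtering on scheme or path is the job of a proxy such as Squid, which sits in the HTTP path rather than the DNS path.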