ruby / resolv

A thread-aware DNS resolver library written in Ruby

Patch to skip blacklist entries in hosts file #38

Open forthrin opened 1 year ago

forthrin commented 1 year ago

Initialization takes forever with a large hosts blacklist. Proposing the following patch:

index 47c4ef6..4ae0dee 100644
--- a/lib/resolv.rb
+++ b/lib/resolv.rb
@@ -198 +198 @@ class Resolv
-              next unless addr
+              next if !addr || addr.start_with?('0')

forthrin commented 4 months ago

Bump

hanazuki commented 2 months ago

Blocklisting via /etc/hosts aims to inhibit resolution of a certain set of domain names system-wide. As Resolv is an alternative to the system resolver, I think it would defeat the purpose of blocklisting for Resolv to ignore 0.0.0.0 entries in /etc/hosts. If it did, end users would face a very confusing situation in which applications written in C, Go, or any other language respect blocklists, while Ruby apps don't.

For the specific use case where blocklist entries should be ignored, an instance of Resolv::Hosts with the patched behavior can be passed as an argument to Resolv.new().
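A minimal sketch of that approach, using only `Resolv::Hosts.new(path)` and `Resolv.new(resolvers)`. Rather than patching the parser, it writes a filtered copy of the hosts file and points `Resolv::Hosts` at the copy. The helper `filtered_hosts_file` and the sample entries are hypothetical and not part of resolv; the filter rule (skip addresses starting with "0") simply mirrors the proposed patch.

```ruby
require 'resolv'
require 'tempfile'

# Hypothetical helper (not part of resolv): copy a hosts file, dropping
# blank/comment-only lines and entries whose address starts with "0"
# (e.g. 0.0.0.0 blocklist lines). Returns the Tempfile holding the copy.
def filtered_hosts_file(source_path)
  filtered = Tempfile.new('hosts-filtered')
  File.foreach(source_path) do |line|
    addr = line.sub(/#.*/, '').split.first
    next if addr.nil? || addr.start_with?('0')
    filtered.write(line)
  end
  filtered.flush
  filtered
end

# Stand-in for a blocklist-heavy /etc/hosts, so the sketch does not
# depend on the real file. The entries are made up.
sample = Tempfile.new('hosts-sample')
sample.write("127.0.0.1 localhost\n0.0.0.0 ads.example.test\n")
sample.flush

filtered = filtered_hosts_file(sample.path)

# Only the hosts resolver is used here; a real script would probably
# pass Resolv::DNS.new as a second element of the array.
resolver = Resolv.new([Resolv::Hosts.new(filtered.path)])
puts resolver.getaddress('localhost')
```

The filtering pass still reads the full file once, so this only helps if the filtered copy is reused across lookups or runs.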

Generally, I wouldn't recommend putting such an enormous number of records into /etc/hosts, because the file must be read by every single process that performs hostname resolution (unless it is cached by something like nscd, which does not work with some programming languages, including Ruby apps using this library). Instead, you can set up a DNS server that caches hostname-to-address mappings in memory, such as dnsmasq.

hanazuki commented 2 months ago

To discuss performance we'd be happy to have some numbers. What is the environment? How poor is the current performance? How is it improved with this patch?

A reproducible benchmark would help us spot which part of the code is slow and optimize it.

forthrin commented 2 months ago

I'll get back to you with numbers.

About dnsmasq:

  1. It says it reads from /etc/hosts. Does that mean it takes over the job from the OS in handling this file, and does it much faster?
  2. What other vital, really noticeable benefits does running dnsmasq have? Is it really worth running?
  3. Is it possible to install it on the home router? Or does the home router have to support it from the factory? How does one manage the blacklist on a home router? Do you simply upload a flat file?
  4. Does dnsmasq log/count which blacklist entries are actually used, so you can throw out the unused ones after a month or so?

forthrin commented 2 months ago

$ wc -l /etc/hosts
  228858 /etc/hosts
$ git diff -U0
diff --git a/lib/resolv.rb b/lib/resolv.rb
index e36dbce..0356591 100644
--- a/lib/resolv.rb
+++ b/lib/resolv.rb
@@ -190,0 +191 @@ class Resolv
+        time = Time.now.to_f
@@ -205,0 +207 @@ class Resolv
+        printf "Took %.1fs\n", Time.now.to_f - time

Took 0.6s

-              next unless addr
+              next if !addr || addr.start_with?('0')

Took 0.3s

$ grep -v \0.\0.\0.\0 < /etc/hosts > /etc/hosts # pretend this works :D
$ wc -l /etc/hosts
      69 /etc/hosts

Took 0.0s
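
The same kind of measurement can be taken without patching `lib/resolv.rb`. The sketch below times the construction of a `Resolv::Hosts` plus one lookup, which forces the file to be parsed. The helper name `hosts_load_seconds` and the generated sample file are assumptions for illustration; pointing the helper at `/etc/hosts` instead should measure the real file.

```ruby
require 'resolv'
require 'tempfile'

# Hypothetical helper: seconds taken to build a Resolv::Hosts for `path`
# and perform one lookup, which triggers parsing of the whole file.
def hosts_load_seconds(path)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  begin
    Resolv::Hosts.new(path).getaddress('localhost')
  rescue Resolv::ResolvError
    # 'localhost' is not listed; the file was still parsed.
  end
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Generated stand-in for a blocklist-heavy hosts file (entries are made up).
sample = Tempfile.new('hosts-sample')
sample.write("127.0.0.1 localhost\n")
5_000.times { |i| sample.write("0.0.0.0 host#{i}.test\n") }
sample.flush

elapsed = hosts_load_seconds(sample.path)
printf("Took %.3fs\n", elapsed)
```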

forthrin commented 2 months ago

PS! 0.0.0.0 entries are only useful for browsers etc. which connect to a plethora of unwanted servers. resolv is part of HTTPX, used for dev projects with full control over what is contacted, thus blacklisting is unnecessary.

hanazuki commented 2 months ago

That looks quite a bit faster than "forever" :)

Self-contained benchmark:

require 'benchmark/ips'
require 'resolv'
require 'tempfile'

hosts = {
  small: 20,
  medium: 2000,
  large: 200000,
}.transform_values do |size|
  f = Tempfile.open('hosts')
  f.write("127.0.0.1 localhost\n")
  size.times do |i|
    f.printf("0.0.0.0 %x.test\n", i)
  end
  f.tap(&:flush)
end

Benchmark.ips do |x|
  x.warmup = 1
  x.time = 5

  hosts.each do |name, f|
    x.report(name) do
      Resolv.new([Resolv::Hosts.new(f.path)]).getaddress('localhost')
    end
  end

  x.compare!
end

% bundle exec ruby ./benchmark/hosts.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
Warming up --------------------------------------
               small     1.417k i/100ms
              medium    14.000 i/100ms
               large     1.000 i/100ms
Calculating -------------------------------------
               small     13.765k (± 5.6%) i/s -     69.433k in   5.061995s
              medium    147.119 (± 4.8%) i/s -    742.000 in   5.054971s
               large      0.515 (± 0.0%) i/s -      3.000 in   5.836935s

Comparison:
               small:    13765.1 i/s
              medium:      147.1 i/s - 93.56x  slower
               large:        0.5 i/s - 26739.79x  slower

hanazuki commented 2 months ago

> PS! 0.0.0.0 entries are only useful for browsers etc. which connect to a plethora of unwanted servers. resolv is part of HTTPX, used for dev projects with full control over what is contacted, thus blacklisting is unnecessary.

Resolv is a generic library that implements hostname resolution (it's not just a part of the httpx library; any application written in Ruby can use it), so, IMO, it should be neutral about use cases. Also, because /etc/hosts is a system-wide setting, any application running on the system is expected to respect it by default.

So I think optimizing Resolv for a large /etc/hosts database is good, but changing its behavior in the suggested way is not desirable.
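
For context on where the cost sits, here is a rough sketch, assuming the lazy, per-instance parsing that current resolv uses: a `Resolv::Hosts` reads its file on the first lookup and reuses the parsed table afterwards, so the parse is paid once per instance rather than once per lookup. The `timed` helper and the generated sample file are hypothetical.

```ruby
require 'resolv'
require 'tempfile'

# Generated stand-in for a large blocklist (entries are made up).
sample = Tempfile.new('hosts-sample')
sample.write("127.0.0.1 localhost\n")
20_000.times { |i| sample.write("0.0.0.0 host#{i}.test\n") }
sample.flush

# Hypothetical helper: returns [block result, elapsed seconds].
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

hosts = Resolv::Hosts.new(sample.path)

# The first lookup parses the file; the second reuses the parsed table.
first_addr, first_secs = timed { hosts.getaddress('localhost') }
later_addr, later_secs = timed { hosts.getaddress('localhost') }

printf("first lookup: %.4fs, later lookup: %.6fs\n", first_secs, later_secs)
```

The benchmark above creates a new `Resolv::Hosts` in every iteration, so each iteration includes a full parse.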

forthrin commented 2 months ago

Agreed, but a consistent half-second delay in every script using the library is unacceptably sluggish. (I think "forever" may also have been longer, but under other circumstances that I can't recall.)

See my questions about dnsmasq above. If that (or any other approach) can remove the need to swamp /etc/hosts with 200k+ entries, there is no problem here.