Open forthrin opened 1 year ago
Bump
Blocklisting using /etc/hosts aims to inhibit resolving a certain set of domain names system-wide. As Resolv is an alternative to the system resolver, I think it is against the purpose of blocklisting for Resolv to ignore 0.0.0.0 entries in /etc/hosts. If it did so, end users would face a very confusing situation in which applications written in C, Go, or any other language respect blocklists, while Ruby apps don't.
For the specific use case where blocklist entries should be ignored, an instance of Resolv::Hosts with the patched behavior can be passed as an argument to Resolv.new().
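A minimal sketch of that wiring, under stated assumptions: the subclass name UnblockedHosts and the temporary demo file are hypothetical and not part of resolv. It hides 0.0.0.0 entries at lookup time only, so it shows how a customized Resolv::Hosts is passed to Resolv.new but does not reduce the time spent parsing the file.

```ruby
require 'resolv'
require 'tempfile'

# Hypothetical subclass (not part of resolv): hides 0.0.0.0 blocklist
# entries at lookup time while leaving the hosts file itself untouched.
class UnblockedHosts < Resolv::Hosts
  BLOCK_ADDRESS = '0.0.0.0'

  def each_address(name)
    # Forward every address except the blocklist placeholder.
    super(name) { |address| yield address unless address == BLOCK_ADDRESS }
  end
end

# Demo against a throwaway hosts file so /etc/hosts is not involved.
demo_hosts = Tempfile.new('hosts')
demo_hosts.write("127.0.0.1 localhost\n0.0.0.0 ads.test\n")
demo_hosts.flush

# Only the hosts resolver is listed here; appending Resolv::DNS.new to the
# array would let hidden names fall through to DNS instead.
resolver = Resolv.new([UnblockedHosts.new(demo_hosts.path)])
p resolver.getaddress('localhost')
p resolver.getaddresses('ads.test')
```

Because only the resolver instance handed to Resolv.new changes, other code in the same process that uses the default resolvers keeps honoring the blocklist.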
Generally, I wouldn't recommend putting such an enormous number of records into /etc/hosts, because the file must be read by every single process that performs hostname resolution (unless it is cached by something like nscd, which is not compatible with some programming languages, including Ruby apps using this library). Instead, you can set up a DNS server that caches hostname-address mappings in memory, such as dnsmasq.
To discuss performance we'd be happy to have some numbers. What is the environment? How poor is the current performance? How is it improved with this patch?
A reproducible benchmark would help us spot which part of the code is slow and optimize it.
I'll get back to you with numbers.
About dnsmasq:
/etc/hosts. Does that mean it takes over the job from the OS in handling this file, and does it much faster?
$ wc -l /etc/hosts
228858 /etc/hosts
$ git diff -U0
diff --git a/lib/resolv.rb b/lib/resolv.rb
index e36dbce..0356591 100644
--- a/lib/resolv.rb
+++ b/lib/resolv.rb
@@ -190,0 +191 @@ class Resolv
+ time = Time.now.to_f
@@ -205,0 +207 @@ class Resolv
+ printf "Took %.1fs\n", Time.now.to_f - time
Took 0.6s
- next unless addr
+ next if !addr || addr.start_with?('0')
Took 0.3s
$ grep -v \0.\0.\0.\0 < /etc/hosts > /etc/hosts # pretend this works :D
$ wc -l /etc/hosts
69 /etc/hosts
Took 0.0s
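The in-place grep above cannot work as written, because the shell truncates /etc/hosts before grep reads it. As a rough sketch of the same experiment that leaves the original file alone, the filtered entries can be written to a separate file and that path handed to Resolv::Hosts.new. The helper name filtered_hosts_copy and the generated demo input are assumptions for illustration, not part of resolv.

```ruby
require 'resolv'
require 'tempfile'

# Hypothetical helper (not part of resolv): copy a hosts file to a new
# Tempfile, dropping lines whose address field is 0.0.0.0.
def filtered_hosts_copy(source_path)
  copy = Tempfile.new('hosts-filtered')
  File.foreach(source_path) do |line|
    # Ignore trailing comments when looking at the address column.
    address = line.sub(/#.*/, '').split.first
    copy.write(line) unless address == '0.0.0.0'
  end
  copy.flush
  copy
end

# Demo input standing in for a large blocklist-style hosts file.
source = Tempfile.new('hosts')
source.write("127.0.0.1 localhost\n")
1000.times { |i| source.printf("0.0.0.0 %x.test\n", i) }
source.flush

filtered = filtered_hosts_copy(source.path)
resolver = Resolv.new([Resolv::Hosts.new(filtered.path)])
puts resolver.getaddress('localhost')
puts File.foreach(filtered.path).count
```

This only affects Ruby code that explicitly uses the filtered path; the system-wide blocklist stays in effect for everything else.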
PS! 0.0.0.0 entries are only useful for browsers etc. which connect to a plethora of unwanted servers. resolv is part of HTTPX used for dev projects with full control over what is contacted, thus blacklisting is unnecessary.
It looks quite a bit faster than "forever" :)
Self-contained benchmark:
require 'benchmark/ips'
require 'resolv'
require 'tempfile'

hosts = {
  small: 20,
  medium: 2000,
  large: 200000,
}.transform_values do |size|
  f = Tempfile.open('hosts')
  f.write("127.0.0.1 localhost\n")
  size.times do |i|
    f.printf("0.0.0.0 %x.test\n", i)
  end
  f.tap(&:flush)
end

Benchmark.ips do |x|
  x.warmup = 1
  x.time = 5
  hosts.each do |name, f|
    x.report(name) do
      Resolv.new([Resolv::Hosts.new(f.path)]).getaddress('localhost')
    end
  end
  x.compare!
end
% bundle exec ruby ./benchmark/hosts.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
Warming up --------------------------------------
               small     1.417k i/100ms
              medium    14.000  i/100ms
               large     1.000  i/100ms
Calculating -------------------------------------
               small     13.765k (± 5.6%) i/s -     69.433k in   5.061995s
              medium    147.119  (± 4.8%) i/s -    742.000  in   5.054971s
               large      0.515  (± 0.0%) i/s -      3.000  in   5.836935s

Comparison:
               small:    13765.1 i/s
              medium:      147.1 i/s - 93.56x slower
               large:        0.5 i/s - 26739.79x slower
PS! 0.0.0.0 entries are only useful for browsers etc. which connect to a plethora of unwanted servers. resolv is part of HTTPX used for dev projects with full control over what is contacted, thus blacklisting is unnecessary.
Resolv is a generic library that implements hostname resolution (it's not just a part of the httpx library; any application written in Ruby can use it), thus, IMO, it should be neutral about use cases. Also, because /etc/hosts is a system-wide setting, any application running on the system is expected to respect it by default.
So I think optimizing Resolv for a large /etc/hosts database is good, but changing its behavior in the suggested way is not desirable.
Agree, but a consistent half-second delay on all scripts using the library is unacceptably sluggish. (I think "forever" might also have been longer, but under other circumstances that escape my memory.)
See questions about dnsmasq. If that (or any other approach) can alleviate the need for swamping /etc/hosts with 200k+ entries, there is no problem here.
Initialization takes forever with a large hosts blacklist. Proposing the following patch: