instability issues

ragibkl commented 1 year ago

It seems like I get disconnected once a day, for 2 minutes. I don't know the cause yet, but I should look into it.

ragibkl commented 1 year ago

I did the following:

- Cleaned up the bind logs. Finally we can close #159.
- Added a retry mechanism to the ablc fetch: https://github.com/ragibkl/adblock-list-compiler/commit/dfccb98d2b9d98be874a6351f99fbcf073f31aaf

I thought that maybe the script was failing on some fetches, panicking, and causing the container to error out. I'm not sure that was the case, but I patched it anyway. Let's monitor for a few days.
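To illustrate the idea of retrying a flaky fetch, here is a minimal sketch of a generic retry helper. It is not the code from the linked commit: the `retry` function, its parameters, and the fixed delay between attempts are assumptions made for illustration only.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: calls `op` until it succeeds or `max_attempts` is reached.
/// `op` receives the 1-based attempt number and is always called at least once.
/// On success, returns the value together with the attempt that produced it;
/// on failure, returns the error from the last attempt.
fn retry<T, E, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<(T, u32), E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok((value, attempt)),
            // Out of attempts: give the last error back to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise wait a little and try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

The caller decides what to do with the final error, so a failed fetch does not have to become a panic.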

Then, I saw some logs:

dns_1           | Fetch ok: https://raw.githubusercontent.com/ragibkl/adblock-dns-server/master/data/overrides.d/ignore-whitelist.zone, attempt: 1
dns_1           | compiling adblock list... done!
dns_1           | writing output file:
dns_1           |     output file: /etc/bind/blacklist.zone
dns_1           |     output format: zone
dns_1           | writing output file: done!
dns_1           | updating blacklist complete
dns_1           | server reload successful
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'down'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'up'

The blacklist update runs in the background, but during the server reload the server can go down for a few seconds. I'm not sure if there's a way to do this with zero downtime.

ragibkl commented 1 year ago

Hmm, maybe I'm wrong. Looks like this can fail randomly at times:

dnsdist_1       | [logs] emptying log file
dnsdist_1       | [logs] emptying log file complete
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'down'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'up'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'down'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'up'
dnsdist_1       | [logs] emptying log file
dnsdist_1       | [logs] emptying log file complete
dnsdist_1       | [logs] emptying log file
dnsdist_1       | [logs] emptying log file complete
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'down'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'up'
dnsdist_1       | [logs] emptying log file
dnsdist_1       | [logs] emptying log file complete
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'down'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'up'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'down'
dnsdist_1       | Marking downstream resolver1 (127.0.0.1:1153) as 'up'
dnsdist_1       | [logs] emptying log file
dnsdist_1       | [logs] emptying log file complete
dnsdist_1       | [logs] emptying log file
dnsdist_1       | [logs] emptying log file complete

I wonder if the dns container is having trouble doing full recursive domain resolution. I don't really want to fall back to forwarders mode, as that would mean introducing DNS leaks again.
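To put a number on the flapping seen in the dnsdist log above, the 'down' markings can be counted per downstream. The sketch below assumes the log lines have exactly the shape shown in this thread. The function name `count_down_events` is hypothetical and not part of dnsdist or this repository.

```rust
use std::collections::BTreeMap;

/// Counts how often each downstream was marked 'down'.
/// Assumes lines shaped like the ones in this thread:
/// "dnsdist_1 | Marking downstream resolver1 (127.0.0.1:1153) as 'down'"
fn count_down_events<'a>(lines: impl IntoIterator<Item = &'a str>) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in lines {
        // Skip anything that is not a "Marking downstream" message.
        let Some((_, rest)) = line.split_once("Marking downstream ") else {
            continue;
        };
        // Only count transitions to 'down', not the recovery to 'up'.
        if !rest.trim_end().ends_with("as 'down'") {
            continue;
        }
        // The first word after the prefix is the downstream name.
        if let Some(name) = rest.split_whitespace().next() {
            *counts.entry(name.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

Running this over the logs for a day before and after a change gives a rough way to compare stability.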

ragibkl commented 1 year ago

At the moment, I'm convinced that the ablc fetch retry will fix this. The current theory is as follows:

1. Each hour, ablc runs the compile command.
2. It tries to fetch the config file. This can fail due to network issues.
3. When it does, the program panics and the whole container dies.
4. The server is down for a few minutes.
5. Docker restarts the container, but it takes a while to spin up.

Relevant lines: https://github.com/ragibkl/adblock-list-compiler/blob/82768f220c7ce143ca5a75b45fc2be113b869f71/src/cli_run/compile.rs#L35-L38
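If the theory above holds, a complementary approach to retrying would be to treat a failed fetch as a skipped update rather than a fatal error. The following is a sketch under that assumption and not the code at the linked lines. `update_cycle`, its arguments, and the in-memory `Vec<String>` blacklist are hypothetical stand-ins.

```rust
/// Hypothetical update step: replaces `current` only when `fetch` succeeds.
/// On failure it logs the error and leaves the previous blacklist in place,
/// so the process keeps running instead of panicking.
/// Returns true when the blacklist was updated.
fn update_cycle<F>(current: &mut Vec<String>, fetch: F) -> bool
where
    F: FnOnce() -> Result<Vec<String>, String>,
{
    match fetch() {
        Ok(fresh) => {
            *current = fresh;
            true
        }
        Err(err) => {
            eprintln!("fetch failed, keeping previous blacklist: {err}");
            false
        }
    }
}
```

With this shape, a network hiccup costs one stale update cycle instead of a container restart.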

We'll have to test for a few more days to see.

ragibkl commented 1 year ago

I made a couple of fixes.

- It seems that the sbc and oisd sources were not very stable to fetch: sometimes a fetch would fail, and other times it would work. This meant that every hour the compiled blacklist could change massively, which made the zone reload on the dns layer slow and disruptive. I have changed these sources to use the GitHub raw links instead, which makes this more stable.
- I noticed that at times the dns layer would work fine, but dnsdist would mark it as down. dnsdist talks to dns over the local network, on localhost:1153. There might be something finicky about using the Docker host network that causes this health check to fail sometimes. I've disabled the health check, and that seems to help.
- I believe the downtimes are mostly related to auto-updates. Previously auto-updates happened every hour, but maybe that's too aggressive. I'm changing this to every 6 hours to reduce the likelihood of interruptions. I might even make this every 24 hours in the future. Maybe.
- I had a theory that the cache settings on the dns layer were too low, so I've made some adjustments there. If I make more money in the future, I would love to bump the RAM of each server to about 8GiB, but I think this is fine for now.
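One way to check the point about unstable sources causing massive zone changes is to compare two consecutive compiled zone files and count how many entries were added or removed. This is a rough sketch only: `zone_churn` and `entries` are hypothetical helpers, and the code assumes one record per line with `;` starting a comment line.

```rust
use std::collections::HashSet;

/// Collects the non-empty, non-comment lines of a zone file as a set.
fn entries(zone: &str) -> HashSet<&str> {
    zone.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with(';'))
        .collect()
}

/// Returns (added, removed): the number of entries that appear only in `new`
/// and the number that appear only in `old`.
fn zone_churn(old: &str, new: &str) -> (usize, usize) {
    let (old, new) = (entries(old), entries(new));
    let added = new.difference(&old).count();
    let removed = old.difference(&new).count();
    (added, removed)
}
```

Large numbers from hour to hour would point to a source that intermittently fails to fetch. Small numbers would suggest the reload disruption has another cause.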

I'll keep monitoring for a few more days, but I do feel the stats are much better now.

ragibkl commented 1 year ago

This looks more stable now. I'm closing this.

ragibkl / adblock-dns-server

instability issues #172