rthalley / dnspython

a powerful DNS toolkit for python
http://www.dnspython.org

Lifetime parameter not being respected #1152

Open AGBell opened 4 days ago

AGBell commented 4 days ago

Example:

I've included just the relevant part; I of course have my nameservers and other options set.

    resolver = dns.asyncresolver.Resolver(configure=False)
    resolver.timeout = 0.05
    resolver.lifetime = 0.5

Lifetime is part of the Resolver class and timeout is part of the inherited BaseResolver class.

I get errors like these:

The resolution lifetime expired after 10.411 seconds: Server Doxx:xxx.xx.xxx.xxx@xxxx answered The DNS operation timed out after 0.000 seconds.

How does this go to 10.411 seconds if my lifetime is set to 0.5 seconds?

rthalley commented 4 days ago

Do you have a more complete working replication? And are your nameservers regular DNS over port 53 or some other protocol like DoH? When I test with this code:

    import asyncio
    import dns.asyncresolver

    res = dns.asyncresolver.Resolver()
    res.lifetime = 0.001

    async def run():
        await res.resolve("www.dnspython.org", "A")

    asyncio.run(run())

I see a traceback ending with

dns.resolver.LifetimeTimeout: The resolution lifetime expired after 0.102 seconds: Server Do53:127.0.0.1@53 answered The DNS operation timed out after 0.001 seconds

as expected, exceeding the lifetime very quickly.

rthalley commented 4 days ago

I also tested with trio, which also worked.

AGBell commented 4 days ago

Yeah, they are regular DNS over port 53; I do a NAPTR query. Here's some more information:

This is how I have it configured:

    import dns
    import dns.e164
    import dns.name
    import dns.asyncresolver
    import random

    dns_servers = {"xxx.xxx.xxx.xxx": 53, "yyy.yyy.yyy.yyy": 53, "zzz.zzz.zzz.zzz": 53}
    resolver = dns.asyncresolver.Resolver(configure=False)
    resolver.nameservers = list(dns_servers.keys())
    resolver.nameserver_ports = dns_servers
    resolver.edns = None
    resolver.cache = None
    resolver.timeout = 0.05
    resolver.lifetime = 0.5
    resolver.retry_servfail = True

Then I do a shuffle on the nameservers:

    random.shuffle(resolver.nameservers)

And lastly:

    await resolver.resolve(qname=dns.e164.from_e164(number, origin=dns.name.from_text(origin)), rdtype="NAPTR")

number is just a phone number and origin is e164.arpa.
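
For reference, dns.e164.from_e164 just builds the reversed-digit name under the given origin; roughly, with a made-up number:

    import dns.e164
    import dns.name

    origin = dns.name.from_text("e164.arpa")
    qname = dns.e164.from_e164("+15551234567", origin=origin)   # made-up number
    print(qname)   # 7.6.5.4.3.2.1.5.5.5.1.e164.arpa.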

Not sure how I could prove it, but I get errors like these, with timeouts way higher than the 0.5 lifetime I set, with this exact setup:

The resolution lifetime expired after 8.544 seconds: Server Doxx:xxx.xx.xxx.xxx@53 answered The DNS operation timed out after 0.000 seconds.

rthalley commented 3 days ago

OK, I'll dig deeper though it won't be until the weekend at least. I definitely believe you're seeing what you are reporting, I just haven't figured out how to see it myself yet.

Are you using asyncio or trio?

rthalley commented 3 days ago

Oh, and I missed that you are using Windows. I will give things a try on my Windows box, though it shouldn't matter :)

AGBell commented 3 days ago

> OK, I'll dig deeper though it won't be until the weekend at least. I definitely believe you're seeing what you are reporting, I just haven't figured out how to see it myself yet.
>
> Are you using asyncio or trio?

I'm using asyncio

AGBell commented 3 days ago

> Oh, and I missed that you are using Windows. I will give things a try on my Windows box, though it shouldn't matter :)

Sounds good, thank you!

Edit: sorry, I was testing stuff on my Windows machine, but these errors are happening on Red Hat Enterprise Linux 9.4 (Plow).

rthalley commented 3 days ago

I have a bit more info...

So, let's look at the exception in more detail. There are two levels going on there: the "resolution level", which covers all the attempts the resolver makes to resolve the query, and the individual query attempt level, which is basically the resolver making one call to dns.asyncquery.udp().

The first part is the resolution level:

The resolution lifetime expired after 8.544 seconds: 

Investigation reveals that the number reported here is, a bit confusingly, NOT the configured lifetime value, but rather the actual elapsed time from when dnspython started the query to the point where it got back to the top of the loop, considered sending another query, and discovered that there was no time left.

It then lists all the errors that occurred during the resolution, and there is only one:

Server Doxx:xxx.xx.xxx.xxx@53 answered The DNS operation timed out after 0.000 seconds.

Note the 0.000 seconds! This means that one of the two operations that have a timeout, the send and the receive of an individual query, timed out after waiting for no time at all, which implies that the whole lifetime had been consumed before the receive started.
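
A rough sketch of the arithmetic involved (the variable names here are illustrative, not dnspython internals): the resolution records an expiration time up front, and each attempt gets whichever is smaller, the per-server timeout or the time left before expiration, so a late wake-up yields a 0.000-second attempt:

    import time

    lifetime = 0.5                 # overall budget for the whole resolution
    per_server_timeout = 0.05      # budget for a single query attempt

    start = time.monotonic()
    expiration = start + lifetime

    time.sleep(8.5)                # stand-in for the loop keeping the task unscheduled

    now = time.monotonic()
    remaining = max(expiration - now, 0.0)                # already 0: we woke up too late
    attempt_timeout = min(per_server_timeout, remaining)
    print(f"attempt timeout: {attempt_timeout:.3f}")      # -> 0.000, as in the error message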

From what you've said so far, I can't tell whether this applies to you, but one thing that can happen if you (say) launch a zillion queries as tasks is that the launching and subsequent processing may take so long that by the time the loop gets back to some pending task, it has already expired.
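
A hypothetical sketch of that failure mode (the nameserver address, query names, and task count are made up, not taken from your setup):

    import asyncio

    import dns.asyncresolver
    import dns.resolver

    resolver = dns.asyncresolver.Resolver(configure=False)
    resolver.nameservers = ["192.0.2.1"]   # placeholder nameserver address
    resolver.lifetime = 0.5

    async def lookup(qname):
        try:
            return await resolver.resolve(qname, "A")
        except dns.resolver.LifetimeTimeout as exc:
            return exc

    async def main():
        # Creating tens of thousands of tasks at once means each task can sit
        # unscheduled for a long time between awaits; by the time a task runs
        # again after its send, its 0.5 s lifetime may already be gone.
        tasks = [asyncio.create_task(lookup(f"host{i}.example.")) for i in range(50_000)]
        await asyncio.gather(*tasks)

    asyncio.run(main())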

At any rate, so far I see no bugs in the code, and (in more detail) the data imply that:

  1. We computed some reasonable lifetime expiration in the future, presumably from what you configured. This time was definitely in the future when we computed it, because if the lifetime had expired right away we wouldn't have recorded a timeout error from the query attempt, as we wouldn't have made it.

  2. We tried to send a query and wait for a response. The send worked because asyncio sendto() doesn't block at the network level and thus has no timeout check. AFAIK context switching will still occur as our send wrapper is an async function. We then tried to wait for the response, which involves computing the remaining timeout before we start the I/O, and the answer we got was 0, implying that at least the lifetime had passed while we were context switched between sending and receiving.

  3. The actual blockage time was the 8.544 seconds.

Does any of this sound reasonable for how your program works? In particular, is there any way blocking for a long time could occur between the start of execution of a task and the next time it gets scheduled, due to a lot of other tasks or non-yielding CPU consumption?
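
If it helps, one way to check for that kind of stall (this is plain asyncio diagnostics, not anything dnspython-specific) is to run the loop in debug mode so it logs any task step that holds the loop too long:

    import asyncio
    import logging

    logging.basicConfig(level=logging.WARNING)

    async def main():
        loop = asyncio.get_running_loop()
        loop.slow_callback_duration = 0.05   # warn about anything holding the loop > 50 ms
        ...                                  # launch the NAPTR lookups here

    # debug=True makes asyncio log "Executing <Task ...> took N seconds" for slow steps
    asyncio.run(main(), debug=True)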

AGBell commented 1 day ago

Thank you for the detailed information. As for my use case:

To keep it short: I thought lifetime was a global timeout regardless of where the delay is coming from. I read it as meaning that if I set a lifetime of 0.5, I shouldn't have queries taking seconds, but rather something in the range of, say, 0.5 to 1 second. Is this something that can be added as a feature (a true max timeout), or am I better off implementing this around my code? Thanks.

rthalley commented 1 day ago

We are enforcing the limit as globally as possible, but we have to be able to run a little bit to return the timeout exception and to hide the async backend details. The problem is that we are being kept asleep for a long time by the asyncio runtime because there is too much happening. So even though we should be getting the TimeoutError exception from asyncio.wait_for(), we aren't, because we don't get to run for something like 8+ seconds.

I don't think wrapping this at a higher level will help, e.g. if you use asyncio.wait_for() directly, as the overload will still delay your getting the exception by a similar amount of time.
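
For concreteness, this is the kind of higher-level wrapping meant here (a sketch with an arbitrary 1.0 s timeout and a placeholder nameserver); the TimeoutError it raises is delivered by the same overloaded loop, so it would arrive just as late:

    import asyncio

    import dns.asyncresolver

    resolver = dns.asyncresolver.Resolver(configure=False)
    resolver.nameservers = ["192.0.2.1"]   # placeholder

    async def bounded_lookup(qname):
        # The outer 1.0 s timeout is enforced by a loop callback; if the loop is
        # starved, the cancellation and the TimeoutError are delivered late too.
        return await asyncio.wait_for(resolver.resolve(qname, "NAPTR"), timeout=1.0)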

Probably you should make fewer queries a second so that you don't get into a livelock situation where all of your queries are basically expiring from old age without actually doing any work.
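
One pattern that can help if the total query volume can't change (a sketch, not a dnspython feature; the nameserver and concurrency limit are placeholders) is to cap how many resolutions are in flight at once, so each one gets scheduled before its lifetime runs out:

    import asyncio

    import dns.asyncresolver

    resolver = dns.asyncresolver.Resolver(configure=False)
    resolver.nameservers = ["192.0.2.1"]   # placeholder
    resolver.lifetime = 0.5

    MAX_IN_FLIGHT = 200                    # illustrative value, tune for your load
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def lookup(qname):
        # Only MAX_IN_FLIGHT resolutions run at once, so each one gets scheduled
        # promptly instead of aging out while thousands of siblings hold the loop.
        async with sem:
            return await resolver.resolve(qname, "NAPTR")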

AGBell commented 21 hours ago

Understood, though I don't have the option of making fewer queries per second :)

Just as an aside, I tested using the non-async resolver and I'm still getting these.

Anyways, thanks for looking into this!