questdb / py-questdb-client

Python client for QuestDB InfluxDB Line Protocol
https://py-questdb-client.readthedocs.io
Apache License 2.0

HTTP timeouts not respected in flush #75

Open nicholas-a-guerra opened 7 months ago

nicholas-a-guerra commented 7 months ago

Using the QuestDB Python client version 2.0.0, it seems the default HTTP timeouts are not respected when the network connection degrades. The minimal reproducible code is as follows:

# Excerpt from a class: self.config, self.buffer_lock and logger (loguru) are set up elsewhere.
self.sender = Sender(Protocol.Http, self.config['IP'], self.config['HTTP_Port'], auto_flush=False)
self.sender.establish()
self.buffer = Buffer()

# ... later, in the flush loop:
success = False
try:
    self.flush_start = perf_counter()
    with self.buffer_lock:
        after_lock = perf_counter()
        self.sender.flush(self.buffer)
    success = True
except:
    logger.opt(exception=True).warning(f"QuestDB write error | {self.config['IP']}")
logger.debug(f"QuestDB flush | {self.config['IP']} | Success: {success} | Lock obtained: {after_lock-self.flush_start}s | Flush: {perf_counter()-after_lock}s")

For reference, there is another loop that feeds this buffer using the lock, but that is irrelevant to the problem I'm having, and I've printed the lock-acquire time to rule it out as an issue. The results show a growing flush time; for example, this is an exact output from my logs:

2024-03-30 12:26:52.741 EDT [DEBUG] | QuestDB_Logger:flush_loop:133 | QuestDB flush | kronus-nexus | Success: False | Lock obtained: 9.148032404482365e-06s | Flush: 822.764772413997s

You can see the lock acquisition takes almost no time, so that's not the problem, but the flush time keeps growing, here reaching 822 seconds. Here are the ping stats for reference during this time:

21 packets transmitted, 17 received, 19.0476% packet loss, time 20072ms
rtt min/avg/max/mdev = 59.412/318.782/639.467/145.486 ms

The expectation would be that, with the default HTTP configuration, a flush never takes longer than 10 seconds, no matter what. Currently this time increases over time. It may be growing in conjunction with the growing buffer, but that's difficult to know without knowledge of how the internals of flush work. It also seems that once the connection degrades and the buffer starts growing, all consecutive flush attempts just fail. If the buffer gets small enough and the connection gets slightly better, the flush finally succeeds as normal.

amunra commented 7 months ago

The timeout is on a per-request basis, not per-flush; during retries the timeout resets. Looking at your code, it seems you're trying to control the behaviour in more detail. If that's what you're after, you might want to disable retries and handle them explicitly. In writing this I've noticed the docs are slightly wrong: use 0 to disable the retry logic: https://py-questdb-client.readthedocs.io/en/latest/conf.html#http-request. You also mentioned threading and locks; I have a draft-phase PR that might be relevant to you, or that you might draw some ideas from, for achieving the best performance out of HTTP: https://github.com/questdb/py-questdb-client/pull/66.
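
For instance, a minimal sketch of disabling the built-in retries and retrying explicitly (assuming retry_timeout=0 disables retries as per the config docs linked above, that a failed flush raises IngressError, and that the buffer is left intact when flush raises; the host, port and retry policy are placeholders):

    from time import sleep
    from questdb.ingress import Buffer, IngressError, Protocol, Sender

    MAX_ATTEMPTS = 3  # placeholder retry policy, tune to taste

    sender = Sender(Protocol.Http, 'localhost', 9000,
                    retry_timeout=0,          # disable the client's built-in retries
                    request_timeout=10_000)   # cap each individual request at 10 s
    sender.establish()

    buffer = Buffer()
    # ... buffer.row(...) calls ...

    for attempt in range(MAX_ATTEMPTS):
        try:
            sender.flush(buffer)
            break
        except IngressError:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            sleep(2 ** attempt)  # simple backoff before our own retry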

nicholas-a-guerra commented 7 months ago

If the metrics are not specific to flush, then is there currently no control over how long a flush will hang? What is the current expectation? I'm seeing retries growing out of control into several minutes with default settings. Is this how you'd expect flush to behave? If this is expected, then I would think most users would have to disable retries, as several-minute hangs on a function call are never ideal. You mention disabling retries by using 0 as the input. How specifically would this work? What would my control mechanism be for the single timeout? For example, if I want it to try only once for 10 seconds, what exact parameters would I choose?

amunra commented 7 months ago

I'm on holiday at the moment. I'll be able to look into this in more detail once I'm back, if it's indeed a bug. Hopefully the pointers below will help in the meantime.

Context

Retry Timeout

For full reference on retries you can read the logic here: https://github.com/questdb/c-questdb-client/blob/94b3890e30a5adf10555c61172aaaf8222a8a4a4/questdb-rs/src/ingress/http.rs#L220. This was introduced with this change, which carries a pseudo-code description: https://github.com/questdb/c-questdb-client/issues/51

The short of it is that the retry timeout bounds for how long new retry attempts may still be started; each HTTP retry is a separate request, with its own request timeout.
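
Roughly, the semantics read like this (a paraphrase in Python of the linked Rust logic, not the actual implementation; the backoff details are illustrative):

    import random
    import time

    def flush_with_retries(send_request, retry_timeout_s, base_backoff_s=0.01):
        # The first attempt is always made and runs to its own per-request timeout.
        if send_request():
            return True
        # Retries may *start* only while the retry budget has not run out; a retry
        # started just before the deadline still runs to completion, which is why
        # the total wall time can exceed retry_timeout_s.
        deadline = time.monotonic() + retry_timeout_s
        backoff_s = base_backoff_s
        while time.monotonic() < deadline:
            time.sleep(min(backoff_s, max(0.0, deadline - time.monotonic())))
            if send_request():
                return True
            backoff_s = backoff_s * 2 + random.uniform(0.0, backoff_s)  # jittered backoff
        return False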

Request Timeout

The duration of the request itself is given by the parameters documented here: https://py-questdb-client.readthedocs.io/en/latest/conf.html#http-request i.e. request_timeout + (buffer_size / request_min_throughput) which is implemented here: https://github.com/questdb/c-questdb-client/blob/94b3890e30a5adf10555c61172aaaf8222a8a4a4/questdb-rs/src/ingress/mod.rs#L2567
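
Roughly, that calculation works out as follows (a sketch of the documented formula, not the library's code; it assumes request_min_throughput=0 disables the throughput-based extension):

    def effective_request_timeout_ms(request_timeout_ms: int,
                                     buffer_size_bytes: int,
                                     request_min_throughput_bps: int) -> int:
        # Per-request budget: fixed timeout plus time to push the buffer at the
        # configured minimum throughput (bytes per second).
        if request_min_throughput_bps == 0:
            return request_timeout_ms
        return request_timeout_ms + (buffer_size_bytes * 1000) // request_min_throughput_bps

    # Example: with request_timeout=10000 ms and the 100 KiB/s default throughput
    # mentioned below, a ~10 MB buffer gives roughly 10000 + 97656 ≈ 107656 ms
    # for a single request, before any retries.
    print(effective_request_timeout_ms(10_000, 10_000_000, 102_400))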

If this logic is not quite what you want, you'd need to write your own loop. It should work, though, and if you're encountering issues I'd appreciate details so I can replicate them with a test case and fix anything that isn't quite right.

Answers to your questions

If the metrics are not specific to flush then is there currently no control over how long a flush will hang?

There's no direct parameter for this. It's a function of the per-request timeout (request_timeout plus the throughput-based extension) and the retry_timeout.

I'm seeing retries growing out of control into several minutes with default settings.

That shouldn't be the case.

If this is expected then I would think most users would have to disable retries as several minute hangs on a function call is never ideal.

The request timeout parameters are designed to "scale" for both large and small requests. It is not unheard of for someone to want to flush several hundred MiB from a dataframe at once; such an operation needs a bigger timeout than sending a few hundred rows. The request_min_throughput default of 100 KiB/s is possibly overly conservative and should be higher by default.

What would my control mechanism be for the single timeout? For example, I want it to try only once for 10 seconds, what exact parameters would I choose?

You should get this by setting request_min_throughput=0 (so the timeout no longer scales with the buffer size) and request_timeout=10000.

This would give you two requests of 10 seconds each; or set request_timeout=5000 for 10 seconds in total for the flush (5s for the original request, 5s for the retry).
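
If you truly want a single attempt capped at 10 seconds, a minimal sketch (assuming, as noted above, that retry_timeout=0 disables retries and request_min_throughput=0 disables the throughput-based scaling; host and port are placeholders):

    from questdb.ingress import Protocol, Sender

    sender = Sender(Protocol.Http, 'localhost', 9000,
                    request_timeout=10_000,    # 10 s for the single request
                    request_min_throughput=0,  # don't extend the timeout for large buffers
                    retry_timeout=0)           # no retries
    sender.establish()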

Either way, whether this fixes your issue or not, let me know: timeouts can be fiddly and feedback is most appreciated. Specifically, I'd be curious to know what your buffer size is.

nicholas-a-guerra commented 7 months ago

I very much appreciate the detailed response. Based on this information, I'm fairly confident there must be a bug. In my specific use case the buffer grows by 9895 bytes per second, very precisely, but I cleared the entire buffer if it grew past 10000000 bytes. I tried many settings before reaching out, and your suggestion was essentially already attempted; I specifically tested request_min_throughput=0 with the other settings left at their defaults.

With this and the growing buffer I saw the flush keep failing, and each flush would individually take more and more time; I saw over 800 seconds for a single flush with this exact configuration. It would make sense for the growing buffer to cause an increase in time, but I thought retry_timeout (default of 10 seconds) would supersede this, or, failing that, that request_min_throughput=0 would circumvent the request_timeout + (buffer_size / request_min_throughput) calculation. Based on the output, this is currently not the case: neither prevents the growth of the flush call.

I also agree that in some situations it is entirely feasible for the full transaction to take a very long time for a large buffer. What I really mean is that it is never ideal to not be able to control this. Right now I have no way to keep the flush call from taking over 13 minutes.

Thank you for taking time away from your holiday! I don't want to keep you from taking time off. With both auto_flush and this behaving oddly, though, maybe another teammate could join in? Although I realize these things take time. Maybe it would be better to go back to the more stable previous client until these items are worked out?

amunra commented 7 months ago

ILP/HTTP is new in 2.x.x.

Instead of downgrading you can temporarily swap over to TCP: the functionality there is largely unchanged.

Note that TCP does not do requests or retries.
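
For example, a minimal sketch of the swap (assuming the default ILP/TCP port of 9009 and no auth/TLS; the row-building and flush code stays the same):

    from questdb.ingress import Buffer, Protocol, Sender

    sender = Sender(Protocol.Tcp, 'localhost', 9009)  # ILP over TCP instead of HTTP
    sender.establish()

    buffer = Buffer()
    # ... buffer.row(...) as before ...
    sender.flush(buffer)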

I'll take a look at this next week, possibly the second half.

Unless you have production urgency, I recommend just waiting for me to investigate further.

I'll come back to you in case I have more questions.

nicholas-a-guerra commented 7 months ago

That’s true, I’ll probably go back to tcp like you suggested on the same client version. I appreciate your assistance with this. Looking forward to hearing more.

nwoolmer commented 7 months ago

Noting the relevant Slack thread for posterity: https://questdb.slack.com/archives/C1NFJEER0/p1711771491609659

amunra commented 7 months ago

I've instrumented the Rust code (which implements the logic).

For example setting up a sender like so

    let mut sender = server
        .lsb_http()
        .request_timeout(Duration::from_millis(50))?
        .request_min_throughput(0)?
        .retry_timeout(Duration::from_millis(100))?
        .build()?;

against a server that never accepts the incoming HTTP connection logs the following:

Sending request :: Instant { tv_sec: 191948, tv_nsec: 325492375 }
    took 51.201042ms
Entering request retry loop, outstanding: 100ms :: Instant { tv_sec: 191948, tv_nsec: 376776291 }
Sleeping for 12ms / outstanding: 99.905292ms
Retrying request :: Instant { tv_sec: 191948, tv_nsec: 391962708 }
    took 51.152875ms, outstanding: 33.719792ms
Sleeping for 23ms / outstanding: 33.66525ms
Retrying request :: Instant { tv_sec: 191948, tv_nsec: 471223250 }
    took 51.107708ms, outstanding: 0ns
in total took 196.8955ms

The params correctly drive the expected behaviour: each request is capped at roughly the 50 ms request timeout, retries stop being started once the 100 ms retry budget runs out, and the whole flush completes in about 200 ms.

I'll now check the Python wrapper (which this issue is raised against) and see if the error is somewhere in the bindings.

Currently, I have yet to replicate your issue.

nicholas-a-guerra commented 7 months ago

Thanks for looking into this. I'm not sure how important these items are for reproducing the problem, but when I encountered it I was also facing a high-latency, high-packet-loss connection and an increasing buffer size.

Maybe these two conditions could be replicated as well when testing specifically with the Python client (I was able to reproduce the bad connection myself by connecting the client to a very distant WiFi AP). Lastly, I tested with request_min_throughput set both to 0 and to the default, so maybe add both variations of that as well. Looking forward to hearing more!

amunra commented 7 months ago

I believe this is simply a consequence (side-effect) of https://github.com/questdb/py-questdb-client/issues/73.

Version 2.0.0 contains a bug where the last-flushed timestamp isn't maintained correctly. I suspect this causes a flush on every row, which would explain the long delays you're seeing.

nicholas-a-guerra commented 7 months ago

If you take a look at the code I provided, that would not explain the issue: I no longer have a row call on the sender anywhere in my code. Because of that bug (https://github.com/questdb/py-questdb-client/issues/73), which I discovered, I now have to maintain my own buffer and call flush manually with the buffer I build independently. So flush can only be called on the interval I define, in this case once every 10 seconds. I believe looking at the code I provided will help clear this up.

nicholas-a-guerra commented 7 months ago

Specifically, I timed one explicit flush call with all defaults hanging for over 800 seconds. I included the exact code and log output in the first comment.

nicholas-a-guerra commented 7 months ago

To be precise, this exact line took over 800 seconds with a default sender:

sender.flush(buffer)

If this is not supposed to be possible, then there is a bug, because that is exactly what happened; the logs above show how I proved it.

amunra commented 7 months ago

I still have no luck reproducing this. Can you provide me with a script that reproduces the issue?

nicholas-a-guerra commented 7 months ago

Yeah I can do that, I’ll try to provide it today if possible, thanks

nicholas-a-guerra commented 7 months ago

Hi @amunra, I've got a very minimal script that showcases the issue. For reference, I updated to 2.0.2 before testing and tested on multiple computers with identical results. If you run the script exactly as written below, with no changes, you will see that the timing aligns with the comment next to each sender construction. I know the IP is not a valid one; leave it as is to reproduce the problem. It occurs when there is no/bad connection to the database.

from questdb.ingress import Buffer, Sender, Protocol, TimestampNanos
from time import time_ns, perf_counter

sender = Sender(Protocol.Http, '1.2.3.4', 9000, retry_timeout=1, request_timeout=1, request_min_throughput=1) # 30 seconds
# sender = Sender(Protocol.Http, '1.2.3.4', 9000, request_min_throughput=0) # 60 seconds
# sender = Sender(Protocol.Http, '1.2.3.4', 9000) # 60 seconds
sender.establish()
buffer = Buffer()
buffer.row('table_name',
           symbols={'a':'a','b':'b','c':'c'},
           columns={'d':1,'e':2,'f':3},
           at=TimestampNanos(time_ns()))
start = perf_counter()
try:
    sender.flush(buffer)
except:
    pass
print(perf_counter()-start)

The intention is to be able to control the timeout in some reasonable way. This is a single row which is taking up to 60 seconds. In our design, the buffer is in a list that locks on flush. So a flush that blocks for 60 seconds when the expected time is roughly 0.1 second is beyond a usable range.

The conditions under which it took 822 seconds (the original post) involved an extremely high-latency connection. There's no way I can effectively reproduce that, so I won't be able to give you code that reproduces the extremeness of the effect. I believe that under those conditions some responses are being received from the database, and in some way that resets an internal timer, extending beyond the 30-60 seconds you see from the test above.

Let me know if this information makes sense or if I need to provide more context. Thanks for the assistance.

amunra commented 7 months ago

I appreciate the thoroughness, thanks for this!

I'll take a look next week.

nicholas-a-guerra commented 6 months ago

@amunra checking in to see if there might have been any progress on this?

mtopolnik commented 6 months ago

Hi Nick, I've tried out your snippet, and the first result (30 seconds with all three Sender timeouts set to 1 ms) is explained by the default 30-second connect timeout on the ureq client we're using. While this should be simple to improve, I think it does not explain all the problems you described above.

The other two examples are, I think, due to the same issue, the only difference being that our default timeouts allow a 2nd connection attempt, adding another 30 seconds.

nicholas-a-guerra commented 6 months ago

Understood. Producing minimal reproducible code for the extended issues I described proved difficult; I believe it had to do with high-latency connections, and somehow getting some response seemed to exacerbate the problem, which was hard to recreate with code alone. But even with only part of the issue identified, that's still great news. At least getting that part taken care of will make the library much more predictable and therefore usable.