ccaputo closed this issue 4 years ago
Thanks for sharing this data. Sounds like the focus should first be on the `!!` queries, as they are the bulk of the performance impact.
Background on this: 4.0 was multi-threaded. Threads are cheap, and share their memory. IRRd has a preload store which contains the data to answer `!g`/`!6` queries, and therefore the second half of `!a` queries, as they are a combination of `!i` and `!g`/`!6`. The preload store was an in-memory dictionary, which is really fast and easy to share when using threads.
However, in a single Python process only a single thread can execute Python code at a time (due to the GIL), i.e. all of IRRd 4.0 is capped to a single CPU core. Especially with the addition of RPKI support, the concurrent CPU usage is just too high for that, so 4.1 is multiprocess. This means concurrent processes in IRRd don't fight over CPU time anymore.
Multiprocessing makes sharing a dict in memory quite painful, so this data is now stored in Redis. This has a slightly higher latency, and that is some of the impact you're seeing. There is a mitigation for this: once a query handler has seen more than 5 queries that benefit from preloading on the same connection, it will load the entire Redis store into local memory, which is much faster, with the expectation that more queries will follow and it's worth spending the time to load the data.
So performance with concurrent processes (including mirroring) is improved, but with a small latency cost.
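To illustrate the mitigation described above, here is a minimal sketch (hypothetical names, with a plain dict standing in for the Redis store; not IRRd's actual code) of promoting a remote preload store to local memory once a connection has issued enough queries:

```python
# Illustrative sketch of the promotion strategy described above: serve the
# first few preload lookups from a remote store (Redis in IRRd), then copy
# the whole store into local memory once a connection has issued enough
# queries to make the bulk load worthwhile.

PROMOTION_THRESHOLD = 5  # after this many preload-backed queries, go local


class PreloadClient:
    def __init__(self, remote_store):
        # remote_store stands in for the Redis connection; here it is any
        # mapping that supports per-key reads and a full dump.
        self.remote = remote_store
        self.local = None          # becomes a dict after promotion
        self.query_count = 0

    def lookup(self, key):
        self.query_count += 1
        if self.local is None and self.query_count > PROMOTION_THRESHOLD:
            # One bulk read; every later lookup is a plain dict hit.
            self.local = dict(self.remote)
        store = self.local if self.local is not None else self.remote
        return store.get(key)


if __name__ == "__main__":
    client = PreloadClient({"origin:AS64500": ["192.0.2.0/24"]})
    for _ in range(6):
        client.lookup("origin:AS64500")
    print(client.local is not None)  # promoted on the 6th query
```

The trade-off is the same one described above: the bulk load costs time and memory up front, but only connections that issue many preload-backed queries pay it.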
Are you already using a unix socket to connect to Redis? TCP connections to Redis are supported, but have a much higher latency.
On why the `!!` queries are slower, which is the most significant issue in your use case, I'm puzzled. My first thought was a delay caused by process management, but the elapsed time that is logged purely concerns the handling of that specific query - not the initialisation of the query parser or anything that precedes it. I will dig deeper.
@ccaputo follow-up question: are you using a unix socket for PostgreSQL, or TCP?
It looks like part of the delay may be connecting to the SQL database, which 4.1 does upon processing the first query, hence it affects the `!!` response time. Previously we used connection pooling, but those pools cannot be shared across processes.
> Are you already using a unix socket to connect to Redis? TCP connections to Redis are supported, but have a much higher latency.
>
> @ccaputo follow up question: are you using a unix socket for PostgreSQL, or TCP?
I am using a unix socket for both:

```yaml
irrd:
    database_url: 'postgresql://[...]@/irrd'
    redis_url: 'unix:///tmp/redis.sock'
```
> It looks like part of the delay may be connecting to the SQL database, which 4.1 does upon processing the first query, hence it affecting the `!!` response time. Previously we used connection pooling, but those pools can not be shared across processes.
Since `!!`, along with the client version command `!n`, are common startup commands that don't involve database lookups, could they, along with any other simple immediately-processable commands, be handled in the main TCP handler, while `!!` concurrently/asynchronously triggers any needed Redis/PostgreSQL connections in expectation of subsequent commands?
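Roughly, the proposal could look like this (a hypothetical sketch with invented names, not IRRd code): trivial commands are answered inline, and `!!` kicks off backend connection setup in the background so a later database-backed query finds it ready:

```python
# Hypothetical sketch of the suggestion above (not IRRd's implementation).
import threading


class ConnectionHandler:
    """Answers trivial commands inline; warms up backends asynchronously."""

    def __init__(self, connect_backends):
        self._connect = connect_backends       # e.g. opens SQL + Redis
        self._backend_ready = threading.Event()
        self._warmup_started = False

    def _start_warmup(self):
        if not self._warmup_started:
            self._warmup_started = True
            threading.Thread(target=self._warm_up, daemon=True).start()

    def _warm_up(self):
        self._connect()
        self._backend_ready.set()

    def handle(self, line):
        if line == "!!":
            # Keepalive: answer instantly, but expect real queries soon,
            # so begin connecting to the backends in the background.
            self._start_warmup()
            return ""
        if line.startswith("!n"):
            return "ok"  # client-name command: no lookup needed
        # Database-backed query: ensure warm-up has started and finished.
        self._start_warmup()
        self._backend_ready.wait()
        return f"result for {line}"
```

The `"ok"` response and handler names are placeholders, not the actual whois protocol responses.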
In case useful, this is what multiple unsophisticated commands in a row in a single TCP session look like:
```
2020-06-30 15:09:15,152 irrd[11379]: [irrd.server.whois.server#INFO] 2001:db8::2:41728: sent answer to query, elapsed 0.02561289630830288s, 0 bytes: !!
2020-06-30 15:09:16,165 irrd[11379]: [irrd.server.whois.server#INFO] 2001:db8::2:41728: sent answer to query, elapsed 0.004909427836537361s, 0 bytes: !!
2020-06-30 15:09:17,311 irrd[11379]: [irrd.server.whois.server#INFO] 2001:db8::2:41728: sent answer to query, elapsed 0.004914136603474617s, 0 bytes: !!
2020-06-30 15:09:18,246 irrd[11379]: [irrd.server.whois.server#INFO] 2001:db8::2:41728: sent answer to query, elapsed 0.005097845569252968s, 0 bytes: !!
2020-06-30 15:09:37,438 irrd[11379]: [irrd.server.whois.server#INFO] 2001:db8::2:41728: sent answer to query, elapsed 0.004980882629752159s, 2 bytes: !nfoo
2020-06-30 15:09:45,628 irrd[11379]: [irrd.server.whois.server#INFO] 2001:db8::2:41728: sent answer to query, elapsed 0.005041765049099922s, 2 bytes: !nfoo
```
Note that each `!!` after the first takes around 0.0049 seconds, which is still 25 times slower than the 0.0002 seconds it takes with 4.0.8. That would still add about 1.8 minutes across the ~21.6k bgpq4 connections mentioned above, but if short-circuited in the TCP handler, that overhead might go away.
Thank you.
> Since `!!` along with the client version command `!n` are common startup commands that don't involve database lookups, could they along with any other unsophisticated immediately processable commands, be handled in the main TCP handler while for `!!` concurrently/asynchronously triggering any needed redis/postgres connections due to the expectation of subsequent commands?
I think for most use cases this would not have a significant effect. Right after someone does !! they'll likely send a query that uses the database, which means the delay will just move to that next query.
The approach I'm thinking of now is to use a pool of worker processes instead. These can keep a local cache of the preload data in-memory, and will already have a database connection. So this eliminates almost all of the forking, database connection, and Redis latency. It will increase the idle load a bit, but I don't expect that to have any negative impact. The upside of this approach is that it will also be useful for the HTTP API. However, it's somewhat complex, so I'm going to do some experimentation and see what the options are.
Sounds good!
Just wanted to share some thoughts; not sure if that's what you had in mind already. Re:
> The approach I'm thinking of now is to use a pool of worker processes instead. These can keep a local cache of the preload data in-memory, and will already have a database connection.
A typical good approach for this would be using a pool of worker processes, as you mentioned, where one (or more) are specifically designated to keep the cache data in memory and hold the database connection, while the rest handle the queries. The query-serving processes communicate with the cache worker(s) via a bidirectional multiprocessing queue (or even a ZMQ IPC bus, which I personally like very much, but that would add another dependency) to exchange data. This would avoid having many database connections.
Whatever the architecture you decide on, I'd give a big 👍 to anything that avoids Redis for this particular use case (although a big fan otherwise). Cheers!
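A minimal sketch of the proposed architecture (hypothetical names, a plain dict standing in for the cache and database; not IRRd code): one dedicated cache process answers lookups over a bidirectional `multiprocessing.Pipe`:

```python
# Rough sketch of the proposal above: a single cache process owns the
# in-memory preload data (and, in a real setup, the database connection);
# query-serving processes ask it for data over a multiprocessing Pipe.
import multiprocessing


def cache_worker(conn):
    # Stand-in for the real preload cache held by the dedicated worker.
    preload = {"AS64500": ["192.0.2.0/24"]}
    while True:
        request = conn.recv()
        if request is None:  # shutdown sentinel
            break
        conn.send(preload.get(request, []))


if __name__ == "__main__":
    parent_end, child_end = multiprocessing.Pipe()
    worker = multiprocessing.Process(target=cache_worker, args=(child_end,))
    worker.start()
    # Each preload lookup from a query-serving process is one round trip:
    parent_end.send("AS64500")
    print(parent_end.recv())
    parent_end.send(None)  # tell the cache worker to exit
    worker.join()
```

Only the cache worker needs a database connection; the cost is one IPC round trip per lookup, which is the latency concern raised in the reply below.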
> A typical good approach for this would be using a pool of worker processes, as you mentioned, where 1 (or more) are specifically designated to keep the cache data in-memory and the database connection, while the rest handle the queries. The query serving processes communicate via a bidirectional multi-processing queue with the cache worker(s) (or even a ZMQ IPC bus, which I personally like very much, but would add another dependency) to exchange data. This would avoid having many database connections.
I am reluctant about this, because I'm concerned every kind of intermediary will actually add latency. Even something as basic as multiprocessing Pipe seems to easily add 100-200 microseconds, which is just too long.
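The Pipe round-trip cost is easy to measure with a small echo benchmark like this (numbers vary by machine and payload size):

```python
# Measure multiprocessing.Pipe round-trip latency: echo a small payload
# through a child process and average the timing over many round trips.
import multiprocessing
import time


def echo(conn, n):
    # Child side: bounce every received message straight back.
    for _ in range(n):
        conn.send(conn.recv())


if __name__ == "__main__":
    rounds = 1000
    parent, child = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=echo, args=(child, rounds))
    proc.start()
    start = time.perf_counter()
    for _ in range(rounds):
        parent.send(b"!gAS64500")
        parent.recv()
    elapsed = time.perf_counter() - start
    proc.join()
    print(f"{elapsed / rounds * 1e6:.1f} microseconds per round trip")
```

Each preload lookup would pay this round-trip cost on top of the actual lookup, which is the overhead being weighed against the memory saving.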
This works and it's fast. The downside is that the part of in-memory preloading to avoid Redis uses a lot of memory, around 400 MB per long-running worker. So if you use the default of 50 simultaneous connections, you're looking at 20 GB memory use just for the whois server.
Thoughts on moving forward:
> Even something as basic as multiprocessing Pipe seems to easily add 100-200 microseconds, which is just too long.
I don't find that too long. If that's the price paid for not having to allocate a huge amount of resources just to run a whois server (yes, I honestly do find 20GB huge, no exaggeration), I'll take it: I prefer to have my server running at a decent 1-2GB (as it currently requires in my setup) and return my queries in 0.102s instead of 0.100s, and save the remaining 18GB for other apps. 🙂
Dear Mircea, can you describe your deployment and purpose for the software a bit more? Which organization is this for? It is possible you are using IRRd for a different objective than what others in this thread are using it for. IRRd v4 is pretty young software and we don't know the user base well yet.
In some deployments (such as NTT's) it is considered acceptable to burn memory like there is no tomorrow. We have to turn it all up to eleven because RR.NTT.NET is one of the world's largest and busiest IRRd servers.
Perhaps in the future we can add a button to make it easier for the operator to choose whether to optimise for speed or memory.
Hi @job. We (DigitalOcean) are using IRRd for our internal whois server, currently only mirroring several sources (so no export at the moment). We're using the local whois server mostly for `!g`, `!6` or `!i` queries (for generating IRR-based prefix lists), which I understand is also the use case reported in this thread.
I've noticed that the memory usage was around 14GB during the initial full import, and afterwards settled to around 1.5GB, which I'm very happy with. That said, if I were to upgrade to IRRd 4.1.0, jumping to over 20GB for the same functionality would be a little difficult to justify. (That is, without reducing performance by decreasing the number of workers, which, based on @mxsasha's clarification above, would need to drop to 4 workers to match my current resource allocation.)
> Perhaps in the future we can add a button to make it easier for the operator to choose whether to optimise for speed or memory.
Could be an alternative, for sure, although I worry that might overly complicate the implementation. Either way, it's totally your call; I just wanted to share my perspective as a user: I'd happily sacrifice a total of 2 seconds for every 10,000 queries rather than having to allocate several times more resources.
You may want to look at the `!a` query (this is what bgpq4 uses) to speed things up a bit.
I was actually able to save more memory than initially expected. The workers are now at 162MB in my test setup. That's low enough that we don't need to support separate modes for different priorities. If you want to run IRRd with less memory consumption, you can set a lower `server.whois.max_connections`. I have also updated the default for this setting to 10, so that by default, the whois workers will use around 1.6 GB (with the current codebase).
I think this is a reasonable balance: if you want to run IRRd in a low-memory setup, a maximum of 10 connections seems reasonable, because that usually pairs with having limited CPU cores available, which means you won't be able to benefit from many simultaneous queries anyway.
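For reference, a config excerpt lowering that cap might look like this (nesting inferred from the dotted setting name `server.whois.max_connections` and the `irrd:` config excerpt earlier in the thread; check the IRRd documentation for the exact layout):

```yaml
irrd:
    server:
        whois:
            max_connections: 10  # lower this to reduce whois worker memory use
```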
I believe this is fixed in https://github.com/irrdnet/irrd/commit/027f7806d47e084dc799d413c55bc980cac181b2 - it will be included in the next beta release.
Initial speed tests post 027f7806d47e084dc799d413c55bc980cac181b2 are very promising:

```
2020-07-06 16:40:58,178 irrd[8652]: [irrd.server.whois.server#INFO] 2001:db8::2:37942: sent answer to query, elapsed 0.00015023350715637207s, 0 bytes: !!
2020-07-06 16:51:42,169 irrd[8700]: [irrd.server.whois.server#INFO] 2001:db8::2:39024: sent answer to query, elapsed 0.00012947618961334229s, 0 bytes: !!
2020-07-06 16:56:36,800 irrd[8612]: [irrd.server.whois.server#INFO] 2001:db8::2:39440: sent answer to query, elapsed 0.00010169669985771179s, 0 bytes: !!
2020-07-06 16:57:20,684 irrd[8624]: [irrd.server.whois.server#INFO] 2001:db8::2:39524: sent answer to query, elapsed 0.00015059858560562134s, 0 bytes: !!
2020-07-06 16:57:29,200 irrd[8626]: [irrd.server.whois.server#INFO] 2001:db8::2:39530: sent answer to query, elapsed 0.0001093689352273941s, 0 bytes: !!
2020-07-06 16:59:46,723 irrd[8645]: [irrd.server.whois.server#INFO] 2001:db8::2:39744: sent answer to query, elapsed 0.00014530867338180542s, 0 bytes: !!
2020-07-06 17:00:57,210 irrd[8656]: [irrd.server.whois.server#INFO] 2001:db8::2:39840: sent answer to query, elapsed 0.00014537200331687927s, 0 bytes: !!
2020-07-06 17:07:01,912 irrd[8682]: [irrd.server.whois.server#INFO] 2001:db8::2:40408: sent answer to query, elapsed 0.00020211189985275269s, 0 bytes: !!
```
Currently blocked by https://github.com/irrdnet/irrd/issues/347.
I am seeing some lovely performance improvement with feb2cc7da759d7289ac04f71b3af060347411a5e.
In #323 I mentioned a performance decrease going from 4.0.8 to 4.1.0b4:
> NOTE: I have tested the quality of the results, with respect to the many thousands of queries we perform to inform our route servers. While the data with 4.1.0b4 is good as compared to 4.0.8, I am finding 4.1.0b4 to be considerably slower. As an example, we perform about 12k !gas## and 9.6k !6as### queries every hour. With 4.0.8 this process takes under 4 minutes, but with 4.1.0b4 it is taking over 14 minutes. Any idea as to why the slowdown?
That same task now takes about 3 minutes. Excellent!
Curious about and wanting to bang on `server.whois.max_connections`, as a test I set it to 1 and ran four of our route server scripts at once, resulting in many bgpq4 queries contending for a single TCP port. It all worked great. I also tested with settings of 2, 3, 4, and 8. All good.
Amazing work and thanks.
In https://github.com/irrdnet/irrd/issues/323 I mentioned a performance decrease going from 4.0.8 to 4.1.0b4. Roughly:

- `!!` has gone from as low as 0.0002 secs to as low as 0.0243 secs, or about a 121x slowdown. Since bgpq4 performs a `!!` upon each connection, this performance decrease appears to account for about 8.7 minutes of the change from <4 minutes to >14 minutes, since it is performed some ~21.6k times.
- `!gas6456` has gone from as low as 0.0002 secs to as low as 0.0044 secs, or about a 22x slowdown.
- `!a4AS-PCH` has gone from as low as 0.0114 secs to as low as 0.0210 secs, or about a 1.8x slowdown.

Details below...