redis / redis-py

Redis Python client

question about performance #2794

Open Arnold1 opened 1 year ago

Arnold1 commented 1 year ago

Hi,

I have a question about inserting 1 million key/value pairs into Redis. Each key is a string of length 30, and each value is around 30 KB of bytes.

With 1 million key/value pairs at 30 KB each, that is 30 GB in total. I could create batches of 10K elements × 30 KB = 300 MB each. Is there a rule of thumb for how big the batch size should be? I could try different sizes and measure...

I saw there are 2 options:

I assume that creating pipe = redis_client.pipeline() and calling pipe.set is a bit faster than redis_client.mset, since it's not an atomic operation? But then I need to do the error handling myself, etc.

Is there another option I missed?

I was reading some info here: https://redis.io/docs/manual/patterns/bulk-loading/ - does pipe = redis_client.pipeline() with pipe.set do the same thing internally?
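For concreteness, here is a rough sketch of the two options I mean (assuming a local Redis, dummy 30 KB values, and a batch size that is just a guess):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

BATCH_SIZE = 10_000          # tune by measuring; no universal rule of thumb
value = b"x" * 30_000        # ~30 KB dummy payload
keys = [f"{i:030d}" for i in range(1_000_000)]  # 30-character keys

# Option 1: one MSET per batch (single command per round-trip)
for start in range(0, len(keys), BATCH_SIZE):
    batch = keys[start:start + BATCH_SIZE]
    r.mset({k: value for k in batch})

# Option 2: many SETs per batch, sent together in a pipeline
# transaction=False avoids wrapping the batch in MULTI/EXEC
for start in range(0, len(keys), BATCH_SIZE):
    pipe = r.pipeline(transaction=False)
    for k in keys[start:start + BATCH_SIZE]:
        pipe.set(k, value)
    pipe.execute()
```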

steve-mavens commented 1 year ago

I don't speak on behalf of the redis-py project, but:

The difference between mset and pipeline is whether you want to create one big Redis command per batch of keys processed in a single round-trip to Redis (mset), or several smaller Redis commands per batch of keys (pipeline) that, thanks to the way pipelines work, will still be sent to Redis together. If that sounds like not much of a difference -- indeed, it might not make much difference! I think a pipeline is more flexible, though, since you have the option of sending lots of SET commands together without making them a single transaction. With mset you're tying together the size of data sent in one go over the network and Redis's internal locking. There's nothing to stop you doing both: mset N keys at a time, and bundle M mset calls into a pipeline, for a total of N * M keys per pipeline. But that's probably too many variables to optimise.
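A rough sketch of that "do both" idea, with N and M picked arbitrarily just to show the shape of it:

```python
import redis

r = redis.Redis()
value = b"x" * 30_000

N = 1_000   # keys per MSET
M = 10      # MSET calls per pipeline -> N * M keys per round-trip

keys = [f"{i:030d}" for i in range(100_000)]

for start in range(0, len(keys), N * M):
    chunk = keys[start:start + N * M]
    pipe = r.pipeline(transaction=False)
    # each iteration queues one MSET covering N keys
    for i in range(0, len(chunk), N):
        pipe.mset({k: value for k in chunk[i:i + N]})
    pipe.execute()
```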

Pipelines are somewhat similar to the bulk loading. One optimisation of the bulk loader is that because it takes Redis protocol data as its input, it never even has to "understand" the syntax of its input, or find the boundaries between commands. It just writes bytes from the input to the Redis socket. A pipeline is still serialising one command at a time, converting them to Redis protocol. But it is writing a series of Redis-protocol commands to the socket at once, same as redis-cli --pipe does.

I haven't looked at the source for redis-cli for the exact details, but the docs say it starts reading data back from the response while it's still writing: "At the same time it reads data when available, trying to parse it." redis-py pipelines do not do this; see _execute_pipeline and _execute_transaction in https://github.com/redis/redis-py/blob/master/redis/client.py. First they write all the commands, then they read all the responses, all in the same thread.

It's possible this could make a difference in some circumstances, depending on how the network and the Redis server are affected by the fact that by the time you've finished writing, there's a lot of data backed up in the response. It could potentially even deadlock, if the server has a limit on how much data it will buffer when the output isn't being consumed. But quite possibly the Redis server has no such limit. If it has enough RAM to store your 30 GB of values, then it has approximately enough RAM to store a million small responses too!

Also on bulk loading - part of the point of using a Redis client like redis-py is so that you don't have to understand and use the Redis protocol. But if the redis-py client is taking too long to load your million keys, and you can't optimise it, then bulk loading could be the next thing to try. Bear in mind it requires a separate dependency in your application, since you'll need the redis-cli program installed. Still, it might be worth running this once to give you a target time: if your redis-py code is running close to this time then there's not much room left to optimise your redis-py code, and you can stop.
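If you do go the bulk-loading route, the page you linked shows generating the protocol directly and feeding it to redis-cli --pipe. A rough Python equivalent of that idea (the file name and key format here are just placeholders) might look like:

```python
def gen_redis_proto(*args) -> bytes:
    """Encode one command in RESP, the format redis-cli --pipe expects."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        arg = arg if isinstance(arg, bytes) else str(arg).encode()
        out.append(f"${len(arg)}\r\n".encode() + arg + b"\r\n")
    return b"".join(out)

# Write a protocol file, then feed it to redis-cli:
#   cat bulk.proto | redis-cli --pipe
value = b"x" * 30_000
with open("bulk.proto", "wb") as f:
    for i in range(1_000_000):
        f.write(gen_redis_proto("SET", f"{i:030d}", value))
```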

github-actions[bot] commented 2 weeks ago

This issue is marked stale. It will be closed in 30 days if it is not updated.