redis / redis-py

Redis Python client

question about performance #2794

Open Arnold1 opened 1 year ago

Arnold1 commented 1 year ago

Hi,

I have a question about inserting 1 million key/value pairs into Redis. Each key is a string of length 30, and each value is around 30 KB of bytes.

With 1 million key/value pairs at 30 KB each, that is 30 GB in total. I could create batches of 10K elements × 30 KB = 300 MB each. Is there a rule of thumb for how big the batch size should be? I could try different sizes and measure...

I saw there are 2 options:

I assume that creating pipe = redis_client.pipeline() and calling pipe.set is a bit faster than redis_client.mset, since it's not an atomic operation? But then I need to do the error handling myself, etc.

Is there another option I missed?

I was reading some info here: https://redis.io/docs/manual/patterns/bulk-loading/ - does pipe = redis_client.pipeline() with pipe.set do the same thing internally?
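For concreteness, here is a rough sketch of the two options I mean (assuming a local Redis, dummy 30 KB values, and a batch size that is just a guess):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

BATCH_SIZE = 10_000          # tune by measuring; no universal rule of thumb
value = b"x" * 30_000        # ~30 KB dummy payload
keys = [f"{i:030d}" for i in range(1_000_000)]  # 30-character keys

# Option 1: one MSET per batch (single command per round-trip)
for start in range(0, len(keys), BATCH_SIZE):
    batch = keys[start:start + BATCH_SIZE]
    r.mset({k: value for k in batch})

# Option 2: many SETs per batch, sent together in a pipeline
# transaction=False avoids wrapping the batch in MULTI/EXEC
for start in range(0, len(keys), BATCH_SIZE):
    pipe = r.pipeline(transaction=False)
    for k in keys[start:start + BATCH_SIZE]:
        pipe.set(k, value)
    pipe.execute()
```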

steve-mavens commented 1 year ago

I don't speak on behalf of the redis-py project, but:

The difference between mset and pipeline is whether you want to create one big Redis command per batch of keys processed in a single round-trip to Redis (mset), or several smaller Redis commands per batch of keys (pipeline) that, thanks to the way pipelines work, will still be sent to Redis together. If that sounds like not much of a difference -- indeed, it might not make much difference! I think a pipeline is more flexible, though, since you have the option of sending lots of SET commands together without making them a single transaction. With mset you're tying together the size of data sent in one go over the network and Redis's internal locking. There's nothing to stop you doing both: mset N keys at a time, and bundle M mset calls into a pipeline, for a total of N * M keys per pipeline. But that's probably too many variables to optimise.
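A rough sketch of that "do both" idea, with N and M picked arbitrarily just to show the shape of it:

```python
import redis

r = redis.Redis()
value = b"x" * 30_000

N = 1_000   # keys per MSET
M = 10      # MSET calls per pipeline -> N * M keys per round-trip

keys = [f"{i:030d}" for i in range(100_000)]

for start in range(0, len(keys), N * M):
    chunk = keys[start:start + N * M]
    pipe = r.pipeline(transaction=False)
    # each iteration queues one MSET covering N keys
    for i in range(0, len(chunk), N):
        pipe.mset({k: value for k in chunk[i:i + N]})
    pipe.execute()
```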

Pipelines are somewhat similar to the bulk loading. One optimisation of the bulk loader is that because it takes Redis protocol data as its input, it never even has to "understand" the syntax of its input, or find the boundaries between commands. It just writes bytes from the input to the Redis socket. A pipeline is still serialising one command at a time, converting them to Redis protocol. But it is writing a series of Redis-protocol commands to the socket at once, same as redis-cli --pipe does.

I haven't looked at the source for redis-cli for the exact details, but the docs say it starts reading data back from the response while it's still writing: "At the same time it reads data when available, trying to parse it." redis-py pipelines do not do this; see _execute_pipeline and _execute_transaction in https://github.com/redis/redis-py/blob/master/redis/client.py. First they write all the commands, then they read all the responses, all in the same thread.

It's possible this could make a difference in some circumstances, depending on how the network and the Redis server are affected by the fact that by the time you've finished writing, there's a lot of data backed up in the response. It could potentially even deadlock, if the server has a limit on how much data it will buffer when the output isn't being consumed. But quite possibly the Redis server has no such limit. If it has enough RAM to store your 30 GB of values, then it has approximately enough RAM to store a million small responses too!

Also on bulk loading - part of the point of using a Redis client like redis-py is so that you don't have to understand and use the Redis protocol. But if the redis-py client is taking too long to load your million keys, and you can't optimise it, then bulk loading could be the next thing to try. Bear in mind it requires a separate dependency in your application, since you'll need the redis-cli program installed. Still, it might be worth running this once to give you a target time: if your redis-py code is running close to this time then there's not much room left to optimise your redis-py code, and you can stop.
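If you do go the bulk-loading route, the page you linked shows generating the protocol directly and feeding it to redis-cli --pipe. A rough Python equivalent of that idea (the file name and key format here are just placeholders) might look like:

```python
def gen_redis_proto(*args) -> bytes:
    """Encode one command in RESP, the format redis-cli --pipe expects."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        arg = arg if isinstance(arg, bytes) else str(arg).encode()
        out.append(f"${len(arg)}\r\n".encode() + arg + b"\r\n")
    return b"".join(out)

# Write a protocol file, then feed it to redis-cli:
#   cat bulk.proto | redis-cli --pipe
value = b"x" * 30_000
with open("bulk.proto", "wb") as f:
    for i in range(1_000_000):
        f.write(gen_redis_proto("SET", f"{i:030d}", value))
```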

github-actions[bot] commented 2 weeks ago

This issue is marked stale. It will be closed in 30 days if it is not updated.