r2evans opened this issue 2 months ago
Hi - I've been meaning to reply to this, but am a little confused here about what is missing.
When we've been storing data from R into Redis, we typically either store a string that will come back as a string, or serialised binary data (usually in rds format via `redux::object_to_bin`, which is just a wrapper around `serialize`).
Are you imagining/requiring some storage format where the internals of the object are available within redis? This should be fairly "easy" to do - the relevant code is here: https://github.com/richfitz/redux/blob/master/src/conversions.c#L187-L190 - so it would just be a case of adding something to change the serialisation there, and again on the reverse.
If you are able to cook up a proof-of-concept that shows the problem clearly, that would make a good starting point.
It might be under-informed, tbh, or a knee-jerk reaction based on other things I'm doing.
I believe that data is typically stored in redis/valkey as json. If that data is passed over-the-wire to R as a string, that's one thing. But if that string is passed into R-space as a string and then handed to `jsonlite::fromJSON` or some other deserializer, then the global string pool problem comes into play.
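To make that concrete, here is the pattern I mean (a sketch; `big_json_key` is a hypothetical key, and `R` is a `redux::hiredis()` handle):

```r
# The whole reply is materialised as one large R string before any
# parsing happens, so it passes through R's global string pool first.
txt <- R$GET("big_json_key")    # character(1), potentially megabytes
dat <- jsonlite::fromJSON(txt)  # parsing starts only after that cost is paid
```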
Perhaps this just needs feel-good confirmation for me, then: when serialized data comes back from redis and is deserialized, we do not pay the price of the string pool - is that correct? (I admit that I might have been up late and tired when I first drafted this issue; I should have come back to revisit or edit it.)
Right - I think I'm starting to understand. How much of an issue your problem is depends a lot on how one is using Redis/redux.
If you're in control of the whole pipeline and you are using R for everything at both ends, then you can serialise to binary and put that into redis directly. If you do that, there is no cost to pay for dealing with strings, and nothing involves json at all. This is our typical use (see for example rrq).
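For instance, a minimal round-trip sketch (assuming a redis server on localhost):

```r
# Store an R object as serialised binary data and read it back;
# the value never exists as an R string, so no json and no string
# pool are involved.
r <- redux::hiredis()
r$SET("mydata", redux::object_to_bin(mtcars))
identical(redux::bin_to_object(r$GET("mydata")), mtcars)
# [1] TRUE
```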
If you are using redux to interface with an application where someone else is serialising data into json, and you want to deserialise that data on egress into R, then you might benefit from deserialising into R within the C code. Unfortunately, doing that involves all of the usual headaches of deserialising JSON to R (though that's thankfully less terrible than going the other way). If only a small subset is handled (e.g. a data.frame stored by row or by column), that's theoretically fine, but you can imagine that this path leads to rewriting all of jsonlite by hand. (It does not offer a C API, I believe.)
If you want to work up a proof of concept and find out what the headline speed gain could possibly be, that would be great. Alternatively, it's possible that something in Redis' json API might allow you to slice and dice your data before sending it over the wire, which could potentially be the ideal solution.
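Something like this, perhaps (untested sketch; it assumes the RedisJSON module is loaded on the server, since `JSON.SET`/`JSON.GET` are module commands rather than core redis):

```r
# Ask the server to extract just one field with a JSONPath selector,
# so only a small slice ever crosses the wire into R.
r <- redux::hiredis()
r$command(list("JSON.SET", "doc", "$", as.character(jsonlite::toJSON(mtcars))))
r$command(list("JSON.GET", "doc", "$[*].mpg"))
```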
You hit it on the head: sharing the data with non-R clients.
I wasn't aware of the JSON API within redis, though frankly that does not help much here, since our use is to grab "the whole thing" (data partitioning will be handled before storage). Even if we do subset in-redis, we'd still pay a price, albeit a slightly smaller one.
Yes, I was thinking something akin to `fromJSON`-in-C. Thanks for wading through and getting to that point. I'll look into the `jsonlite` source and see what jumps out at me ...
(This is further challenged by the fact that R deserialization is rather straightforward with no "options"; we cannot say the same for json objects.)
I've done some more testing, thank you for your patience in this discussion.
The primary point of my issue is that I want to transfer large-ish data from a redis structure of some sort (hash, time-series, whatever) into an R-friendly object, as efficiently as possible. The clear problem-child here is R's string pool, where retrieving a large string incurs a larger cost than in most languages.
However, binary objects and connections seem to work just fine. For instance,
```r
### msgpack
obj <- RcppMsgPack::msgpack_pack(mtcars[1:3,])
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42") |>
  RcppMsgPack::msgpack_unpack() |>
  # klunky, fragile, only works because we know the data is rectangular
  with(setNames(as.data.frame(lapply(value, unlist)), unlist(key)))
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
```
But with json-encoded data, whether string or raw, it is always returned as a string.
```r
obj <- jsonlite::toJSON(mtcars[4:6,])
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42")
# [1] "[{\"mpg\":21.4,\"cyl\":6,\"disp\":258,\"hp\":110,\"drat\":3.08,\"wt\":3.215,\"qsec\":19.44,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Hornet 4 Drive\"},{\"mpg\":18.7,\"cyl\":8,\"disp\":360,\"hp\":175,\"drat\":3.15,\"wt\":3.44,\"qsec\":17.02,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"Hornet Sportabout\"},{\"mpg\":18.1,\"cyl\":6,\"disp\":225,\"hp\":105,\"drat\":2.76,\"wt\":3.46,\"qsec\":20.22,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Valiant\"}]"

obj <- charToRaw(jsonlite::toJSON(mtcars[7:9,]))
head(obj)
# [1] 5b 7b 22 6d 70 67
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42")
# [1] "[{\"mpg\":14.3,\"cyl\":8,\"disp\":360,\"hp\":245,\"drat\":3.21,\"wt\":3.57,\"qsec\":15.84,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Duster 360\"},{\"mpg\":24.4,\"cyl\":4,\"disp\":146.7,\"hp\":62,\"drat\":3.69,\"wt\":3.19,\"qsec\":20,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 240D\"},{\"mpg\":22.8,\"cyl\":4,\"disp\":140.8,\"hp\":95,\"drat\":3.92,\"wt\":3.15,\"qsec\":22.9,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 230\"}]"
```
Is there a way to force `GET` to return the binary object instead of "knowing" it's text and converting it?
Not at present - here's the heuristic: https://github.com/richfitz/redux/blob/master/src/conversions.c#L228-L255
This gets called from a bunch of places as we build a list of redis replies (not everything is as simple as a `GET`; see further up that file for details - it's hopefully fairly easy to follow). So if it's "just" a case of preventing conversion to a string, then that's easy enough, but working out which fields to do this for remains hard, unless you want to do it for all results.
At this point it's an interface issue, and one I don't have a strong opinion about. However, it does seem much simpler than trying to do the deserialisation in C!
I was thinking of something deliberate, such as `R$GET("quux42", as="raw")` or `R$GET("quux42", asraw=TRUE)`. The former allows for future expansion; the latter is single-purpose-simple.
I admit that I don't know offhand which other redis verbs could benefit from this; I'm sure it's "some" at least.
Can you have a look at #61, which adds control over conversion at the lower-level interface? You should be able to see the real-world performance impact of this in your application, and the description shows how to use `GET` in this way.
Okay, three (hopefully useful) points on your branch.
First, defaulting to `"raw"` is a breaking change for `$command`, I suspect. I suggest `as=NULL` default to `"auto"`.
Second, `R$command` is still single-argument:

```r
R <- redux::hiredis(port=16379) # as before
R$command(list("GET", "key"), "raw")
# Error in R$command(list("GET", "key"), "raw") : unused argument ("raw")
R$command
# function (cmd)
# {
#     redis_command(ptr, cmd)
# }
# <bytecode: 0x6435764515b8>
# <environment: 0x6435870ee9e8>
```
I'm able to get to it directly by using `redux:::redis_command` and supplying a raw pointer.

```r
R2ptr <- redux:::redis_connect_tcp("localhost", 16379) # is there a better way to get at this from R above?
redux:::redis_command(R2ptr, list("GET", "key"), "raw") # works as documented
```
Third, some quick benchmarks on (for me) representative data: frames that vary from 500-1400 rows with 74 columns (11 string, 1 POSIXt, the remainder int/float).
- `r` is using R's `serialize`
- `json` uses `jsonlite`
- `msgpack` uses `RcppMsgPack`
- `*_raw` uses the new `"raw"` mechanism you are introducing in gh-60; the others are using `R$GET` and pulling into R in the normal way

```
### redux-1.1.4
# A tibble: 3 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                 time             gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                 <list>           <list>
1 r          895.53µs   1.02ms      966.    1.62MB     0      484     0      501ms <NULL> <Rprofmem [79 × 3]>    <bench_tm [484]> <tibble [484 × 3]>
2 json         87.2ms  92.95ms      10.8    5.11MB     0       10     0      925ms <NULL> <Rprofmem [3,002 × 3]> <bench_tm [10]>  <tibble [10 × 3]>
3 msgpack      4.23ms   4.63ms      211.    1.66MB     6.60    96     3      454ms <NULL> <Rprofmem [127 × 3]>   <bench_tm [99]>  <tibble [99 × 3]>
```
```
### redux-1.1.5 # gh-60
# A tibble: 6 × 13
  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                 time             gc
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                 <list>           <list>
1 r           914.29µs   1.01ms      973.    1.62MB     0      487     0      501ms <NULL> <Rprofmem [79 × 3]>    <bench_tm [487]> <tibble [487 × 3]>
2 r_raw        868.6µs    1.4ms      747.    1.62MB     0      374     0      501ms <NULL> <Rprofmem [79 × 3]>    <bench_tm [374]> <tibble [374 × 3]>
3 json         84.65ms  89.11ms      11.3    5.11MB     0       10     0      889ms <NULL> <Rprofmem [3,002 × 3]> <bench_tm [10]>  <tibble [10 × 3]>
4 json_raw     85.36ms  87.02ms      11.5   11.37MB     1.28     9     1      783ms <NULL> <Rprofmem [3,073 × 3]> <bench_tm [10]>  <tibble [10 × 3]>
5 msgpack       4.15ms   4.34ms      227.    1.66MB     4.16   109     2      480ms <NULL> <Rprofmem [127 × 3]>   <bench_tm [111]> <tibble [111 × 3]>
6 msgpack_raw    4.2ms   4.35ms      225.    1.66MB     6.38   106     3      470ms <NULL> <Rprofmem [127 × 3]>   <bench_tm [109]> <tibble [109 × 3]>
```
For those benchmarks, I am very surprised that the memory consumption of `json` is better than that of `json_raw`. I don't know if that means I'm doing it wrong, if passing huge strings is not crushing me like I feared it would, or if the raw vectors are somehow otherwise problematic, similar to the string pool. (I have confirmed that `R$GET(.)` behaves the same, and the new `command(..)` method returns a raw object.)
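One thing I still need to rule out (speculation on my part): `jsonlite::fromJSON` wants character input, so if the raw reply has to be converted before parsing, the big string gets created anyway. Roughly (hypothetical helper, not my actual benchmark code):

```r
# If rawToChar() is needed to feed fromJSON(), the full JSON string is
# materialised on top of the raw copy, which would be consistent with
# json_raw allocating more (11.37MB vs 5.11MB) rather than less.
json_raw_get <- function(key) {
  bytes <- R$command(list("GET", key), "raw")  # raw vector, no string yet
  jsonlite::fromJSON(rawToChar(bytes))         # large string created here anyway
}
```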
Either way, your change is a working proof of concept, though until I find out why `mem_alloc` is no better for `json_raw` than for `json`, I don't know that gh-60 enables anything.
For the record: R-4.3.3 in emacs/ess, on ubuntu-24.04 (linux-6.8.0), with 64GB of RAM.
I haven't studied your code, so a naive question: is there a chance the underlying code pulls a string into R and then treats it as raw for the user?
I have updated `$command()`, so that should work as described now and simplify your explorations.

> I haven't studied your code, so a naive question: is there a chance the underlying code pulls a string into R and then treats it as raw for the user?
No, not within redux - this is the relevant line: https://github.com/richfitz/redux/blob/5d7211f5b4d4570ab37739cdd19c88ebb364e85b/src/conversions.c#L255
This is the same codepath that actual binary data would go through, and that definitely does not get converted into a string.
Okay ... then it's one of the other things I don't understand. I'll push for our partners to use msgpack instead; it appears to be fairly good here, with the same memory cost as native-R serialization. In the meantime, I'll keep looking at what I'm doing wrong with jsonlite.
The notion that redis is storing strings is fine, but R is unique among most languages in that strings can be particularly punishing. When retrieving larger objects (e.g., a 1000-row frame), pulling the JSON (or however it is stringified, depending on the creation mechanism) as a string, bringing it into R memory, and then deserializing from that string can be much less efficient than it strictly needs to be.
What are your thoughts on including (which means writing from scratch, I believe) inline (de)serialization of data?
In my use-case, we have a rather large in-redis cache of relatively large amounts of data. The efficiency of in-memory caching of large objects is not my point here (since a partner company is hosting and pushing data to their redis in a cloud). The long-term storage is an arrow datamart, but for many other (non-R) apps they are using redis as a cache. The total dataset is in the millions of rows, but each redis object is a 300-1000 row (70+ column) frame. Just deserializing takes an extra 60MB (300-row frame) above what is actually used once deserialized, and all apps load hundreds of thousands of these frames at once, so 60MB will add up. (R's global string pool.) (For reference, `toJSON(dat)` strings are between 465K-1553K characters. Not huge, but thousands of these add up.)

Clearly this doesn't need to support every serialization mechanism that can work with redis, but some industry standards might be R's native format and JSON.