richfitz / redux

:telephone_receiver::computer: Redis client for R
https://richfitz.github.io/redux

[discussion] (de)serialize data in "C"? #60

Open r2evans opened 2 months ago

r2evans commented 2 months ago

The notion that redis stores strings is fine, but R is unusual in that strings can be particularly punishing. When retrieving larger objects (e.g., a 1000-row frame), pulling the JSON (or however it was stringified at creation) back as a string, bringing it into R memory, and then deserializing from that string can be much less efficient than it strictly needs to be.

What are your thoughts on including (which means writing from scratch, I believe) inline (de)serialization of data?

In my use-case, we have a rather large redis-hosted cache of relatively large amounts of data. The efficiency of in-memory caching of large objects is not my point here (a partner company is hosting and pushing data to their redis in a cloud). The long-term storage is an arrow datamart, but many other (non-R) apps are using redis as a cache. The total dataset is in the millions of rows, but each redis object is a 300-1000 row (70+ column) frame. Just deserializing a 300-row frame takes an extra 60MB above what is actually used once deserialized (R's global string pool), and all apps load hundreds of thousands of these frames at once, so 60MB adds up. (For reference, toJSON(dat) strings are between 465K-1553K characters. Not huge, but thousands of these add up.)

Clearly this doesn't need to support every serialization mechanism that can work with redis, but covering a couple of common formats (R's native serialization and JSON, say) might be enough.
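To make the contrast concrete, here is a minimal local sketch of the two round-trip paths (no Redis involved; `deparse()`/`parse()` stand in for any string-based format such as JSON, purely for illustration):

```r
dat <- data.frame(x = 1:300, y = rep(letters[1:3], 100))

# Binary path: a raw vector never touches R's global string pool.
bin <- serialize(dat, NULL)
dat_bin <- unserialize(bin)

# String path: the whole payload must first exist as one large string
# (interned in the global string pool) before any deserializer sees it.
txt <- paste(deparse(dat), collapse = "\n")
dat_txt <- eval(parse(text = txt))

stopifnot(identical(dat_bin, dat), identical(dat_txt, dat))
```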

richfitz commented 1 month ago

Hi - I've been meaning to reply to this, but am a little confused here about what is missing.

When we've been storing data from R into Redis, we typically store either a string that will come back as a string, or serialised binary data (usually in rds format via redux::object_to_bin, which is just a wrapper around serialize).
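For anyone following along, that binary round trip can be sketched in base R, since object_to_bin is, per the above, just serialize() underneath. The hiredis calls are commented out because they need a live server, and the key name is made up:

```r
obj <- mtcars[1:3, ]

# What redux::object_to_bin(obj) does underneath:
bin <- serialize(obj, NULL)

# And redux::bin_to_object(bin):
obj2 <- unserialize(bin)
stopifnot(is.raw(bin), identical(obj2, obj))

# With a live server (connection details assumed):
# r <- redux::hiredis()
# r$SET("mykey", redux::object_to_bin(obj))
# identical(redux::bin_to_object(r$GET("mykey")), obj)  # raw in, raw out
```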

Are you imagining/requiring some storage format where the internals of the object are available within redis? This should be fairly "easy" to do - the relevant code is here: https://github.com/richfitz/redux/blob/master/src/conversions.c#L187-L190 so it would just be a case of adding something to change serialisation there, and again on the reverse.

If you are able to cook up a proof-of-concept that shows the problem clearly, that would make a good starting point.

r2evans commented 1 month ago

It might be under-informed, tbh, or a knee-jerk based on other things I'm doing.

I believe that data is typically stored in redis/valkey as json. If that data is passed over the wire to R as a string, that's one thing. But if that string is passed into R-space as a string and then handed to jsonlite::fromJSON or some other deserializer, then the global string pool problem comes into play.

Perhaps I just need feel-good confirmation, then: when serialized data comes back from redis and is deserialized, we do not pay the price of the string pool, is that correct? (I admit that I might have been up late and tired when I first drafted this issue; I should have come back to revisit or edit it.)

richfitz commented 1 month ago

Right - I think I'm starting to understand. How much of an issue your problem is depends a lot on how one is using Redis/redux.

If you're in control of the whole pipeline and you are using R for everything at both ends then you can serialise to binary and put that into redis directly. If you do that there is no cost to pay for dealing with strings and nothing involves json at all. This is our typical use (see for example rrq).

If you are using redux to interface with an application where someone else is serialising data into json, and you want to deserialise that data on egress into R, then you might benefit from deserialising into R within the C code. Unfortunately, doing that involves all of the usual headaches of deserialising JSON to R (though that's thankfully less terrible than going the other way). If a small subset is handled (e.g. a data.frame stored by row or by column) that's theoretically fine, but you can imagine that this path leads to rewriting all of jsonlite by hand. jsonlite does not offer a C API, I believe.

If you want to work up a proof of concept and find out what the headline speed gain could possibly be, that would be great. Alternatively, it's possible that something in Redis' json API might allow you to slice and dice your data before sending it over the wire, which would potentially be the ideal solution.

r2evans commented 1 month ago

You hit it on the head: sharing the data with non-R clients.

I wasn't aware of the JSON API within redis, though frankly that does not help much here, since our use is to grab "the whole thing" (data partitioning will be handled before storage). Even if we do subset in-redis, we'd still pay a price, albeit a slightly smaller one.

Yes, I was thinking something akin to fromJSON-in-C. Thanks for wading through and getting to that point. I'll look into the jsonlite source and see what jumps out at me ...

(This is further challenged by the fact that R deserialization is rather straightforward with no "options"; we cannot say the same for json objects.)

r2evans commented 1 month ago

I've done some more testing, thank you for your patience in this discussion.

The primary point of my issue is that I want to transfer large-ish data from a redis topic of some sort (hash, time-series, whatever) into an R-friendly object, as efficiently as possible. The clear problem-child here is R's string pool, where retrieving a large string invokes a larger cost than in most languages.

However, binary objects and connections seem to work just fine. For instance,

```r
### msgpack
obj <- RcppMsgPack::msgpack_pack(mtcars[1:3,])
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42") |>
  RcppMsgPack::msgpack_unpack() |>
  # klunky, fragile, only works because we know the data is rectangular
  with(setNames(as.data.frame(lapply(value, unlist)), unlist(key)))
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
```
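A slightly more defensive version of that last reshaping step, as a sketch: `kv_to_frame` below is a hypothetical helper (not part of redux or RcppMsgPack), and `unpacked` is fabricated to mimic the key/value structure shown above:

```r
# Hypothetical helper: rebuild a data.frame from an unpacked key/value pair,
# replacing all-NULL columns (how all-NA columns can come back) with NA vectors.
kv_to_frame <- function(unpacked) {
  cols <- lapply(unpacked$value, function(z) {
    if (is.list(z) && length(z) > 0 && all(vapply(z, is.null, logical(1))))
      rep(NA, length(z))
    else
      unlist(z)
  })
  as.data.frame(setNames(cols, unlist(unpacked$key)))
}

# Fabricated stand-in for the unpacked result:
unpacked <- list(key   = list("mpg", "cyl"),
                 value = list(list(21.0, 21.0, 22.8), list(NULL, NULL, NULL)))
df <- kv_to_frame(unpacked)
stopifnot(nrow(df) == 3, all(is.na(df$cyl)))
```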

But with json-encoded data, whether string or raw, it is always returned as a string.

```r
obj <- jsonlite::toJSON(mtcars[4:6,])
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42")
# [1] "[{\"mpg\":21.4,\"cyl\":6,\"disp\":258,\"hp\":110,\"drat\":3.08,\"wt\":3.215,\"qsec\":19.44,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Hornet 4 Drive\"},{\"mpg\":18.7,\"cyl\":8,\"disp\":360,\"hp\":175,\"drat\":3.15,\"wt\":3.44,\"qsec\":17.02,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"Hornet Sportabout\"},{\"mpg\":18.1,\"cyl\":6,\"disp\":225,\"hp\":105,\"drat\":2.76,\"wt\":3.46,\"qsec\":20.22,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Valiant\"}]"

obj <- charToRaw(jsonlite::toJSON(mtcars[7:9,]))
head(obj)
# [1] 5b 7b 22 6d 70 67
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42")
# [1] "[{\"mpg\":14.3,\"cyl\":8,\"disp\":360,\"hp\":245,\"drat\":3.21,\"wt\":3.57,\"qsec\":15.84,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Duster 360\"},{\"mpg\":24.4,\"cyl\":4,\"disp\":146.7,\"hp\":62,\"drat\":3.69,\"wt\":3.19,\"qsec\":20,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 240D\"},{\"mpg\":22.8,\"cyl\":4,\"disp\":140.8,\"hp\":95,\"drat\":3.92,\"wt\":3.15,\"qsec\":22.9,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 230\"}]"
```

Is there a way to force GET to return the binary object instead of "knowing" it's text and doing that instead?

richfitz commented 1 month ago

Not at present - here's the heuristic: https://github.com/richfitz/redux/blob/master/src/conversions.c#L228-L255

This gets called from a bunch of places as we build a list of redis replies (not everything is as simple as a GET; see further up that file for details - it's hopefully fairly easy to follow). So if it's "just" a case of preventing conversion to a string, then that's easy enough, but working out which fields to do this for remains hard unless you want to do it for all results.

At this point it's an interface issue, and one I don't have a strong idea about. However, it does seem much simpler than trying to do the deserialisation in C!

r2evans commented 1 month ago

I was thinking of something deliberate, such as

```r
R$GET("quux42", as="raw")
```

or

```r
R$GET("quux42", asraw=TRUE)
```

The former allows for future expansion; the latter is single-purpose-simple.

I admit that I don't know offhand which other redis verbs could benefit from this; I'm sure it's "some" at least.
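To illustrate the shape of the first option, a purely hypothetical user-level shim (`get_as` is not part of redux; a real implementation would need to live in the C layer so the value is never converted to a string in the first place):

```r
# Hypothetical shim, illustration only: converting back with charToRaw()
# does not avoid the string pool, it just shows the proposed interface.
get_as <- function(conn, key, as = c("auto", "raw", "string")) {
  as <- match.arg(as)
  value <- conn$GET(key)
  switch(as,
    auto   = value,
    raw    = if (is.raw(value)) value else charToRaw(value),
    string = if (is.character(value)) value else rawToChar(value))
}

# Quick check against a fake connection object:
fake <- list(GET = function(key) "hello")
stopifnot(identical(get_as(fake, "k", "raw"), charToRaw("hello")))
```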

richfitz commented 1 month ago

Can you have a look at #61, which adds control over conversion at the lower-level interface. You should be able to see the real-world performance impact of this in your application, and the description shows how to use GET in this way.

r2evans commented 1 month ago

Okay, three (hopefully useful) points on your branch.

First, defaulting to "raw" is a breaking change for $command, I suspect. I suggest as=NULL defaults to "auto".

Second, R$command is still single-argument:

```r
R <- redux::hiredis(port=16379) # as before
R$command(list("GET", "key"), "raw")
# Error in R$command(list("GET", "key"), "raw") : unused argument ("raw")
R$command
# function (cmd) 
# {
#     redis_command(ptr, cmd)
# }
# <bytecode: 0x6435764515b8>
# <environment: 0x6435870ee9e8>
```

I'm able to get to it directly by using redux:::redis_command and supplying a raw pointer.

```r
R2ptr <- redux:::redis_connect_tcp("localhost", 16379) # is there a better way to get at this from R above?
redux:::redis_command(R2ptr, list("GET", "key"), "raw") # works as documented
```

Third, some quick benchmarks on (for me) representative data: frames that vary from 500-1400 rows with 74 columns (11 string, 1 POSIXt, the remainder int/float).

Functions

```r
R2 <- redux::hiredis(port=16379)
R2ptr <- redux:::redis_connect_tcp("localhost", 16379)

r2frame <- function(key) {
  obj <- R2$GET(key)
  redux::bin_to_object(obj)
}
json2frame <- function(key) {
  obj <- R2$GET(key)
  jsonlite::fromJSON(obj)
}
msgpack2frame <- function(key) {
  obj <- R2$GET(key)
  RcppMsgPack::msgpack_unpack(obj, simplify=TRUE) |>
    lapply(function(z) if (is.list(z) && length(z) > 0 && all(sapply(z, is.null))) rep(NA, length(z)) else z) |>
    as.data.frame()
}
r2frame_raw <- function(key) {
  obj <- redux:::redis_command(R2ptr, list("GET", key), "raw")
  redux::bin_to_object(obj)
}
json2frame_raw <- function(key) {
  obj <- redux:::redis_command(R2ptr, list("GET", key), "raw")
  jsonlite::fromJSON(rawConnection(obj))
}
msgpack2frame_raw <- function(key) {
  obj <- redux:::redis_command(R2ptr, list("GET", key), "raw")
  RcppMsgPack::msgpack_unpack(obj, simplify=TRUE) |>
    # if all(is.na(z)) is true for any column z, then msgpack returns `list(NULL, NULL, ...)`;
    # for all other columns including mixed-NA, it returns vectors;
    # this (hasty) code fixes that
    lapply(function(z) if (is.list(z) && length(z) > 0 && all(sapply(z, is.null))) rep(NA, length(z)) else z) |>
    as.data.frame()
}

bench::mark(
  r = r2frame("r/22/14"),
  r_raw = r2frame_raw("r/22/14"),
  json = json2frame("json/22/14"),
  json_raw = json2frame_raw("json/22/14"),
  msgpack = msgpack2frame("msgpack/22/14"),
  msgpack_raw = msgpack2frame_raw("msgpack/22/14"),
  check = FALSE,
  min_iterations = 10
)
```
```
### redux-1.1.4
# A tibble: 3 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                 time             gc                
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                 <list>           <list>            
1 r          895.53µs   1.02ms     966.     1.62MB     0      484     0      501ms <NULL> <Rprofmem [79 × 3]>    <bench_tm [484]> <tibble [484 × 3]>
2 json         87.2ms  92.95ms      10.8    5.11MB     0       10     0      925ms <NULL> <Rprofmem [3,002 × 3]> <bench_tm [10]>  <tibble [10 × 3]> 
3 msgpack      4.23ms   4.63ms     211.     1.66MB     6.60    96     3      454ms <NULL> <Rprofmem [127 × 3]>   <bench_tm [99]>  <tibble [99 × 3]> 

### redux-1.1.5 # gh-60
# A tibble: 6 × 13
  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                 time             gc                
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                 <list>           <list>            
1 r           914.29µs   1.01ms     973.     1.62MB     0      487     0      501ms <NULL> <Rprofmem [79 × 3]>    <bench_tm [487]> <tibble [487 × 3]>
2 r_raw        868.6µs    1.4ms     747.     1.62MB     0      374     0      501ms <NULL> <Rprofmem [79 × 3]>    <bench_tm [374]> <tibble [374 × 3]>
3 json         84.65ms  89.11ms      11.3    5.11MB     0       10     0      889ms <NULL> <Rprofmem [3,002 × 3]> <bench_tm [10]>  <tibble [10 × 3]> 
4 json_raw     85.36ms  87.02ms      11.5   11.37MB     1.28     9     1      783ms <NULL> <Rprofmem [3,073 × 3]> <bench_tm [10]>  <tibble [10 × 3]> 
5 msgpack       4.15ms   4.34ms     227.     1.66MB     4.16   109     2      480ms <NULL> <Rprofmem [127 × 3]>   <bench_tm [111]> <tibble [111 × 3]>
6 msgpack_raw    4.2ms   4.35ms     225.     1.66MB     6.38   106     3      470ms <NULL> <Rprofmem [127 × 3]>   <bench_tm [109]> <tibble [109 × 3]>
```

For those benchmarks, I am very surprised that the memory consumption of json is better than json_raw. I don't know if that means I'm doing it wrong, if passing huge strings is not crushing me like I feared it would, or if the raw vectors are somehow otherwise problematic similar to the string pool. (I have confirmed that R$GET(.) behaves the same, and the new command(..) method returns a raw object.)

Either way, your change serves as a proof of concept, though until I find out why mem_alloc behaves the way it does for the json variants, I don't know that gh-60 enables anything.

For the record, R-4.3.3 in emacs/ess on ubuntu-24.04 on linux-6.8.0, 64GB of ram.

r2evans commented 1 month ago

I haven't studied your code, so a naive question: is there a chance the underlying code pulls a string into R and then treats it as raw for the user?

richfitz commented 1 month ago
  1. The default is auto - that was a typo in the PR comment, which I've fixed.
  2. There was an uncommitted change which fixes $command(), so that should work as described now and simplify your explorations.
  3. I do like a good benchmark, thanks for investigating. They are often a bit counterintuitive.

> I haven't studied your code, so a naive question: is there a chance the underlying code pulls a string into R and then treats it as raw for the user?

No, not within redux - this is the relevant line: https://github.com/richfitz/redux/blob/5d7211f5b4d4570ab37739cdd19c88ebb364e85b/src/conversions.c#L255

This is the same codepath that actual binary data goes through, and that definitely does not get converted into a string.

r2evans commented 1 month ago

Okay ... then it's one of the other things I don't understand. I'll push for our partners to use msgpack instead; it appears to be fairly good here, with the same memory cost as native-R serialization. In the meantime, I'll keep looking at what I'm doing wrong here with jsonlite.