qsbase / qs

Quick serialization of R objects

Serialization benchmark #37

Closed dipterix closed 4 years ago

dipterix commented 4 years ago

Hi, I made a speed comparison vs the default serialize. I'm not sure whether I did something wrong, but the speeds seem very close. Is there any benefit to using qserialize?

Randomly generated, ~800KB:

[benchmark screenshot]

Randomly generated, ~8MB:

[benchmark screenshot]

I guess the data is randomly generated, so there is no benefit to compressing it?
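
(The comparison I ran was roughly along these lines -- just a sketch, since the exact code is only in the screenshots above; the runif sizes are assumptions chosen to give roughly 800KB and 8MB of doubles:)

library(qs)
library(microbenchmark)

x_small <- runif(1e5)   # ~800KB of doubles
x_large <- runif(1e6)   # ~8MB of doubles

microbenchmark(
  qs_small = qserialize(x_small),
  r_small  = serialize(x_small, NULL),
  qs_large = qserialize(x_large),
  r_large  = serialize(x_large, NULL),
  times = 10
)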

Also, I have several questions

  1. Is there an in-memory multi-threaded serialization method I could use to serialize array-like objects (i.e., split the data and serialize each block automatically)?
  2. If I want to write Rcpp functions that use the qs package, is there an interface I can play with now?

Thanks

traversc commented 4 years ago

Generally, comparing serialization speed without considering output size is not very informative, and it depends on what x is.

Let me demonstrate with an example:

library(qs)
library(microbenchmark)

microbenchmark(
  qs_uncompressed = pryr::object_size(qserialize(runif(1e7), preset = "uncompressed")),
  qs_balanced     = pryr::object_size(qserialize(runif(1e7), preset = "balanced")),
  r_serialize     = pryr::object_size(serialize(runif(1e7), NULL)),
  times = 5)
Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
 qs_uncompressed 449.4853 449.5462 450.4331 450.2453 451.2291 451.6596     5
     qs_balanced 478.3065 526.5699 517.7617 527.1873 527.1885 529.5561     5
     r_serialize 473.5093 485.2225 485.9473 488.8182 489.4592 492.7274     5

Looks like qs_balanced is worse, right? But consider the output size:

> pryr::object_size(qserialize(runif(1e7), preset="uncompressed"))
80 MB
> 
> pryr::object_size(qserialize(runif(1e7), preset="balanced"))
47.4 MB
> 
> pryr::object_size(serialize(runif(1e7), NULL))
80 MB

So serialization speed alone is not an apples-to-apples comparison, because qs_balanced is also doing the extra work of compressing the output.
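
If you want to put the two on a more even footing, one rough way (just a sketch, not something the package provides) is to record both the time taken and the number of bytes written:

library(qs)
x <- runif(1e7)

# time each method and keep its output so we can measure the size
t_qs <- system.time(out_qs <- qserialize(x, preset = "balanced"))[["elapsed"]]
t_r  <- system.time(out_r  <- serialize(x, NULL))[["elapsed"]]

c(qs_bytes = length(out_qs), r_bytes = length(out_r))  # bytes written
c(qs_sec = t_qs, r_sec = t_r)                          # seconds taken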

For your other questions:

1) No support for multi-threaded in-memory serialization yet. (I am working on rewriting the multi-threading routines to make them better.)

2) Yes, thanks to how easy Rcpp makes the process, you can call qs methods within C++ code. There's an example at the bottom of the readme.

dipterix commented 4 years ago

Thanks for the great explanation :) Now I have a better understanding of where the use cases are.

Just tried your Rcpp interface. It was great and clear!

My apologies for asking so many questions that aren't really "issues" with the package. I'm writing a package that requires serializing multiple objects and concatenating them into a single RawVector (think of two raw vectors concatenated into one). Is there any way I can get the result as a std::vector<uint> from qs directly, instead of an Rcpp::RawVector? (I know I can use as<T> to cast the type, but I'm not sure whether that's memory efficient.)

traversc commented 4 years ago

A std::memcpy from RawVector to std::vector should be pretty fast (generally 10+ Gb/s), so I wouldn't worry about that bottlenecking performance. It's not particularly memory efficient, but internally, qserialize writes to a std::vector and memcpy's to a RawVector too.

I'm not sure I see the use case for concatenating two serialized objects together. I can't guarantee that concatenated objects can be de-serialized as-is, so you will want to at least keep track of the sizes of the individual serialized objects.
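
For example, something along these lines works (just a sketch, not an API the package provides -- you track the lengths yourself and slice the raw vector before calling qdeserialize):

library(qs)

r1 <- qserialize(iris)
r2 <- qserialize(mtcars)

combined <- c(r1, r2)                 # one big raw vector
sizes <- c(length(r1), length(r2))    # keep this as your header/meta information

# recover the second object: slice by offset, then deserialize
offset <- sizes[1]
obj2 <- qdeserialize(combined[(offset + 1):(offset + sizes[2])])
identical(obj2, mtcars)               # TRUE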

Alternatively, I would suggest a std::vector<RawVector> if you'd like to keep multiple serialized objects around.

dipterix commented 4 years ago

I see. I saw memcpy somewhere yesterday but wasn't sure whether it was safe to use in my case. Thanks for confirming.

I think I got all the information I need. Thanks :)

Here is what I'm trying to do. The goal is to put multiple R objects into one raw vector, wrapped with header/meta information. The reason for doing this is that one of the objects is quite large (~500MB), and operating on it in R usually results in copying the object or triggering gc(), which is really time-consuming.

The native R (un)serialize functions can write directly to, and read directly from, a raw connection. That's quite convenient: if I serialize two objects back to back, unserializing from the start still gives me the first object, and if I seek to the beginning of the second object, unserialize returns the second one.

conn <- rawConnection(raw(0), 'r+a')
serialize(123, conn) # 39 bytes
serialize(234, conn) # also 39 bytes

seek(conn, 0, 'start')
unserialize(conn)  # 123

seek(conn, 39, 'start')
unserialize(conn)  # 234