Closed: dipterix closed this issue 4 years ago
Generally, comparing serialization speed without considering output size is not very informative, and it depends on what x is.
Let me demonstrate with an example:
library(qs)
library(microbenchmark)
microbenchmark(
  qs_uncompressed = pryr::object_size(qserialize(runif(1e7), preset = "uncompressed")),
  qs_balanced = pryr::object_size(qserialize(runif(1e7), preset = "balanced")),
  r_serialize = pryr::object_size(serialize(runif(1e7), NULL)),
  times = 5)
Unit: milliseconds
expr min lq mean median uq max neval
qs_uncompressed 449.4853 449.5462 450.4331 450.2453 451.2291 451.6596 5
qs_balanced 478.3065 526.5699 517.7617 527.1873 527.1885 529.5561 5
r_serialize 473.5093 485.2225 485.9473 488.8182 489.4592 492.7274 5
Looks like qs_balanced is worse, right? But consider the output size:
> pryr::object_size(qserialize(runif(1e7), preset="uncompressed"))
80 MB
>
> pryr::object_size(qserialize(runif(1e7), preset="balanced"))
47.4 MB
>
> pryr::object_size(serialize(runif(1e7), NULL))
80 MB
So serialization speed alone is not an apples-to-apples comparison, because qs_balanced is also compressing the output.
For your other questions:
1) No support for multi-threading in-memory yet. (I am working on rewriting the multi-threading routines to improve them.)
2) Yes, thanks to how easy Rcpp makes the process, you can call qs methods within C++ code. There's an example at the bottom of the readme.
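For reference, a minimal sketch of what calling qs from C++ can look like. The entry point c_qsave and its argument list here are assumptions recalled from the readme; the readme is the authoritative source, so verify the names and signature there.
// [[Rcpp::depends(qs)]]
#include <Rcpp.h>
#include <qs.h>  // generated header exposing qs entry points to C++

// [[Rcpp::export]]
void save_with_qs(SEXP x, std::string file) {
  // Assumed signature: object, file, preset, algorithm, compress_level,
  // shuffle_control, check_hash, nthreads; verify against the readme.
  qs::c_qsave(x, file, "high", "zstd", 4, 15, true, 1);
}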
Thanks for the great explanation :) Now I have a better understanding of where the use cases are.
Just tried your Rcpp interfaces. They were great and clear!
My apologies for asking so many questions that are not really "issues" with the package.
I'm writing a package that requires serializing multiple objects and concatenating them into raw vectors (think of two raw vectors concatenated into one). Is there any way that I can get results as std::vector<uint> instead of Rcpp::RawVector from qs directly? (I know you can use as<T> to cast the type, but I'm not sure if that's memory efficient.)
A std::memcpy from a RawVector to a std::vector should be pretty fast (generally 10+ GB/s), so I wouldn't worry about that bottlenecking performance. It's not particularly memory efficient, but internally, qserialize writes to a std::vector and memcpy's to a RawVector too.
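To make that concrete, here is a minimal sketch of such a copy (the function name is purely illustrative):
#include <Rcpp.h>
#include <cstring>
#include <vector>

// Copy a RawVector into a std::vector<unsigned char> with one bulk memcpy.
// [[Rcpp::export]]
std::vector<unsigned char> raw_to_std(Rcpp::RawVector x) {
  std::vector<unsigned char> out(x.size());
  if (x.size() > 0)
    std::memcpy(out.data(), RAW(x), x.size());
  return out;
}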
I'm not sure I see the use case for concatenating two serialized objects together. I don't guarantee that objects concatenated together can be de-serialized, so you will want to at least keep information on the sizes of the serialized objects (one way to do that is sketched below). Alternatively, I would suggest a std::vector<RawVector> or a list of RawVectors if you'd like to keep around multiple serialized objects.
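If you do concatenate, one way to keep those sizes around is an 8-byte length prefix in front of each serialized blob. A sketch (the helper name is illustrative, not part of qs):
#include <cstdint>
#include <cstring>
#include <vector>

// Append one serialized buffer to "out", preceded by an 8-byte size header,
// so the boundary between objects can be recovered later.
void append_with_header(std::vector<unsigned char>& out,
                        const std::vector<unsigned char>& buf) {
  uint64_t n = buf.size();
  unsigned char hdr[sizeof(n)];
  std::memcpy(hdr, &n, sizeof(n));  // native-endian length prefix
  out.insert(out.end(), hdr, hdr + sizeof(n));
  out.insert(out.end(), buf.begin(), buf.end());
}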
I see. I saw memcpy somewhere yesterday, but wasn't sure if it was safe to use in my case. Thanks for confirming it.
I think I got all the information I need. Thanks :)
Here is what I'm trying to do. The goal is to put multiple R objects into one raw vector, wrapped with header/meta information. The reason for doing this is that one object is too large (~500 MB), and operations in R usually trigger copies of the object or gc(), which is really time-consuming.
Native R (un)serialize functions provide ways to serialize directly into/from a raw connection. It's quite convenient: if I concatenate two objects, I can still unserialize the result, which gives me the first object; and if I seek to the beginning of the second object, unserialize returns the second one.
conn <- rawConnection(raw(0), 'r+a')
serialize(123, conn) # 39 bytes
serialize(234, conn) # also 39 bytes
seek(conn, 0, 'start')
unserialize(conn) # 123
seek(conn, 39, 'start')
unserialize(conn) # 234
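For completeness, recovering the individual blobs from a buffer packed with the length-prefix scheme sketched earlier could look like this (again illustrative, not a qs API):
#include <cstdint>
#include <cstring>
#include <vector>

// Split a buffer of [8-byte size][payload] records back into its payloads.
std::vector<std::vector<unsigned char>> split_buffers(
    const std::vector<unsigned char>& packed) {
  std::vector<std::vector<unsigned char>> parts;
  size_t pos = 0;
  while (pos + sizeof(uint64_t) <= packed.size()) {
    uint64_t n;
    std::memcpy(&n, packed.data() + pos, sizeof(n));  // read length prefix
    pos += sizeof(n);
    if (pos + n > packed.size()) break;  // truncated input; stop
    parts.emplace_back(packed.begin() + pos, packed.begin() + pos + n);
    pos += n;
  }
  return parts;
}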
Hi, I made a speed comparison vs the default serialize. I'm not sure where I went wrong, but the speeds seem very close. Is there any benefit to using qserialize?
[Benchmark screenshots: randomly generated data, ~800 KB and ~8 MB]
I guess it's because the data is randomly generated, so there is no benefit from compressing it?
Also, I have several questions:
1) Does qs support multi-threaded in-memory serialization?
2) If I want to call qs within C++ code in my package, is there any interface that I can play with now?
Thanks