qsbase / qs

Quick serialization of R objects

Writing / reading to / from file descriptor or memory directly #12

Closed gaborcsardi closed 4 years ago

gaborcsardi commented 5 years ago

Do you think it would be possible to add support for this? It would be great to be able to use a pipe/socket and also memory directly.

traversc commented 5 years ago

Yep, I was looking at pipes for the next version :)

gaborcsardi commented 5 years ago

Great! For my use case the best would be to be able to write to a file descriptor, or HANDLE on Windows.

On Wed, 10 Jul 2019, 19:05 Travers, notifications@github.com wrote:

Yep, I was looking at pipes for the next version :)

traversc commented 5 years ago

I've added two new functions, qsave_pipe and qread_pipe for writing to file descriptors or R connections.

Writing to R connections is normally not allowed by CRAN (see e.g. https://github.com/tidyverse/readr/issues/856#issuecomment-391787058), but it can be enabled when compiling.

I'm going to test it out a bit more and then submit to CRAN.

gaborcsardi commented 5 years ago

Thanks! Unfortunately for my use case R connections are not very good, just a simple Unix fd or a Windows HANDLE would be much better.

traversc commented 5 years ago

I have this set up in two ways -- one way using R connections, the other way using FILE pointers created by popen (declared in stdio.h / <cstdio> on POSIX systems). So for example, you could do this:

> qsave_pipe(1:10, "cat > C:/temp.qc") # cat.exe comes from Rtools installation
> qread_pipe("cat C:/temp.qc")
 [1]  1  2  3  4  5  6  7  8  9 10

On the C++ side, this looks something like this:

  std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(scon.c_str(), "wb"), pclose);
  if (!pipe) {
    throw std::runtime_error("popen() failed!");
  }
  FILE * con = pipe.get();
  fwrite(data, 1, length, con);
 ...

Is that what you had in mind? Working with Windows handles or even Unix fds (which I understand are wrapped by FILE * pointers) is a bit beyond my current expertise, and Google isn't being particularly helpful. But I am happy to learn if you could give some tips or pointers on implementation.

gaborcsardi commented 5 years ago

Thanks! Well, almost. :) FILE * is still too difficult, it has its own buffering, etc.

The best for us would be file descriptors, i.e. the integers returned by open() on Unix, and the HANDLE returned by CreateFile() on Windows. You would probably put these into an external pointer, to be able to handle them the same way on both platforms.

Then we could use mmap() on Unix and MapViewOfFile() on Windows to serialize an R object into shared memory, and this would really speed up sharing data between processes.
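For readers following along, here is a minimal sketch of what unbuffered writing to a Unix file descriptor looks like, in contrast to going through a FILE *. It is POSIX-only and does not cover the Windows HANDLE case. The helper name write_all is hypothetical and is not part of qs.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper (not part of qs): write `len` bytes to a raw POSIX
// file descriptor. write() may accept fewer bytes than requested, so loop
// until everything is written. Returns true on success, false on error.
bool write_all(int fd, const char* data, std::size_t len) {
    assert(fd >= 0);
    std::size_t done = 0;
    while (done < len) {
        ssize_t n = ::write(fd, data + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return false;                  // any other error: give up
        }
        done += static_cast<std::size_t>(n);
    }
    return true;
}
```

There is no user-space buffer here, so once write_all returns true the bytes have been handed to the kernel, which is the property that makes a plain fd easier to share between processes than a FILE *.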

traversc commented 5 years ago

Hi @gaborcsardi, I have a short toy example using file descriptors:

*nix version: https://gist.github.com/traversc/e04911a86c8d581b058815d4aa7e7366 Windows version: https://gist.github.com/traversc/b531a4932e87cca2aa324c6a015c80a4

Do you mind looking it over and seeing if it's what you had in mind?

Some questions for you:

Since we can use file descriptors on both Windows and Unix-like systems, which would simplify things, do you think there is still a need to use a Windows HANDLE? The Windows version worked with surprisingly little modification.

I'm still not quite clear how mmap would come into play. Could you elaborate an example of how you would use it?

gaborcsardi commented 5 years ago

That's a good start! Unfortunately I don't think we can use the integer file descriptors on Windows, not everything is a file on Windows, and e.g. the shared memory handles will not work. But I am actually not completely sure about this.

Re. mmap, we will do this:

  1. open a temp file for writing (or CreateFileMappingA() on Windows) to get an fd
  2. delete the file
  3. resize the file to the "correct size"
  4. call mmap() on the fd to create a memory area in shared memory
  5. copy the data we want to share to shared memory, e.g. serialize into the fd.

Then we pass the fd to subprocesses, and they do an mmap on it as well, and unserialize.
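The steps above can be sketched for the Unix side roughly as follows. This is an illustration under assumptions, not the processx or qs implementation: the function names (make_shared_fd, map_shared, share_bytes) and the temp file template are hypothetical, and the Windows path via CreateFileMappingA() / MapViewOfFile() is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: steps 1-3. Creates an unlinked temp file of `size`
// bytes and returns its fd, or -1 on failure.
int make_shared_fd(std::size_t size) {
    char path[] = "/tmp/shm-sketch-XXXXXX";  // assumed location, for illustration
    int fd = mkstemp(path);                  // 1. open a temp file (read/write)
    if (fd < 0) return -1;
    unlink(path);                            // 2. delete it; the open fd keeps it alive
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {  // 3. resize
        close(fd);
        return -1;
    }
    return fd;
}

// Step 4. MAP_SHARED makes writes visible through every mapping of the same fd.
// Returns nullptr on failure.
void* map_shared(int fd, std::size_t size) {
    assert(fd >= 0 && size > 0);
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Step 5. Copy already-serialized bytes into the shared mapping.
void* share_bytes(int fd, const void* data, std::size_t size) {
    void* p = map_shared(fd, size);
    if (p != nullptr) std::memcpy(p, data, size);
    return p;
}
```

A subprocess that inherits the fd would call map_shared(fd, size) with the same size and read (or unserialize) from the returned pointer. Each mapping should be released with munmap() and the fd with close() when done.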

Some bits of this are in https://github.com/r-lib/processx/pull/201 but it still needs quite some rewriting. It has something like a serialization that only works for a list of atomic, non-character vectors. But it does have the advantage that the subprocesses do not need to unserialize; they can create the objects "within" the serialized data. This is something we probably lose with a proper serialization, unless we design a serialization format that explicitly supports it.

traversc commented 5 years ago

Hi @gaborcsardi, I think I've put together all the requests in the latest commit. I had to do a bunch of re-factoring to use templates instead of assuming std::fstream.

I have the following new functions:

I also have the following helper functions:

qsave and variants also now return invisibly the number of bytes written (as a double; an int is too small for large data)

Here are some examples:

Data:

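library(qs) # provides the starnames dataset used below
n <- 5e6
data <- data.frame(a=rnorm(n),
                   b=rpois(n,100), # n draws with mean 100 (arguments were swapped)
                   c=sample(starnames$IAU,n,TRUE),
                   d=sample(state.name,n,TRUE),
                   stringsAsFactors = FALSE)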

On Linux/Mac:

library(qs)
fd <- qs:::openFd("/tmp/test.z", "wr")
unlink("/tmp/test.z")
length <- qsave_fd(data, fd, preset = "high")
mptr <- qs:::openMmap(fd, length)
data2 <- qread_ptr(mptr, length)
qs:::closeMmap(mptr, length)
qs:::closeFd(fd)
identical(data, data2)

On Windows:

fh <- qs:::openHandle("N:/test.z", "wr")
unlink("N:/test.z")
length <- qsave_handle(data, fh, preset = "high")
fmh <- qs:::openWinFileMapping(fh, length)
ptr <- qs:::openWinMapView(fmh, length)
data2 <- qread_ptr(ptr, length)
qs:::closeWinMapView(ptr)
qs:::closeHandle(fmh)
qs:::closeHandle(fh)
identical(data, data2)

Serialize to raw vector:

qd <- qserialize(data)
data2 <- qdeserialize(qd)
identical(data, data2)

Anyway, lmk what you think. Thanks.

gaborcsardi commented 5 years ago

Awesome! Thanks for doing this. I'll take a good look very soon, sorry for the delay.

artemklevtsov commented 4 years ago

Can I use this feature (qsave_fd) to append data? I think not, but I want to clarify.

traversc commented 4 years ago

@artemklevtsov Technically, yes. You would have to open a file descriptor in append mode:

https://stackoverflow.com/questions/7136416/opening-file-in-append-mode-using-open-api

But I don't recommend doing this, as I don't guarantee being able to correctly deserialize data if there are extra bytes at the end of a file.
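For reference, opening a descriptor in append mode on a Unix-like system looks roughly like this. It is only a sketch of the open() call itself; the helper name open_append is hypothetical and not part of qs, and the caveat above about deserializing appended data still applies.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not part of qs): open `path` for writing so that
// every write() lands at the current end of the file. The file is created
// with mode 0644 if it does not exist. Returns the fd, or -1 on failure.
int open_append(const char* path) {
    assert(path != nullptr);
    return ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}
```

With O_APPEND the kernel moves the file offset to the end before each write, so existing bytes are never overwritten even if the descriptor is reopened later.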

artemklevtsov commented 4 years ago

@traversc thank you for the explanation. Do you have any plans to add a feature like that? I'm looking for an alternative to data.table::fwrite with append for fetching data in bulk.

traversc commented 4 years ago

@artemklevtsov No plans for that feature as the format isn't set up for that, sorry.

I know that with the fst package you can do that with data.frames, and I definitely support using fst for that purpose.

Alternatively, you can save two separate data.frame objects and use rbind after reading. That should also be pretty fast.