Closed gaborcsardi closed 5 years ago
Yep, I was looking at pipes for the next version :)
Great! For my use case the best would be to be able to write to a file descriptor, or HANDLE on Windows.
I've added two new functions, qsave_pipe and qread_pipe, for writing to file descriptors or R connections. (Writing to R connections is normally disallowed by CRAN, e.g. https://github.com/tidyverse/readr/issues/856#issuecomment-391787058, but can be enabled when compiling.) I'm going to test it out a bit more and submit to CRAN.
Thanks! Unfortunately for my use case R connections are not very good, just a simple Unix fd or a Windows HANDLE would be much better.
I have this set up in two ways: one using R connections, the other using FILE pointers created by popen() from stdio.h. So for example, you could do this:
> qsave_pipe(1:10, "cat > C:/temp.qc") # cat.exe comes from Rtools installation
> qread_pipe("cat C:/temp.qc")
[1] 1 2 3 4 5 6 7 8 9 10
On the C++ side, this looks something like this:
// open the shell command as a writable pipe; pclose runs when the unique_ptr is destroyed
std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(scon.c_str(), "wb"), pclose);
if (!pipe) {
  throw std::runtime_error("popen() failed!");
}
// stream the serialized bytes into the pipe's write end
FILE * con = pipe.get();
fwrite(data, 1, length, con);
...
Is that what you had in mind? Working with Windows handles or even Unix fds (which I understand are wrapped by FILE * pointers) is a bit beyond my current expertise, and Google isn't being particularly helpful. But I am happy to learn if you could give some tips or pointers on implementation.
Thanks! Well, almost. :) FILE * is still too difficult; it has its own buffering, etc. The best for us would be file descriptors, i.e. the integers returned by open() on Unix, and the HANDLE returned by CreateFile() on Windows. You would probably put these into an external pointer, to be able to handle them the same way on both platforms.
Then we could use mmap() on Unix and MapViewOfFile() on Windows to serialize an R object into shared memory, and this would really speed up sharing data between processes.
Hi @gaborcsardi, I have a short toy example using file descriptors:
*nix version: https://gist.github.com/traversc/e04911a86c8d581b058815d4aa7e7366
Windows version: https://gist.github.com/traversc/b531a4932e87cca2aa324c6a015c80a4
Do you mind looking it over and seeing if it's what you had in mind?
Some questions for you:
Since we can use file descriptors on both Windows and Unix-like systems, which would simplify things, do you think there is still a need to use a Windows HANDLE? The Windows version worked with surprisingly little modification.
I'm still not quite clear how mmap would come into play. Could you elaborate with an example of how you would use it?
That's a good start! Unfortunately I don't think we can use the integer file descriptors on Windows, not everything is a file on Windows, and e.g. the shared memory handles will not work. But I am actually not completely sure about this.
Re. mmap, we will do this:
- Get an fd (CreateFileMappingA() on Windows)
- mmap() on the fd to create a memory area in shared memory
- Then we pass the fd to subprocesses, and they do an mmap on it as well, and unserialize.
Some bits of this are in https://github.com/r-lib/processx/pull/201, but it still needs quite a rewrite. It has something like a serialization that only works for a list of atomic, non-character vectors. But it does have the advantage that the subprocesses do not need to unserialize; they can create the objects "within" the serialized data. This is something we probably lose with a proper serialization, unless we design a serialization format that explicitly supports it.
Hi @gaborcsardi, I think I've put together all the requests in the latest commit. I had to do a bunch of refactoring to use templates instead of assuming std::fstream.
I have the following new functions:
I also have the following helper functions:
- open
- CreateFileA (Windows)
- MapViewOfFile (Windows)

qsave and variants also now return invisibly the number of bytes written (as a double; an int is too small for large data).
Here are some examples:
Data:
n <- 5e6
data <- data.frame(a = rnorm(n),
                   b = rpois(n, 100),
                   c = sample(starnames$IAU, n, TRUE),
                   d = sample(state.name, n, TRUE),
                   stringsAsFactors = FALSE)
On Linux/Mac:
library(qs)
fd <- qs:::openFd("/tmp/test.z", "wr")
unlink("/tmp/test.z")
length <- qsave_fd(data, fd, preset = "high")
mptr <- qs:::openMmap(fd, length)
data2 <- qread_ptr(mptr, length)
qs:::closeMmap(mptr, length)
qs:::closeFd(fd)
identical(data, data2)
On Windows:
fh <- qs:::openHandle("N:/test.z", "wr")
unlink("N:/test.z")
length <- qsave_handle(data, fh, preset = "high")
fmh <- qs:::openWinFileMapping(fh, length)
ptr <- qs:::openWinMapView(fmh, length)
data2 <- qread_ptr(ptr, length)
qs:::closeWinMapView(ptr)
qs:::closeHandle(fmh)
qs:::closeHandle(fh)
identical(data, data2)
Serialize to raw vector:
qd <- qserialize(data)
data2 <- qdeserialize(qd)
identical(data, data2)
Anyway, lmk what you think. Thanks.
Awesome! Thanks for doing this. I'll take a good look very soon, sorry for the delay.
Can I use these features (qsave_fd) to append data? I think not, but I want to clarify.
@artemklevtsov Technically, yes. You would have to open a file descriptor in append mode:
https://stackoverflow.com/questions/7136416/opening-file-in-append-mode-using-open-api
But I don't recommend doing this, as I don't guarantee being able to correctly deserialize data if there are extra bytes at the end of a file.
@traversc thank you for the explanation. Do you have any plans to add a feature like that? I'm looking for an alternative to data.table::fwrite with append for writing bulk data.
@artemklevtsov No plans for that feature as the format isn't set up for that, sorry.
I know that with the fst package you can do that with data.frames, and I definitely support using fst for that purpose.
Alternatively, you can save two separate data.frame objects and use rbind after reading. That should also be pretty fast.
Do you think it would be possible to add support for this? It would be great to be able to use a pipe/socket and also memory directly.