irods / irods_client_library_rirods

rirods R Package
https://rirods.irods4r.org

iput/iget should not write temp files to disk before streaming to iRODS #28

Closed MartinSchobben closed 1 year ago

MartinSchobben commented 1 year ago

I think the solution is to make a connection to iRODS and then stream from memory to the final destination. I do not know what this looks like with a connection to iRODS, but locally it could look like this:

# test object
x <- matrix(1:100, 10, 10)

# serialize the R object in memory to a raw vector (connection = NULL)
y <- serialize(x, connection = NULL)
# length of the object, used for chunking
size_y <- length(y)
# size of the first chunk (integer division, so odd lengths also work)
half <- size_y %/% 2
# make a file -> this should then be an object on iRODS
fil <- tempfile()
# this is an R connection (IO stream object), opened for binary writing
# -> this should become a connection to iRODS REST
tmp <- file(fil, "wb")
# chunk 1
writeBin(y[1:half], tmp)
# chunk 2
writeBin(y[(half + 1):size_y], tmp)
# close the connection
close(tmp)

# open connection -> this should become a connection to iRODS REST
con <- file(fil, "rb")
# read object -> back to memory
# chunk 1 (`fil` would work as well but I use a connection here as it is
# closer to the iRODS REST situation)
x1 <- readBin(con, raw(), n = half)
# chunk 2 (the remaining bytes)
x2 <- readBin(con, raw(), n = size_y - half)
# close the connection
close(con)
# fuse chunks
z <- c(x1, x2)
# check if complete
identical(z, y)
# unserialize
unserialize(z)

korydraughn commented 1 year ago

Yes, that is how istream works. It reads bytes from stdin and sends chunks (via an in-memory buffer) to the iRODS server. The downside is that the length of the input stream is unknown. That means istream does not support parallel transfer.

Because you're using the REST API, you won't have parallel transfer available either. Just something to keep in mind.
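
The chunked, unknown-length pattern described above can be sketched in R roughly as follows. This is only an illustration of the idea, not how istream or rirods is implemented: `stream_chunks` and the `send_chunk` callback are hypothetical names, a `rawConnection` stands in for stdin, and a plain list stands in for the iRODS server.

```r
# Hypothetical helper: read `con` in chunks of at most `chunk_size` bytes and
# pass each chunk to `send_chunk` until the stream is exhausted. The total
# length is not known up front; it is only discovered when a read returns
# zero bytes.
stream_chunks <- function(con, send_chunk, chunk_size = 4096L) {
  total <- 0
  repeat {
    buf <- readBin(con, what = raw(), n = chunk_size)
    if (length(buf) == 0L) break  # zero bytes read means end of stream
    send_chunk(buf)
    total <- total + length(buf)
  }
  total
}

# Stand-in for stdin: an in-memory connection over a serialized object
payload <- serialize(matrix(1:100, 10, 10), connection = NULL)
src <- rawConnection(payload, "rb")

# Stand-in for the server: collect the received chunks in a list
received <- list()
n_sent <- stream_chunks(
  src,
  function(buf) received[[length(received) + 1L]] <<- buf,
  chunk_size = 64L
)
close(src)

# Reassemble the chunks in arrival order
result <- do.call(c, received)
```

Because the chunks can only be produced one after another as the stream is consumed, there is no way to hand out byte ranges to several workers in advance, which is why this pattern is serial.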

MartinSchobben commented 1 year ago

Yes, I guess we have to live with that. But the implementation of the serial approach on the R side is poor at the moment, as I designed it to write to disk first and then send the file to the REST API. The implementation shown above might avoid the temporary file. I placed it here to remind myself that I should look into this.
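
As a rough sketch of what avoiding the disk round trip could look like, the serialized object can be split into an arbitrary number of chunks entirely in memory. Everything here is an assumption for illustration: `chunk_raw` is a hypothetical helper, the chunk size is arbitrary, and a `rawConnection` stands in for the remote iRODS object, since the actual REST calls are not shown in this thread.

```r
# Hypothetical helper: split a raw vector into consecutive chunks of at most
# `chunk_size` bytes, recording the zero-based byte offset of each chunk.
chunk_raw <- function(bytes, chunk_size) {
  if (length(bytes) == 0L) return(list())
  starts <- seq(1, length(bytes), by = chunk_size)
  lapply(starts, function(s) {
    e <- min(s + chunk_size - 1, length(bytes))
    list(offset = s - 1, data = bytes[s:e])
  })
}

# Serialize the object straight to memory; no temp file is created
x <- matrix(1:100, 10, 10)
y <- serialize(x, connection = NULL)
chunks <- chunk_raw(y, chunk_size = 100)

# In-memory sink standing in for the remote object
sink_con <- rawConnection(raw(0), "wb")
for (ch in chunks) writeBin(ch$data, sink_con)
z <- rawConnectionValue(sink_con)
close(sink_con)

# The reassembled bytes should round-trip to the original object
identical(unserialize(z), x)
```

Keeping the offset next to each chunk is a design choice that would let a sender tell the server where each piece belongs, if the target API accepts an offset; whether and how the iRODS REST API does so is not established here.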