Closed: MartinSchobben closed this 1 year ago
nothing jumps out - seems good to me. if tests are passing and you're happy, i'm good.
Can we maybe run it by Mariana. I could not add reviewers besides iRODS members.
oh! let's see if we can get her into the list... sent an invite.
Yes, that was also on my mind. But I have to read large files (e.g. CSV) into memory anyway to be able to chunk them before sending them. So saving the intermediate to disk again was probably not the most efficient approach.
Naive question... can you read the file in chunks and send the chunks directly? and not have to hold the entire large file in memory all at once?
of course, if it's already in memory... it doesn't matter...
Yes, that would be best: iterate over a connection. But I have not yet found out how that works.
quick search finds this... which is not exactly about sending over a network, but... approaches seem related. https://inbo.github.io/tutorials/tutorials/r_large_data_files_handling/
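To make that idea concrete, iterating over a connection in base R can look like the sketch below. This is hypothetical illustration code, not rirods code: `send` is a placeholder for whatever performs the actual HTTP request, and only one chunk of at most `chunk_size` bytes is ever held in memory at a time.

```r
# Hedged sketch: stream a file to some sender in fixed-size chunks,
# without ever loading the whole file into memory.
send_in_chunks <- function(path, send, chunk_size = 2^15) {
  con <- file(path, open = "rb")   # open a binary read connection
  on.exit(close(con))              # always close the connection, even on error
  n_sent <- 0L
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break # end of file reached
    send(chunk)                    # e.g. one HTTP request per chunk
    n_sent <- n_sent + 1L
  }
  invisible(n_sent)
}
```

In a client like rirods, `send` would presumably wrap the REST call that uploads one chunk of the data object.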
It might indeed not be that hard. It requires some extensions. What we can probably say is that this PR is a step forward. In the next PR I will then include `disk_to_irods()` in the back end of `iput()`.
I reconsidered: maybe let's hold the merge for a bit. The change needed to make it perfect might be really minor.
It really does not seem that hard to implement. The following should now happen:
iput # disk -> irods
isaveRDS # memory -> irods
iget # irods -> disk
ireadRDS # irods -> memory
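As a rough illustration of the `iget` direction (irods -> disk), the receiving side can mirror the sending side: write each incoming raw chunk straight to a file connection, so the whole object never has to sit in memory. This is a hedged sketch, not the rirods implementation; `receive_chunk` is a hypothetical placeholder for the REST call that returns the next chunk of raw bytes (and `NULL` when the object is exhausted).

```r
# Hedged sketch: stream chunks from some receiver straight to disk.
irods_to_disk <- function(receive_chunk, dest) {
  con <- file(dest, open = "wb")   # open a binary write connection
  on.exit(close(con))              # always close the connection
  repeat {
    chunk <- receive_chunk()       # placeholder for one REST download request
    if (is.null(chunk) || length(chunk) == 0L) break
    writeBin(chunk, con)           # append the chunk to the file on disk
  }
  invisible(dest)
}
```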
very nice. is it faster than going to the temporary disk? can you measure before/after for a large file?
I'll see to it that I run some benchmarks. I presume that avoiding the move from disk to memory as an intermediate step is beneficial anyway in terms of available RAM on an average laptop or PC.
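For a concrete before/after comparison, a minimal base-R timing sketch could look like the following. The real benchmarks live in `inst/bench`; the file size, chunk size, and helper names here are illustrative assumptions only.

```r
# Hedged sketch: compare reading a file whole versus in chunks.
path <- tempfile()
writeBin(as.raw(sample(0:255, 1e6, replace = TRUE)), path)

# One big allocation: the entire file in memory at once.
read_whole <- function(p) readBin(p, what = "raw", n = file.size(p))

# Many small allocations, but memory use bounded by 'chunk' bytes.
read_chunks <- function(p, chunk = 2^15) {
  con <- file(p, open = "rb")
  on.exit(close(con))
  total <- 0L
  repeat {
    piece <- readBin(con, what = "raw", n = chunk)
    if (length(piece) == 0L) break
    total <- total + length(piece)
  }
  total
}

system.time(read_whole(path))   # time for the single big read
system.time(read_chunks(path))  # time for the chunked read
```

A dedicated benchmarking package that also tracks memory allocations would give a fuller picture than `system.time()` alone.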
Puzzling results. My first fix made `iput()` better, as expected. My last iteration, without an intermediate step of using an R object, makes everything (`iget()` and `iput()`) worse than it used to be in the main branch.
See results (n = 100) in the README here: https://github.com/MartinSchobben/irods_client_library_rirods/tree/dev/inst/bench
Could be because reading the entire file at once is faster than going back for chunks? Try with different file sizes to see if that has an effect.
I will try the different sizes for `iput()`, but it seems that I broke `iget()` in my last commit. I need to analyse `iget()` some more, as I do not directly see why it failed.
So, is `disk -> memory -> iRODS` faster than `disk -> memory -> disk -> iRODS`, but `disk -> iRODS` is somewhat slower for larger files? Or am I misunderstanding?
So, I figured out that the commit `disk -> memory -> iRODS` (which is not the head of my `dev` branch) contains an error. Those results are not reliable, so instead I now focus on the last commit (`disk -> iRODS`), as this seems to be the most correct way anyway. On top of that, the results show that it has almost the same (a bit worse) performance as the irods main branch (`disk -> memory -> disk -> iRODS`): https://github.com/MartinSchobben/irods_client_library_rirods/tree/dev/inst/bench. I also see some room for improvement by changing some parameters of the connection (`file`) arguments.
Furthermore, I need better unit tests, as some things are still slipping through, as this experience has taught me.
What do you think of the current changes in terms of performance? These updates are indeed good for memory usage, although less improvement can be gained in execution time. I also updated the tests, so I am now sure that this works.
https://github.com/MartinSchobben/irods_client_library_rirods/tree/dev/inst/bench
Large file sizes are still quite horrible, as the transfers take ages to execute. Am I right in assuming that the REST API buffer has now been increased to about 8 ~~Gb~~ Mb? That would make a huge difference.
I'm having a tough time understanding the benchmark results.
Am I seeing GBs of memory allocated to put or get a 6MB file? And it's taking 11 minutes?
What is `max_memory`?
And yes, the REST API is probably going to get a lot smarter... and maybe even a rewrite as a new/separate HTTP API (dropping the confusing 'REST' title)... to be determined...
I updated some of the information in the table, figures, and text. But indeed, in the current irods main branch `iget` can consume a lot of memory; this is fixed in this PR.
Do note that the current example of a 6 Mb file can take thousands of iterations because of the current small buffer size for requests. When this is increased to 8 Mb, this problem is also circumvented.
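The arithmetic behind those iteration counts is simple to sketch. The small buffer size below is an assumed illustrative value, not the actual REST API default; the point is only how the request count scales with the buffer.

```r
# Illustrative request counts for transferring one file chunk by chunk.
file_size <- 6 * 2^20   # a 6 Mb file, in bytes
small_buf <- 4 * 2^10   # assumed small request buffer of 4 Kb (illustrative)
large_buf <- 8 * 2^20   # the proposed 8 Mb buffer

ceiling(file_size / small_buf)  # 1536 requests with the small buffer
ceiling(file_size / large_buf)  # 1 request with the 8 Mb buffer
```

With per-request HTTP overhead multiplied by the request count, it is clear why a larger buffer would dominate the timing results.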
Got it. Yes a bigger REST buffer is needed and that overhead would dominate the results.
Finally, all seems to be as good as it gets. For completeness I have added benchmarks for `isaveRDS()` and `ireadRDS()`. However, this is another story, as the serialization of R objects can be very efficient. In general this results in super-fast execution times for those two functions. That is good, as those functions are important, whereas I still regard `iget()` and `iput()` as of secondary importance for the general R workflow. But when the buffer size of the iRODS REST API increases, those should also give good execution times up to several Mb in file size, while memory usage is now fixed by iterating directly over a connection to the file on disk.
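Why the serialization route can be so efficient is visible with base R alone. This is a generic sketch, not the rirods implementation: `serialize()` turns an object into a raw vector entirely in memory, which can then be chunked and sent without ever touching disk.

```r
# Hedged sketch: the memory -> raw bytes -> memory round trip that
# functions like isaveRDS()/ireadRDS() can build on.
obj     <- list(x = 1:10, y = letters)        # any R object
raw_vec <- serialize(obj, connection = NULL)  # memory -> raw vector, no disk
length(raw_vec)                               # size in bytes of the serialized object
identical(unserialize(raw_vec), obj)          # raw vector -> memory; TRUE
```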
I have added a GitHub Actions workflow that runs the benchmarks every time `R/data-objects.R` changes, so we can keep track if we make changes to the respective functions.
merging
This is a performance fix: `iput` and `isaveRDS` now stream their data directly (from disk and memory, respectively), where in the former implementation this happened with an intermediate save to disk.