Closed: MartinSchobben closed this 1 year ago
nothing jumps out - seems good to me. if tests are passing and you're happy, i'm good.
Can we maybe run it by Mariana. I could not add reviewers besides iRODS members.
oh! let's see if we can get her into the list... sent an invite.
Yes, that was also on my mind. But I have to read large files (e.g. CSV) into memory anyway to be able to chunk them before sending them. So saving the intermediate to disk again was probably not the most efficient approach.
Naive question... can you read the file in chunks and send the chunks directly? and not have to hold the entire large file in memory all at once?
of course, if it's already in memory... it doesn't matter...
Yes, that would be best: iterate over a connection. But I have not yet found out how that works.
quick search finds this... which is not exactly about sending over a network, but... approaches seem related. https://inbo.github.io/tutorials/tutorials/r_large_data_files_handling/
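To make that idea concrete, iterating over a connection in base R can look like the sketch below. This is hypothetical illustration code, not rirods code: `send` is a placeholder for whatever performs the actual HTTP request, and only one chunk of at most `chunk_size` bytes is ever held in memory at a time.

```r
# Hedged sketch: stream a file to some sender in fixed-size chunks,
# without ever loading the whole file into memory.
send_in_chunks <- function(path, send, chunk_size = 2^15) {
  con <- file(path, open = "rb")   # open a binary read connection
  on.exit(close(con))              # always close the connection, even on error
  n_sent <- 0L
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break # end of file reached
    send(chunk)                    # e.g. one HTTP request per chunk
    n_sent <- n_sent + 1L
  }
  invisible(n_sent)
}
```

In a client like rirods, `send` would presumably wrap the REST call that uploads one chunk of the data object.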
It might indeed not be that hard. It requires some extensions. What we can probably say is that this PR is a step forward. In the next PR I will then include `disk_to_irods()` in the back end of `iput()`.
I reconsidered: maybe let's hold the merge for a bit. The change needed to make it perfect might be really minor.
It really does not seem that hard to implement. The following should now happen:
iput # disk -> irods
isaveRDS # memory -> irods
iget # irods -> disk
ireadRDS # irods -> memory
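As a rough illustration of the `iget` direction (irods -> disk), the receiving side can mirror the sending side: write each incoming raw chunk straight to a file connection, so the whole object never has to sit in memory. This is a hedged sketch, not the rirods implementation; `receive_chunk` is a hypothetical placeholder for the REST call that returns the next chunk of raw bytes (and `NULL` when the object is exhausted).

```r
# Hedged sketch: stream chunks from some receiver straight to disk.
irods_to_disk <- function(receive_chunk, dest) {
  con <- file(dest, open = "wb")   # open a binary write connection
  on.exit(close(con))              # always close the connection
  repeat {
    chunk <- receive_chunk()       # placeholder for one REST download request
    if (is.null(chunk) || length(chunk) == 0L) break
    writeBin(chunk, con)           # append the chunk to the file on disk
  }
  invisible(dest)
}
```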
very nice. is it faster than going to the temporary disk? can you measure before/after for a large file?
I'll see to it that I run some benchmarks. I presume that avoiding the move from disk to memory as an intermediate step is beneficial anyway in terms of available RAM on an average laptop or PC.
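For a concrete before/after comparison, a minimal base-R timing sketch could look like the following. The real benchmarks live in `inst/bench`; the file size, chunk size, and helper names here are illustrative assumptions only.

```r
# Hedged sketch: compare reading a file whole versus in chunks.
path <- tempfile()
writeBin(as.raw(sample(0:255, 1e6, replace = TRUE)), path)

# One big allocation: the entire file in memory at once.
read_whole <- function(p) readBin(p, what = "raw", n = file.size(p))

# Many small allocations, but memory use bounded by 'chunk' bytes.
read_chunks <- function(p, chunk = 2^15) {
  con <- file(p, open = "rb")
  on.exit(close(con))
  total <- 0L
  repeat {
    piece <- readBin(con, what = "raw", n = chunk)
    if (length(piece) == 0L) break
    total <- total + length(piece)
  }
  total
}

system.time(read_whole(path))   # time for the single big read
system.time(read_chunks(path))  # time for the chunked read
```

A dedicated benchmarking package that also tracks memory allocations would give a fuller picture than `system.time()` alone.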
Puzzling results. My first fix made `iput()` better, as expected. My last iteration, without an intermediate step of using an R object, makes everything (`iget()` and `iput()`) worse than it used to be in the main branch.
See results (n = 100) in the README here: https://github.com/MartinSchobben/irods_client_library_rirods/tree/dev/inst/bench
Could be because reading the entire file at once is faster than going back for chunks? Try with different file sizes to see if that has an effect.
I will try the different sizes for `iput()`, but it seems that I broke `iget()` in my last commit. I need to analyse `iget()` some more, as I do not directly see why it failed.
So, is `disk -> memory -> iRODS` faster than `disk -> memory -> disk -> iRODS`, but `disk -> iRODS` is somewhat slower for larger files? Or am I misunderstanding?
So, I figured out that the commit `disk -> memory -> iRODS` (which is not the head of my `dev` branch) contains an error. Those results are not reliable, so instead I now focus on the last commit (`disk -> iRODS`), as this seems to be the most correct way anyway. On top of that, the results show that it has almost the same (a bit worse) performance as the irods main branch (`disk -> memory -> disk -> iRODS`): https://github.com/MartinSchobben/irods_client_library_rirods/tree/dev/inst/bench. I also see some room for improvement by changing some parameters of the connection (`file`) arguments.
Furthermore, I need better unit tests, as some things are still slipping through, as this experience has taught me.
What do you think of the current changes in terms of performance? These updates are indeed good for memory usage, although less improvement can be gained in execution time. I also updated the tests, so I am now sure that this works.
https://github.com/MartinSchobben/irods_client_library_rirods/tree/dev/inst/bench
Large file sizes are still quite horrible, as the transfers take ages to execute. Am I right in assuming that the REST API buffer has now been increased to about 8 ~~Gb~~ Mb? That would make a huge difference.
I'm having a tough time understanding the benchmark results.
Am I seeing GBs of memory allocated to put or get a 6MB file? And it's taking 11 minutes?
What is `max_memory`?
And yes, the REST API is probably going to get a lot smarter... and maybe even a rewrite as a new/separate HTTP API (dropping the confusing 'REST' title)... to be determined...
I updated some of the information in the table, figures, and text. But indeed, in the current irods main branch `iget` can consume a lot of memory; this is fixed in this PR.
Do note that the current example of a 6 Mb file can take thousands of iterations because of the current small buffer size for requests. When this is increased to 8 Mb, this problem is also circumvented.
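The arithmetic behind those iteration counts is simple to sketch. The small buffer size below is an assumed illustrative value, not the actual REST API default; the point is only how the request count scales with the buffer.

```r
# Illustrative request counts for transferring one file chunk by chunk.
file_size <- 6 * 2^20   # a 6 Mb file, in bytes
small_buf <- 4 * 2^10   # assumed small request buffer of 4 Kb (illustrative)
large_buf <- 8 * 2^20   # the proposed 8 Mb buffer

ceiling(file_size / small_buf)  # 1536 requests with the small buffer
ceiling(file_size / large_buf)  # 1 request with the 8 Mb buffer
```

With per-request HTTP overhead multiplied by the request count, it is clear why a larger buffer would dominate the timing results.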
Got it. Yes a bigger REST buffer is needed and that overhead would dominate the results.
Finally, all seems to be as good as it gets. For completeness I have added benchmarks for `isaveRDS()` and `ireadRDS()`. However, this is another story, as the serialization of R objects can be very efficient. In general this results in super-fast execution times for those two functions. That is good, as those functions are important, whereas I still regard `iget()` and `iput()` as of secondary importance for the general R workflow. But when the buffer size of the iRODS REST API increases, those should also give good execution times up to several Mb in file size, while memory usage is now fixed by iterating directly over a connection to the file on disk.
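Why the serialization route can be so efficient is visible with base R alone. This is a generic sketch, not the rirods implementation: `serialize()` turns an object into a raw vector entirely in memory, which can then be chunked and sent without ever touching disk.

```r
# Hedged sketch: the memory -> raw bytes -> memory round trip that
# functions like isaveRDS()/ireadRDS() can build on.
obj     <- list(x = 1:10, y = letters)        # any R object
raw_vec <- serialize(obj, connection = NULL)  # memory -> raw vector, no disk
length(raw_vec)                               # size in bytes of the serialized object
identical(unserialize(raw_vec), obj)          # raw vector -> memory; TRUE
```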
I have added a GitHub Actions workflow that runs the benchmarks every time `R/data-objects.R` changes, so we can keep track if we make changes to the respective functions.
merging
This is a performance fix: `iput` and `isaveRDS` now stream their data directly (from disk and memory, respectively), where in the former implementation this happened with an intermediate save to disk.