mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

SSH transfer scales badly for large data #222

Closed by mschubert 8 months ago

mschubert commented 3 years ago
clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)
Connecting USER@HOST via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [30.5s 9.3% CPU]; Worker: [avg 13.5% CPU, max 201693041.0 Mb]
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs,  : 
  1/1 jobs failed (0 warnings). Stopping.
(Error #1) object 'C_objectSize' not found

Originally posted by @mattwarkentin in https://github.com/wlandau/targets/issues/237#issuecomment-736016884

mschubert commented 3 years ago

@mattwarkentin That may actually be an issue with object.size itself (it accesses an R internal). What happens if you use sum instead?
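
If the error really comes from the serialized object.size closure losing its link to the C_objectSize native symbol, one possible workaround (a sketch only, not verified in this thread) is to wrap the call so the worker resolves object.size from its own installed utils namespace:

# Sketch of a possible workaround (assumption: the bare closure loses its
# namespace link when shipped to the worker). Wrapping the call lets the
# worker look up utils::object.size, and its native symbol, locally.
clustermq::Q(function(x) as.numeric(utils::object.size(x)),
             x = list(rnorm(1e8)), n_jobs = 1)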

mattwarkentin commented 3 years ago
clustermq::Q(function(x) sum(x), x = list(rnorm(1e8)), n_jobs = 1)

Connecting via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [23.4s 12.2% CPU]; Worker: [avg 14.0% CPU, max 201689286.0 Mb]
[[1]]
[1] 1940.743

mschubert commented 3 years ago

Ok. So the reported memory is wrong, but the return time looks good.

What if you use data of the same in-memory size as your problematic file? And what if you use the actual file contents you had?

(It's late here and I can't think anymore, I'll revisit this tomorrow)

mattwarkentin commented 3 years ago

No worries!

I will try out a handful of tests using toy and real data and post the results here.

mattwarkentin commented 3 years ago

Sending toy data nearly the same size as the actual data (~3.2 Gb):

lobstr::obj_size(rnorm(4e8))
3,200,000,048 B

For comparison, the size of the data in my previous comment (rnorm(1e8)) was 800,000,048 B, or 800 Mb.

clustermq::Q(function(x) sum(x), x = list(rnorm(4e8)), n_jobs = 1)

The above command timed out after 20 minutes (clustermq.worker.timeout = 1200). So for data that is 4x larger in memory, the transfer took at least 52x longer before timing out. It seems the transfer time scales nearly cubically with in-memory size (actually, I guess it could be anything cubic or larger; we don't really know).
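
As a rough, illustrative check using only the numbers above (the 3.2 Gb run never finished, so this is a lower bound on the exponent):

# 800 Mb took ~23.4 s; the 4x larger payload had not finished after 1200 s.
# Lower bound on the scaling exponent:
log(1200 / 23.4) / log(4)
#> [1] 2.84   # i.e. at least roughly cubic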

wlandau commented 3 years ago

Seems like that explains https://github.com/wlandau/targets/issues/237 (please correct me if I am wrong).

mattwarkentin commented 3 years ago

Good timing. I just updated the targets issue to report these findings.

mschubert commented 8 months ago

Testing this on 0.9.2.9000 using SSH:

fx = function(NUM) clustermq::Q(function(x) sum(x), x=list(rnorm(NUM)), n_jobs = 1)
sapply(c(2.5e7, 5e7, 1e8, 2e8, 4e8), fx)
Number of rnorm   Data size   Time to complete
2.5e7             200 Mb      24 seconds
5e7               400 Mb      47 seconds
1e8               800 Mb      1.4 minutes
2e8               1.6 Gb      2.7 minutes
4e8               3.2 Gb      5.4 minutes
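
A minimal sketch of how such timings could be collected (assuming system.time around each call; the helper name is hypothetical, not from this thread):

# Hypothetical timing wrapper around the fx() calls above; returns elapsed seconds.
time_transfer = function(NUM) {
  unname(system.time(
    clustermq::Q(function(x) sum(x), x = list(rnorm(NUM)), n_jobs = 1)
  )["elapsed"])
}
sapply(c(2.5e7, 5e7, 1e8, 2e8, 4e8), time_transfer)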

For comparison, transferring 1e8 random numbers in a 733 Mb .rds file via scp took 1.2 minutes.

So overall, this does not seem to be a problem (anymore?).