mschubert opened 9 years ago
I've just discovered a bug: setting `fs.timeout` disabled staging (D'OH!). I've already pushed a fix for this (cefc9c273daf0e48dee7156249e63b55457601da).

Furthermore, I tried to finally get rid of all database problems once and for all by avoiding all database transactions on the nodes (although read-only should in theory be safe). This is now also in the devel branch and is automatically enabled if `staged.queries` is set. Could you please give it a try on your system?
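For reference, enabling it is just a config change. A minimal sketch (`setConfig()`/`getConfig()` are the package's config helpers):

```r
library(BatchJobs)
# Staged queries make the nodes write status updates to staged files
# instead of opening SQLite transactions.
setConfig(staged.queries = TRUE)
getConfig()  # verify the option is active
```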
I'm having issues with `submitJobs()` in https://github.com/tudo-r/BatchJobs/commit/cefc9c273daf0e48dee7156249e63b55457601da.

Backend: LSF. Submitting 292 chunks / 1457500 jobs.

In `writeFiles()`, the `Map(...)` call causes the submission of jobs to take over 5 hours in total. I realize that the number of jobs is high, but the number of chunks should be small enough that submission finishes within a minute (without an explicit `job.delay`; note that with `staged.queries = FALSE` submission takes about 30 minutes, but then some jobs fail because of a locked db).

I'm thinking that this is mainly due to file system load that already starts during submission and later continues while the jobs are running. I designed my algorithm so that bigger objects are all passed via `more.args` and each individual function call (= job) only needs an index into those objects (a couple of bytes). Still, the file system becomes a lot less responsive after a couple of chunks are started.
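For context, a minimal sketch of that design (registry and function names are made up): the large object travels once via `more.args`, and each job receives only a small integer index into it.

```r
library(BatchJobs)
reg <- makeRegistry(id = "idx_demo", file.dir = tempfile("bj"))
big <- matrix(rnorm(1e6), nrow = 1000)  # shared object, serialized once
f <- function(i, big) mean(big[i, ])    # each job needs only the row index
batchMap(reg, f, i = seq_len(nrow(big)), more.args = list(big = big))
submitJobs(reg, ids = chunk(getJobIds(reg), n.chunks = 10))
```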
If I understand my debugging correctly, BatchJobs runs a new instance of R for every job (not per chunk, for both `staged.queries = TRUE` and `FALSE`)? If that's the case, then BatchJobs cannot be used for approximately 1M function calls with > 100 chunks (which would be a pity).
Well, that's basically the tradeoff. You either rely on the database to retrieve the information and risk running into a locked database, or you store everything on the file system to avoid querying the database on the nodes, which might be a big overhead.

Are you sure that https://github.com/tudo-r/BatchJobs/blob/master/R/writeFiles.R#L34 is the bottleneck? I've tried it myself with 1e6 jobs and found that it takes less than 5 minutes on a local (but slow) HDD. If you're sure that the `Map()` takes most of the time, I'll try to optimize this for high-latency file systems, i.e. write the job information chunked.
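To make that concrete, a hypothetical sketch of what "writing the job information chunked" could look like (this is not the package's actual code): one file per chunk instead of one per job reduces metadata operations from ~1.5M to ~300 for the numbers above.

```r
# Hypothetical helper: serialize job info grouped by chunk, so a
# high-latency fs sees one file create per chunk rather than one per job.
write_jobs_chunked <- function(job.infos, chunks, dir) {
  for (k in seq_along(chunks)) {
    saveRDS(job.infos[chunks[[k]]],
            file = file.path(dir, sprintf("jobs_chunk_%05i.rds", k)))
  }
}
```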
The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need that, and if not, then you could avoid most of the db or file system load altogether: for instance, use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).
I am sure that the `Map()` is the problem for the first 20 chunks submitted (I stepped through it in the debugger). After the first couple of chunks it also gets remarkably slower, but I haven't looked at that specifically. I can do a full profiling run, but that will take at least a day because of the nature of the issue.
> The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need that, and if not, then you could avoid most of the db or file system load altogether: for instance, use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).
We kind of do this already by using a buffer which is flushed every 5-10 minutes (c.f. `doJob.R`, `msg.buf`).
> I am sure that the `Map()` is the problem for the first 20 chunks submitted (I stepped through it in the debugger). After the first couple of chunks it also gets remarkably slower, but I haven't looked at that specifically. I can do a full profiling run, but that will take at least a day because of the nature of the issue.
Please check whether 0f914a2bf75ddaa2f6223f9880e207c0ec79c5f5 mitigates the runtime issues.
https://github.com/tudo-r/BatchJobs/commit/0f914a2bf75ddaa2f6223f9880e207c0ec79c5f5: submission is down to 9 minutes (a 30x speedup); "Syncing registry" afterwards takes 10 minutes.

File system load is still too high overall (at about 300 jobs running); I had to manually stop and resume jobs to keep the volume responsive.

Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When the reduce completed after 4 days, R crashed with:

```
*** caught bus error ***
address 0x2abbf7d08e50, cause 'non-existent physical address'
Bus error (core dumped)
```
> 0f914a2: submission is down to 9 minutes (a 30x speedup); "Syncing registry" afterwards takes 10 minutes.
Well, that sounds acceptable to me.
> File system load is still too high overall (at about 300 jobs running); I had to manually stop and resume jobs to keep the volume responsive.
I'll set the update frequency for chunked jobs more conservatively. But also check your log files -- if you produce a lot of output, this could be the problem. You could probably try to redirect logs to `/dev/null`.
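If you try that, the place to change it is the brew template for your scheduler. A hypothetical LSF template tweak (assuming a template along the lines of the package examples, where `log.file` is the brew variable):

```
## instead of writing the job log into the registry:
#BSUB -o <%= log.file %>
## send it to /dev/null (note: showLog() won't work for these jobs):
#BSUB -o /dev/null
```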
> Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When the reduce completed after 4 days, R crashed with:
>
> ```
> *** caught bus error ***
> address 0x2abbf7d08e50, cause 'non-existent physical address'
> Bus error (core dumped)
> ```
I don't think I touched anything here ... have you solved this? I would assume this is a (temporary) file system problem. We just iterate over the results and load them, nothing special.
The crash was caused by a bug in dplyr (used to assemble my results afterwards; it has nothing to do with BatchJobs). The time it takes to reduce the results remains an issue, however.
Well, maybe reading 500,000 files just takes some time. But if you give me some more information, I can try to optimize this step a bit:

- How big are your results (`object.size()`)?
- How do you aggregate the results in `reduceResults()`?
- Have you tried `reduceResultsList()` instead? This one pre-allocates and thus does not need to copy `aggr` in every iteration, which is a likely bottleneck.
- Is `reduceResultsParallel()` a viable alternative?

Thank you for your continued efforts!
- Each result is a single `numeric` representing the mean cross-validation error of a trained `glmnet` model.
- I will try `reduceResultsList()`.
- `reduceResultsParallel()` would put additional strain on the file system, and I'd like to avoid that.

In general, I think that the approach of having one result file with one `numeric` per function call is not feasible with >1M calls on a high-latency fs. I also played around with `rzmq`, which bypasses the file system altogether, and could see 100x speedups.
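For reference, the two reduction variants in question, as a hedged sketch (assuming each result is a single numeric, as described):

```r
# Incremental reduce: 'aggr' grows by one element per result, which can
# trigger a copy of the whole vector on every iteration for ~1.5M results.
errs <- reduceResults(reg, fun = function(aggr, job, res) c(aggr, res),
                      init = numeric(0))

# List variant: pre-allocates one slot per job, flattened once at the end.
errs <- unlist(reduceResultsList(reg))
```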
When submitting a large number of jobs, BatchJobs still fails for me (this is somewhat similar to https://github.com/tudo-r/BatchJobs/issues/58, but the number of jobs is almost 50 times higher).
I submit between 275,000 and 500,000 jobs in 1, 2, 10, and 25 chunks.
Submitting jobs in one chunk always works, and so does sending 2 chunks; 10 chunks sometimes works and sometimes doesn't, and 25 chunks never works.
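To make the chunk counts concrete, a minimal sketch using BatchJobs' `chunk()` helper (the registry name `reg` is illustrative):

```r
library(BatchJobs)
ids <- getJobIds(reg)  # assumes an existing registry 'reg'
for (n in c(1, 2, 10, 25)) {
  chunked <- chunk(ids, n.chunks = n, shuffle = TRUE)
  message(n, " chunks -> ", length(chunked), " submission units")
}
# submit one configuration, e.g. the failing 25-chunk case:
submitJobs(reg, ids = chunk(ids, n.chunks = 25))
```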
With `staged.queries = TRUE` (otherwise the behaviour is the same as in https://github.com/tudo-r/BatchJobs/issues/58), and independent of `db.options = list(pragmas = c("busy_timeout=5000", "journal_mode=WAL"))` and `fs.timeout`:

- the `submitJobs()` call itself runs fine until `return(invisible(ids))`
- in `waitForJobs()`, R segfaults
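For reproducibility, a hedged sketch of the settings under which this happens (`setConfig()` is BatchJobs' config helper; the option values are the ones described above):

```r
library(BatchJobs)
setConfig(staged.queries = TRUE,
          db.options = list(pragmas = c("busy_timeout=5000",
                                        "journal_mode=WAL")))
# fs.timeout was varied as well; it made no difference here.
submitJobs(reg, ids = chunk(getJobIds(reg), n.chunks = 25))  # completes
waitForJobs(reg)  # R segfaults here with many chunks
```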