mschubert opened 9 years ago
I've just discovered a bug: setting `fs.timeout` disabled staging (D'OH!). I've already pushed a fix for this (cefc9c273daf0e48dee7156249e63b55457601da).

Furthermore, I tried to finally get rid of all database problems once and for all by avoiding all database transactions on the nodes (although read-only should in theory be safe). This is now also in the devel branch and is automatically enabled if `staged.queries` is set. Could you please give it a try on your system?
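For reference, enabling it is just a config change. A minimal sketch (`setConfig()`/`getConfig()` are the package's config helpers):

```r
library(BatchJobs)
# Staged queries make the nodes write status updates to staged files
# instead of opening SQLite transactions.
setConfig(staged.queries = TRUE)
getConfig()  # verify the option is active
```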
I'm having issues with `submitJobs()` in https://github.com/tudo-r/BatchJobs/commit/cefc9c273daf0e48dee7156249e63b55457601da.

Backend: LSF. Submitting 292 chunks / 1457500 jobs.

In `writeFiles()`, the `Map(...)` call causes the submission of jobs to take over 5 hours in total. I realize that the number of jobs is high, but the number of chunks should be small enough that submission finishes within a minute (without an explicit `job.delay`; note that with `staged.queries = FALSE` submission takes about 30 minutes, but then some jobs fail because of a locked db).

I'm thinking that this is mainly due to file system load that already starts during submission and later continues while the jobs are running. I designed my algorithm so that bigger objects are all passed via `more.args` and each individual function call (= job) only needs an index into those objects (a couple of bytes). Still, the file system becomes a lot less responsive after a couple of chunks are started.
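For context, a minimal sketch of that design (registry and function names are made up): the large object travels once via `more.args`, and each job receives only a small integer index into it.

```r
library(BatchJobs)
reg <- makeRegistry(id = "idx_demo", file.dir = tempfile("bj"))
big <- matrix(rnorm(1e6), nrow = 1000)  # shared object, serialized once
f <- function(i, big) mean(big[i, ])    # each job needs only the row index
batchMap(reg, f, i = seq_len(nrow(big)), more.args = list(big = big))
submitJobs(reg, ids = chunk(getJobIds(reg), n.chunks = 10))
```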
If I understand my debugging correctly, BatchJobs runs a new instance of R for every job (not per chunk, for both `staged.queries = TRUE` and `FALSE`)? If that's the case, then BatchJobs cannot be used for approximately 1M function calls with > 100 chunks (which would be a pity).
Well, that's basically the tradeoff. You either rely on the database to retrieve the information and risk running into a locked database, or you store everything on the file system to avoid querying the database on the nodes, which might be a big overhead.

Are you sure that https://github.com/tudo-r/BatchJobs/blob/master/R/writeFiles.R#L34 is the bottleneck? I've tried it myself with 1e6 jobs and found that it takes less than 5 minutes on a local (but slow) HDD. If you're sure that the `Map()` takes most of the time, I'll try to optimize this for high-latency file systems, i.e. write the job information chunked.
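To make that concrete, a hypothetical sketch of what "writing the job information chunked" could look like (this is not the package's actual code): one file per chunk instead of one per job reduces metadata operations from ~1.5M to ~300 for the numbers above.

```r
# Hypothetical helper: serialize job info grouped by chunk, so a
# high-latency fs sees one file create per chunk rather than one per job.
write_jobs_chunked <- function(job.infos, chunks, dir) {
  for (k in seq_along(chunks)) {
    saveRDS(job.infos[chunks[[k]]],
            file = file.path(dir, sprintf("jobs_chunk_%05i.rds", k)))
  }
}
```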
The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need that, and if not, then you could avoid most of the db or file system load altogether: for instance, use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).
I am sure that the `Map()` is the problem for the first 20 chunks submitted (I stepped through it in the debugger). After the first couple of chunks it also gets remarkably slower, but I haven't looked at that specifically. I can do a full profiling run, but that will take at least a day because of the nature of the issue.
> The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need that, and if not, then you could avoid most of the db or file system load altogether: for instance, use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).
We kind of do this already by using a buffer which is flushed every 5-10 minutes (c.f. `doJob.R`, `msg.buf`).
> I am sure that the `Map()` is the problem for the first 20 chunks submitted (I stepped through it in the debugger). After the first couple of chunks it also gets remarkably slower, but I haven't looked at that specifically. I can do a full profiling run, but that will take at least a day because of the nature of the issue.
Please check whether 0f914a2bf75ddaa2f6223f9880e207c0ec79c5f5 mitigates the runtime issues.
https://github.com/tudo-r/BatchJobs/commit/0f914a2bf75ddaa2f6223f9880e207c0ec79c5f5: submission is down to 9 minutes (a 30x speedup); "Syncing registry" afterwards takes 10 minutes.

File system load is still too high overall (at about 300 jobs running); I had to manually stop and resume jobs to keep the volume responsive.

Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When the reduce completed after 4 days, R crashed with:

```
*** caught bus error ***
address 0x2abbf7d08e50, cause 'non-existent physical address'
Bus error (core dumped)
```
> 0f914a2: submission is down to 9 minutes (a 30x speedup); "Syncing registry" afterwards takes 10 minutes.
Well, that sounds acceptable to me.
> File system load is still too high overall (at about 300 jobs running); I had to manually stop and resume jobs to keep the volume responsive.
I'll set the update frequency for chunked jobs more conservatively. But also check your log files -- if you produce a lot of output, this could be the problem. You could probably try to redirect logs to `/dev/null`.
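If you try that, the place to change it is the brew template for your scheduler. A hypothetical LSF template tweak (assuming a template along the lines of the package examples, where `log.file` is the brew variable):

```
## instead of writing the job log into the registry:
#BSUB -o <%= log.file %>
## send it to /dev/null (note: showLog() won't work for these jobs):
#BSUB -o /dev/null
```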
> Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When the reduce completed after 4 days, R crashed with:
>
> ```
> *** caught bus error ***
> address 0x2abbf7d08e50, cause 'non-existent physical address'
> Bus error (core dumped)
> ```
I don't think I touched anything here ... have you solved this? I would assume this is a (temporary) file system problem. We just iterate over the results and load them, nothing special.
The crash was caused by a bug in dplyr (used to assemble my results afterwards; it has nothing to do with BatchJobs). The time it takes to reduce the results remains an issue, however.
Well, maybe reading 500,000 files just takes some time. But if you give me some more information, I can try to optimize this step a bit:

- How big are your results (`object.size()`)?
- How do you aggregate the results in `reduceResults()`?
- Have you tried `reduceResultsList()` instead? This one pre-allocates and thus does not need to copy `aggr` in every iteration, which is a likely bottleneck.
- Is `reduceResultsParallel()` a viable alternative?

Thank you for your continued efforts!
- Each result is a single `numeric` representing the mean cross-validation error of a trained `glmnet` model.
- I will try `reduceResultsList()`.
- `reduceResultsParallel()` would put additional strain on the file system, and I'd like to avoid that.

In general, I think that the approach of having one result file with one `numeric` per function call is not feasible with >1M calls on a high-latency fs. I also played around with `rzmq`, which bypasses the file system altogether, and could see 100x speedups.
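For reference, the two reduction variants in question, as a hedged sketch (assuming each result is a single numeric, as described):

```r
# Incremental reduce: 'aggr' grows by one element per result, which can
# trigger a copy of the whole vector on every iteration for ~1.5M results.
errs <- reduceResults(reg, fun = function(aggr, job, res) c(aggr, res),
                      init = numeric(0))

# List variant: pre-allocates one slot per job, flattened once at the end.
errs <- unlist(reduceResultsList(reg))
```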
When submitting a large number of jobs, BatchJobs still fails for me (this is somewhat similar to https://github.com/tudo-r/BatchJobs/issues/58, but the number of jobs is almost 50 times higher).
I submit between 275,000 and 500,000 jobs in 1, 2, 10, and 25 chunks.
Submitting jobs in one chunk always works, and so does sending 2 chunks; 10 chunks sometimes works and sometimes doesn't, and 25 chunks never works.
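To make the chunk counts concrete, a minimal sketch using BatchJobs' `chunk()` helper (the registry name `reg` is illustrative):

```r
library(BatchJobs)
ids <- getJobIds(reg)  # assumes an existing registry 'reg'
for (n in c(1, 2, 10, 25)) {
  chunked <- chunk(ids, n.chunks = n, shuffle = TRUE)
  message(n, " chunks -> ", length(chunked), " submission units")
}
# submit one configuration, e.g. the failing 25-chunk case:
submitJobs(reg, ids = chunk(ids, n.chunks = 25))
```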
With `staged.queries = TRUE` (otherwise the behaviour is the same as in https://github.com/tudo-r/BatchJobs/issues/58), and independent of `db.options = list(pragmas = c("busy_timeout=5000", "journal_mode=WAL"))` and `fs.timeout`:

- the `submitJobs()` call itself runs fine until `return(invisible(ids))`
- in `waitForJobs()`, R segfaults
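For reproducibility, a hedged sketch of the settings under which this happens (`setConfig()` is BatchJobs' config helper; the option values are the ones described above):

```r
library(BatchJobs)
setConfig(staged.queries = TRUE,
          db.options = list(pragmas = c("busy_timeout=5000",
                                        "journal_mode=WAL")))
# fs.timeout was varied as well; it made no difference here.
submitJobs(reg, ids = chunk(getJobIds(reg), n.chunks = 25))  # completes
waitForJobs(reg)  # R segfaults here with many chunks
```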