mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Why are results saved per call instead of per chunk? #222

Closed · mschubert closed this 5 years ago

mschubert commented 5 years ago

I am the author of a similar package (clustermq) that distributes parallel computations on HPC schedulers via the network (using ZeroMQ) and returns the results to the main session without involving network storage.

I am currently working on a publication of clustermq and am comparing its performance to batchtools for high numbers of short-running jobs (up to 10^9 calls). I expected batchtools to be slower because of its reliance on network storage; however, the difference was much larger than expected (up to 1000x for overhead alone, see plot below).

This was done using a trivial call and setting the number of chunks equal to the number of jobs:

# n_calls, n_jobs and reg are defined elsewhere
fun = function(x) x * 2
x = runif(n_calls)
result = btmapply(fun, x = x, n.chunks = n_jobs, reg = reg)

This is of course a borderline case and real HPC computations will be more elaborate, but I found that this still makes a significant difference for real-world analysis examples (e.g. when batchtools is not able to process any computation of 10^7 calls or more).

When investigating this in more detail, I found that batchtools creates separate files for the results of every call, irrespective of whether I used chunking or not. BatchJobs behaved the same way before, so I would think this is a design choice rather than a bug.

Similar issues have been raised in the past, e.g. suggesting parallel collection or database sharding, but to me the most obvious solution would be to limit the number of result files.

Wouldn't it make more sense for batchtools to create one results file per chunk? This would allow much faster retrieval of a high number of (small) results, and isn't this what chunks are for?

mllg commented 5 years ago

Thanks for getting in contact beforehand. Your package looks great!

[...] This was done using a trivial call and setting the number of chunks equal to the number of jobs:

I don't get this. This disables chunking, right? If you have a queue with 100 nodes, I would set the number of chunks to 100.

Usually, you chunk jobs together according to the walltime. If you have a queue with a walltime of 8 hours, you would chunk jobs together so that a single computational job (= 1 chunk) runs for 90% of the walltime, i.e. 7.2h.
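For illustration, here is a minimal sketch of that kind of walltime-based chunking with batchtools (the ~5-minute per-job estimate and the registry reg are assumptions):

library(batchtools)

# Assume each job runs ~5 minutes and the queue walltime is 8 hours;
# targeting 90% of the walltime gives 7.2 h / 5 min ≈ 86 jobs per chunk.
ids = findNotSubmitted(reg = reg)
ids$chunk = chunk(ids$job.id, chunk.size = 86)
submitJobs(ids, reg = reg)

For jobs with heterogeneous runtime estimates, binpack() or lpt() can be used instead of chunk() to balance the total runtime per chunk.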

This is of course a borderline case and real HPC computations will be more elaborate, but I found that this still makes a significant difference for real-world analysis examples (e.g. when batchtools is not able to process any computation of 10^7 calls or more).

TBH, I have never worked with more than a few million jobs in a single registry. With a background in machine learning and bioinformatics, jobs usually take at least a few minutes, which makes more than 10^7 jobs impossible to calculate.

When investigating this in more detail, I found that batchtools creates separate files for the results of every call, irrespective of whether I used chunking or not. BatchJobs behaved the same way before, so I would think this is a design choice rather than a bug.

One major annoyance on HPCs is, in my experience, the troubleshooting when something goes wrong. That's why batchtools tries to be very robust here: if the n-th job of a chunk fails with a segfault (or the scheduler decides it is time to terminate), all previously calculated results are still available on the master.
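For example, with per-call result files the surviving results of a crashed chunk can be collected from the main session roughly like this (the registry reg is assumed to exist):

library(batchtools)

# After a chunk dies (segfault, walltime exceeded, ...), the jobs that
# finished before the crash still have their result files on disk.
done = findDone(reg = reg)                          # jobs with a result file
partial = reduceResultsList(ids = done, reg = reg)  # collect what is there
submitJobs(findNotDone(reg = reg), reg = reg)       # resubmit the rest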

In general, batchtools tries to be as much "fire and forget" as possible. This allowed us to work on heterogeneous Docker clusters, in an Azure cloud, or on EC2. I was able to use the packages on systems with a network file system latency of more than a minute, on systems where the network was unreliable, and on systems where some compute nodes occasionally became unavailable. Also note that batchtools does not rely on a master process, as this is sometimes frowned upon by sysadmins (and killed by their watchdogs). Apparently, this robustness has its price, as shown in your benchmark.

Similar issues have been raised in the past, e.g. suggesting parallel collection or database sharding, but to me the most obvious solution would be to limit the number of result files.

Sharding could mitigate the runtime overhead and will be implemented in the next version of the package. However, the benefit of sharding strongly depends on the network file system. This is why I would also recommend running a (smaller?) version of your benchmark on a second system with different specs.
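For context, a generic sketch of what sharding the result files could look like (a hypothetical layout, not what batchtools will actually use):

# Spread result files over subdirectories keyed by the job id, so that
# no single directory accumulates millions of files.
result_path = function(file.dir, job.id) {
  shard = sprintf("%02i", job.id %% 100L)   # 100 shards: "00" .. "99"
  file.path(file.dir, "results", shard, sprintf("%i.rds", job.id))
}

result_path("registry", 421731L)
# "registry/results/31/421731.rds"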

Wouldn't it make more sense for batchtools to create one results file per chunk? This would allow much faster retrieval of a high number of (small) results, and isn't this what chunks are for?

I thought about it, too. The problem is mainly the file format. I would need a file format which allows me to retrieve partial results without reading the complete file. SQLite is a candidate.

mschubert commented 5 years ago

Thank you for your elaborate response.

I don't get this. This disables chunking, right?

Sorry if this was confusing: I used jobs (and n_jobs) for the number of jobs submitted to the batch system and calls for the number of function calls to be evaluated. So the above should assign one chunk to each job submitted to the scheduler (each running calls/jobs computations).

With a background in machine learning and bioinformatics, jobs usually take at least a few minutes, which makes more than 10^7 jobs impossible to calculate.

Yes, batchtools is great for running calls that take a couple of minutes to compute. The main reason I started writing clustermq is that I had about 10M linear regression calls for the GDSC project that I wanted to process without having to manually write functions that each handle 1/n-th of them (this was still using BatchJobs).

One major annoyance on HPCs is, in my experience, the troubleshooting when something goes wrong.

This is absolutely correct. I think I've got a good handle on R errors (via calling handlers) and memory (via ulimit), but segfaults are still problematic for my package, leaving the user to manually check scheduler log files. I'm looking into using worker heartbeats (https://github.com/mschubert/clustermq/issues/33), but this is still future work.
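As an aside, here is a generic sketch of the calling-handler idea (not clustermq's actual implementation): the worker turns an error and its call stack into a value instead of dying, so both can be shipped back to the main session.

run_call = function(fun, args) {
  tb = NULL
  res = tryCatch(
    withCallingHandlers(
      do.call(fun, args),
      error = function(e) tb <<- sys.calls()  # grab the stack before unwinding
    ),
    error = function(e) structure(conditionMessage(e), class = "worker_error")
  )
  list(result = res, traceback = tb)
}

run_call(function(x) stop("boom"), list(x = 1))  # the error is returned, not thrown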

Also note that batchtools does not rely on a master process, as this is sometimes frowned upon by sysadmins

That's interesting. I had no issues with the master process, but did overstress the file system with batchtools on occasion.

Apparently, this robustness has its price, as shown in your benchmark.

I definitely think there is room for both packages. I'm not aiming to replace batchtools, but provide an alternative for use cases where your approach becomes a non-negligible bottleneck.

There is value in a more robust approach at the cost of performance, and I will make sure to state this.

This is why I would also recommend running a (smaller?) version of your benchmark on a second system with different specs

I did try on a second system, and the results were comparable.

For smaller numbers of calls (hundreds) or longer run times (minutes) on 10-50 jobs, there is no performance reason to use clustermq over batchtools (though there could be for SSH+scheduler setups or when using the packages out of the box).

Ideally, I would also want to test longer-running calls with hundreds of jobs, but we just don't have the capacity at our computing facilities (I had to spread the current test over 3 weeks to avoid scheduling priority issues).

I thought about it, too. The problem is mainly the file format. I would need a file format which allows me to retrieve partial results without reading the complete file. SQLite is a candidate.

I see that this is an issue, especially considering the SQLite file locking issues with BatchJobs that prompted batchtools in the first place (and that I struggled with too when using BatchJobs).

mllg commented 5 years ago

Sorry if this was confusing: I used jobs (and n_jobs) for the number of jobs submitted to the batch system and calls for the number of function calls to be evaluated. So the above should assign one chunk to each job submitted to the scheduler (each running calls/jobs computations).

Ah, now I get it. Thanks for the explanation.

Yes, batchtools is great for running calls that take a couple of minutes to compute. The main reason I started writing clustermq is that I had about 10M linear regression calls for the GDSC project that I wanted to process without having to manually write functions that each handle 1/n-th of them (this was still using BatchJobs).

Yes, this seems like an example where you would need to manually chunk jobs together prior to using batchtools.
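For illustration, a minimal sketch of that kind of manual chunking (the block count and the doubling function are made up): each batchtools job receives a whole block of inputs and loops over it, so only one result file is written per block.

library(batchtools)

reg = makeRegistry(file.dir = NA)   # temporary registry, for illustration only

n_calls = 1e6
x = runif(n_calls)

# Split the calls into 100 blocks; each block becomes one batchtools job
blocks = split(x, cut(seq_along(x), breaks = 100, labels = FALSE))

process_block = function(block) {
  vapply(block, function(xi) xi * 2, numeric(1))
}

batchMap(process_block, block = blocks, reg = reg)
submitJobs(reg = reg)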

I see that this is an issue, especially considering the SQLite file locking issues with BatchJobs that prompted batchtools in the first place (and that I struggled with too when using BatchJobs).

If you write a single SQLite database per chunk, you only have to create it once and can then access it in read-only mode. This should solve the file locking issues. However, I'm still looking for viable alternatives.
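A minimal sketch of that idea, assuming one SQLite file per chunk and serialize()d result blobs (this is not batchtools' current storage format):

library(DBI)
library(RSQLite)

con = dbConnect(SQLite(), "results_chunk_001.sqlite")   # one database file per chunk
dbExecute(con, "CREATE TABLE IF NOT EXISTS results (job_id INTEGER PRIMARY KEY, value BLOB)")

# Worker side: append each finished result to the chunk's database
store_result = function(con, job_id, value) {
  dbExecute(con, "INSERT OR REPLACE INTO results VALUES (?, ?)",
            params = list(job_id, list(serialize(value, NULL))))
}

# Master side: fetch a single result without reading the whole file
load_result = function(con, job_id) {
  row = dbGetQuery(con, "SELECT value FROM results WHERE job_id = ?",
                   params = list(job_id))
  unserialize(row$value[[1]])
}

store_result(con, 1L, runif(5))
load_result(con, 1L)
dbDisconnect(con)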

I've linked clustermq in the README (1188e2a). Good luck with your publication!

HenrikBengtsson commented 5 years ago

Thank you both for this - I wasn't aware of this (either). Just a reflection below.

Yes, this seems like an example where you would need to manually chunk jobs together prior to using batchtools.

This is a very useful comment - one could say there are two types of chunking. The one batchtools provides is a lower-level kind of chunking, which might not be what the everyday user expects when coming from other parallelization frameworks, where they expect the kind of chunking proposed in "manually chunk jobs together prior to using batchtools". Maybe this could be clarified in the help and/or by changing the terminology. OTOH, couldn't batchtools add support for both types?