mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

`$conditions` blows up rds `results/*.rds` files #286

Closed JZL closed 1 year ago

JZL commented 1 year ago

Hi,

EDIT: Looking closer, this does seem pretty squarely a future + batchtools.future problem; it's just that my quick-to-implement solution was to slightly modify the batchtools source code. I'll make a better proof of concept and bring it up over there.

I'm not quite sure what to make of this since I use a weird combination of batchtools, batchtools.future, and a custom job runner.

But I'm running batches with 1k+ jobs and moderately sized global variables (< 200 MB) per job. I was trying to reduce the size of the results/1.rds files and noticed that the object.size value for the 1.rds can be much smaller than the actual on-disk size.

After some digging it turns out to be a known problem (e.g. here), and I narrowed it down to the $conditions part of the FutureResult object. If I clear that list's items, the saved RDS size goes from 80 MB to 780 KB.
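A minimal base-R sketch of why object.size() and the on-disk size can disagree like this. The condition class and its `env` field are hypothetical stand-ins for whatever environment a real captured condition might reference; this is not taken from future or batchtools. object.size() does not follow environments, while serialize() (and therefore saveRDS()) writes out everything a non-global environment contains.

```r
# Hypothetical condition constructor: the extra `env` field stands in for
# an environment that a real condition object might carry along.
make_condition <- function() {
  payload <- rnorm(1e6)  # about 8 MB of doubles living in this function's frame
  structure(
    class = c("customCondition", "message", "condition"),
    list(message = "step finished", call = NULL, env = environment())
  )
}

cond <- make_condition()

# object.size() counts the list itself but not the contents of `env`.
in_memory <- as.numeric(object.size(cond))

# serialize() walks into the environment, so `payload` is written out too.
# This is the uncompressed size of what saveRDS() would store.
serialized <- length(serialize(cond, NULL))

cat("object.size:", in_memory, "bytes\n")
cat("serialized: ", serialized, "bytes\n")
```

Comparing the two numbers for each element of a suspicious list is a quick way to find which element is dragging an environment along.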

There are definitely internal environments within those conditions, and I tried to just remove a few of them, but could never get the space back. I think it could be future's stored global variables, but I can't quite find where. I'm planning on bodging a fix where I replace the $environment with the paste0 version of each element's value (so I can still see any errors, but everything saved is just character vectors).

But I didn't know if this was an interesting enough problem for the package in general to be worth investigating further, or if you had any advice on cleaner ways of dealing with it.

Thanks! Batchtools is a huge help for parallelizing my code

HenrikBengtsson commented 1 year ago

Author of future here:

Something in your code ends up producing lots of R conditions (e.g. messages, warnings), or very large ones. All conditions are by default captured by futures (=on the parallel workers) and relayed as-is in the main R session.

I would try to identify what produces all those conditions and avoid them where possible (e.g. disable message output in a function via some argument, suppressMessages(), ...). If they cannot be avoided, then as a last resort, you can tell futures to not capture all types of conditions. For info on that, see argument conditions to future(), cf. https://future.futureverse.org/reference/future.html.
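As a rough illustration of that last resort, the sketch below creates one future with the default condition capture and one with conditions = character(0), which the future() documentation describes as the way to drop all conditions (errors are still relayed). The noisy() function is made up for the example, and the sequential plan is only used to keep it self-contained; it assumes the future package is installed.

```r
library(future)
plan(sequential)  # evaluate futures in the current R session

# Hypothetical worker function that emits a few messages.
noisy <- function() {
  for (i in 1:3) message("step ", i)
  42
}

# Default: all conditions (messages, warnings, ...) are captured.
f_all <- future(noisy())

# Capture no conditions at all; the messages are not stored in the result.
f_none <- future(noisy(), conditions = character(0))

# result() returns the FutureResult object without relaying conditions.
r_all <- result(f_all)
r_none <- result(f_none)

cat("captured by default:            ", length(r_all$conditions), "\n")
cat("captured with character(0) set: ", length(r_none$conditions), "\n")
```

Whether the same argument can be passed through a batchtools.future setup depends on how the futures are created there, so treat this as a starting point rather than a drop-in fix.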

PS. You closed the issue again without comments. I think it would be helpful to future visitors to know how you solved your problem.

JZL commented 1 year ago

Hi,

Oh, thanks for responding. Sorry I didn't see it until now.

Those are all really good points and I'll look into the conditions argument. My hacky solution (because it was just for personal use) is here. I just forcibly replaced the condition variable with the paste0 string version of it, as a compromise to help me debug with error messages, but without copying all the additional state.
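The linked patch is not reproduced here, but the general idea can be sketched in base R under some assumptions: the condition class, its `env` field, and flatten_condition() are all hypothetical names for illustration. Each condition is reduced to a plain string built from its class and message, so nothing but character data gets serialized.

```r
# Hypothetical condition that drags a large environment along with it.
make_condition <- function(msg) {
  payload <- rnorm(1e6)  # about 8 MB held in this function's frame
  structure(
    class = c("customCondition", "warning", "condition"),
    list(message = msg, call = NULL, env = environment())
  )
}

conditions <- list(
  make_condition("first problem"),
  make_condition("second problem")
)

# Keep only the class and message as text; drop all other state.
flatten_condition <- function(cond) {
  paste0(class(cond)[1], ": ", conditionMessage(cond))
}

flat <- vapply(conditions, flatten_condition, character(1))

# Uncompressed serialized sizes before and after flattening.
size_before <- length(serialize(conditions, NULL))
size_after <- length(serialize(flat, NULL))

print(flat)
cat("before:", size_before, "bytes, after:", size_after, "bytes\n")
```

The trade-off is that the flattened strings can no longer be re-signaled as real conditions, which matches the compromise described above: readable error messages for debugging, without the extra state.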

Yeah, closing it without comment is a bad habit. I know some package authors prefer not to have open issues, so for very open-ended issues I just close them immediately, but I can leave it up to the package author.