ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Performance of format = "file" vs. file_in() #1349

Closed. matthiasgomolka closed this issue 3 years ago.

matthiasgomolka commented 3 years ago

Question

I'm monitoring several hundred GB of files in a drake plan using dynamic branching, so these files are tracked via target(..., format = "file"). When I make() the plan, it takes ~25 minutes before any real work starts, because skipping the format = "file" targets (that's what the log file says it is doing) takes so long.

This is somewhat surprising, because in another drake plan that monitors even bigger files - using file_in() and file_out() instead of format = "file" - importing them takes only ~1 minute.

  1. Is file_in() / file_out() generally faster than format = "file"?
  2. Does the number of files have a large impact on the time it takes? Because in the first (slow) example, there are several thousand files to be monitored, whereas in the second example, there are only a few hundred (larger) files.
  3. I would like to stick with format = "file" since it's more flexible in my opinion. Is there anything I can do to speed things up?
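
For context, here is a minimal sketch of the kind of plan I mean (the data directory and read_one() are placeholders for my real setup):

library(drake)

plan <- drake_plan(
  # All paths to track; "data/raw" stands in for my real data folder.
  paths = list.files("data/raw", full.names = TRUE),
  # One dynamic branch per file; each branch returns a path that drake
  # tracks because of format = "file".
  files = target(
    paths,
    format = "file",
    dynamic = map(paths)
  ),
  # Downstream processing of the tracked files; read_one() is a placeholder.
  results = target(
    read_one(files),
    dynamic = map(files)
  )
)
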
wlandau commented 3 years ago

Is file_in() / file_out() generally faster than format = "file"?

On my system, format = "file" is actually faster, and neither is prohibitively slow. So I suspect other factors are at play here. If your data files sit on a remotely mounted network drive, that will slow things down a lot. Temporary storage, by comparison, is super fast (though not ideal for drake workflows).

> library(drake)
> tmp <- tempfile()
> saveRDS(rnorm(2e8), tmp, compress = FALSE)
> system(paste("du -h", tmp))
1.5G    /var/folders/k3/q1f45fsn4_13jbn0742d4zj40000gn/T//RtmpnpzE5P/file942c704805f7
> plan1 <- drake_plan(x = target(tmp, format = "file"))
> plan2 <- drake_plan(x = target(file_in(!!tmp)))
> digest::digest(tmp, file = TRUE)
[1] "0e2c2dbd19d4356ca71c7d1a74c4f585"
> system.time(make(plan1, verbose = 0))
ℹ Consider drake::r_make() to improve robustness.
   user  system elapsed 
  0.421   0.412   0.929 
> system.time(make(plan2, verbose = 0))
   user  system elapsed 
  0.690   0.753   1.450 

Does the number of files have a large impact on the time it takes? Because in the first (slow) example, there are several thousand files to be monitored, whereas in the second example, there are only a few hundred (larger) files.

If dealing with files is really the bottleneck, then runtime should scale linearly with the number of files. 100 large files is already a lot.

I would like to stick with format = "file" since it's more flexible in my opinion. Is there anything I can do to speed things up?

I recommend using proffer or a similar profiling package to identify the bottleneck. Otherwise it is difficult to speculate about what might be slow.
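
For example, a rough sketch of profiling the make() call itself (assuming your plan object is called plan; profvis is an alternative if you cannot install proffer's Go/pprof dependency):

library(drake)
# Profile the whole pipeline with proffer (needs Go / pprof installed):
px <- proffer::pprof(make(plan, verbose = 0))
# Or with profvis instead:
profvis::profvis(make(plan, verbose = 0))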

matthiasgomolka commented 3 years ago

Thanks for the explanation. I was able to verify that format = "file" is indeed a little faster than file_in(). I'm profiling right now and will come back with the results.

Just one more question regarding your plan2: Why do you specify format = "file" there as well? Until now, my understanding was that I don't need that for file_in() to work. Does that change anything or is it just to make the goal of the target more explicit?

wlandau commented 3 years ago

Thanks for the explanation. I was able to verify that format = "file" is indeed a little faster than file_in(). I'm profiling right now and will come back with the results.

Nice. I am curious about the bottleneck. Do your data files live on a network drive?

Just one more question regarding your plan2: Why do you specify format = "file" there as well? Until now, my understanding was that I don't need that for file_in() to work. Does that change anything or is it just to make the goal of the target more explicit?

My mistake: I originally meant to write format = "file" only in plan1, not in plan2. I just updated the example in https://github.com/ropensci/drake/issues/1349#issuecomment-739067635.

matthiasgomolka commented 3 years ago

Do your data files live on a network drive?

Yes. And it seems as if there is no other bottleneck. Below are some screenshots from {profvis} (I don't have the system requirements for {proffer} at work).

Do I understand correctly from the last screenshot that it takes quite a while to uncompress the RDS file(s) (containing what exactly?) in the cache? But there is no way to store these using qs, since I already use format = "file", right?

So I think I basically have to live with it. Do I understand correctly that a higher jobs_preprocess in drake_config() should have a positive impact? (I already use jobs_preprocess = 8L.)
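
For reference, this is the kind of call I mean (plan stands for my real plan object):

# Via drake_config(), e.g. inside _drake.R for r_make():
config <- drake_config(plan, jobs_preprocess = 8L)
# Or directly in make():
make(plan, jobs_preprocess = 8L)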

I know this is suboptimal, but the full .Rprofvis file is > 100MB and I have no clue what information it includes, so I'm reluctant to upload it here.

If this is too tedious for you, feel free to close the issue without any further ado.


[profvis screenshots]

wlandau commented 3 years ago

This makes sense to me. drake is just trying to check the existence and modification times of a large number of files on a remote network drive. Unfortunately, there is nothing drake itself can do to speed this up.
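
If you want to double-check that the network share itself is the bottleneck, here is a rough sketch of timing the raw file operations outside of drake (the directory is a placeholder for your mount point):

# Gather the monitored files (placeholder path):
files <- list.files("/mnt/network/share", recursive = TRUE, full.names = TRUE)
# Metadata checks of the kind drake repeats on every make():
system.time(file.exists(files))
system.time(file.mtime(files))
# Hashing a few files shows the extra cost when contents must be re-checked:
system.time(lapply(head(files, 10), digest::digest, file = TRUE))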

If you are on Linux, you can install proffer's system requirements locally in your home folder with proffer::install_go(). Mac and Windows might also have ways to install Go (and thus pprof) locally, but I am not sure.