Closed matthiasgomolka closed 3 years ago
> Is `file_in()` / `file_out()` generally faster than `format = "file"`?

On my system, `format = "file"` is actually faster, and neither is prohibitively slow. So I suspect other factors are at play here. If your data files sit on a remotely mounted network drive, that will slow things down a lot. Temporary storage, by comparison, is super fast (though not ideal for `drake` workflows).
```r
> library(drake)
> tmp <- tempfile()
> saveRDS(rnorm(2e8), tmp, compress = FALSE)
> system(paste("du -h", tmp))
1.5G    /var/folders/k3/q1f45fsn4_13jbn0742d4zj40000gn/T//RtmpnpzE5P/file942c704805f7
> plan1 <- drake_plan(x = target(tmp, format = "file"))
> plan2 <- drake_plan(x = target(file_in(!!tmp)))
> digest::digest(tmp, file = TRUE)
[1] "0e2c2dbd19d4356ca71c7d1a74c4f585"
> system.time(make(plan1, verbose = 0))
ℹ Consider drake::r_make() to improve robustness.
   user  system elapsed
  0.421   0.412   0.929
> system.time(make(plan2, verbose = 0))
   user  system elapsed
  0.690   0.753   1.450
```
Does the number of files have a large impact on the time it takes? In the first (slow) example, there are several thousand files to be monitored, whereas in the second example there are only a few hundred (larger) files.
If dealing with files is really the bottleneck, then runtime should scale linearly with the number of files. 100 large files is already a lot.
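The linear scaling is easy to see with base R alone: checking a file's existence or modification time is one filesystem `stat()` call, so total cost grows with the number of files, and each call becomes a network round trip on a remote mount. A minimal sketch (directory and file names are made up for illustration):

```r
# Minimal sketch (base R only): the kind of per-file metadata checks
# a file-watching workflow performs. One stat() per file, so total
# time grows linearly with the number of files; on a remote mount
# each stat() is a network round trip.
n <- 1000
dir <- file.path(tempdir(), "scaling-demo")
dir.create(dir, showWarnings = FALSE)
files <- file.path(dir, sprintf("data_%04d.txt", seq_len(n)))
for (f in files) writeLines("x", f)

elapsed <- system.time({
  present <- file.exists(files)   # one stat() per file
  mtimes  <- file.mtime(files)    # another stat() per file
})["elapsed"]
cat(sprintf("Checked %d files in %.3f seconds\n", n, elapsed))
unlink(dir, recursive = TRUE)
```

On fast local storage this finishes almost instantly; the same loop over a slow network mount is where the minutes go.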
> I would like to stick with `format = "file"` since it's more flexible in my opinion. Is there anything I can do to speed things up?
I recommend using the `proffer` package or a similar profiling package to identify the bottleneck. Otherwise, it is difficult to speculate about what might be slow.
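A sketch of how such a profiling run might look, using a tiny stand-in plan (the plan here is illustrative, not the original project's; `profvis` is shown as a fallback for machines without proffer's Go/pprof system requirements):

```r
# Sketch: profile a small make() call to locate the bottleneck.
# The plan below is a toy stand-in; on a real project you would
# profile make() on the actual plan.
library(drake)

tmp <- tempfile()
saveRDS(rnorm(1e6), tmp)
plan <- drake_plan(x = target(tmp, format = "file"))

# With proffer (requires Go / pprof to be installed):
# proffer::pprof(make(plan, verbose = 0))

# With profvis as a fallback (opens an interactive flame graph):
# profvis::profvis(make(plan, verbose = 0))
```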
Thanks for the explanation. I could verify that `format = "file"` is indeed a little faster than `file_in()`. I'm profiling right now and will come back with the results.
Just one more question regarding your `plan2`: Why do you specify `format = "file"` there as well? Until now, my understanding was that I don't need that for `file_in()` to work. Does that change anything, or is it just to make the goal of the target more explicit?
> Thanks for the explanation. I could verify that `format = "file"` is indeed a little faster than `file_in()`. I'm profiling right now and will come back with the results.
Nice. I am curious about the bottleneck. Do your data files live on a network drive?
> Just one more question regarding your `plan2`: Why do you specify `format = "file"` there as well? Until now, my understanding was that I don't need that for `file_in()` to work. Does that change anything or is it just to make the goal of the target more explicit?
My mistake, I originally meant to write `format = "file"` only in `plan1` and not `plan2`. Just updated the example in https://github.com/ropensci/drake/issues/1349#issuecomment-739067635.
> Do your data files live on a network drive?
Yes. And it seems as if there is no other bottleneck. Below are some screenshots from {profvis} (I don't have the system requirements for {proffer} at work).
Do I understand correctly from the last screenshot that it takes quite a while to uncompress the RDS file(s) (containing what exactly?) in the cache? But there is no way to store these using `qs`, since I already use `format = "file"`, right?
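For what it's worth, the two formats can coexist in one plan: `format = "file"` tracks paths on disk, while `format = "qs"` only applies to targets whose return values `drake` serializes into the cache. A hedged sketch (the path and the `summary()` step are hypothetical, just to show the shape):

```r
# Sketch: mixing storage formats in one plan.
# "data/raw.rds" and the processing step are hypothetical.
library(drake)

plan <- drake_plan(
  # Tracked on disk: drake watches/hashes the path itself,
  # so format = "qs" cannot apply here.
  raw = target("data/raw.rds", format = "file"),
  # Stored in the cache: the return value is serialized with qs,
  # which is where format = "qs" can speed things up.
  processed = target(summary(readRDS(raw)), format = "qs")
)
```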
So I think I basically have to live with it. Do I understand correctly that more `jobs_preprocess` in `drake_config()` should have a positive impact? (I already use `jobs_preprocess = 8L`.)
I know this is suboptimal, but the full `.Rprofvis` file is > 100 MB and I have no clue what information it includes, so I'm reluctant to upload it here.
If this is too tedious for you, feel free to close the issue without any further ado.
This makes sense to me. `drake` is just trying to check the existence and modification times of a large number of files on a remote network drive. Unfortunately, there is nothing `drake` itself can do to speed this up.
If you have Linux, you can install the `proffer` system requirements locally in your home folder with `proffer::install_go()`. Mac and Windows might also have ways to install Go (and thus `pprof`) locally, I am not sure.
Prework
Question
I'm monitoring several hundred GB of files in a drake plan using dynamic branching. Thus, these files are monitored via `target(..., format = "file")`. When I `make()` the plan, it takes ~ 25 minutes until it actually starts, because it takes so long to `skip` (that's what it says in the log file) the targets with `format = "file"`.

This is somewhat surprising, because in another drake plan where even bigger files are monitored - using `file_in()` and `file_out()` instead of `format = "file"` - importing these takes only ~ 1 minute. Is `file_in()` / `file_out()` generally faster than `format = "file"`?

I would like to stick with `format = "file"` since it's more flexible in my opinion. Is there anything I can do to speed things up?
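The setup described above might look roughly like the following sketch (the `"data"` directory and the dynamic branching pattern are assumptions for illustration, not taken from the original project):

```r
# Sketch of the described setup: dynamic branching over many files,
# each tracked via format = "file". The "data" directory is hypothetical.
library(drake)

plan <- drake_plan(
  # All data files to monitor.
  paths = list.files("data", full.names = TRUE),
  # One dynamic sub-target per file; each one is checked
  # (existence, hash/mtime) on every make().
  files = target(paths, format = "file", dynamic = map(paths))
)
# make(plan)  # each sub-target's file check hits the network drive
```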