shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl

Very slow parallel performance on large file trees #63

Open · bgroenks96 opened this issue 2 years ago

bgroenks96 commented 2 years ago

I noticed that lazy computations on my FileTree were really, really slow (and always slower than sequential evaluation), so I did a quick test:

using Distributed
addprocs(exeflags=["--sysimage=JuliaSysImage.so", "--threads=1", "--project"])
@everywhere using FileTrees, Distributed
tree = FileTree("my/path/here")

where the path is replaced with some file tree (in my case it is a tree consisting of about 3600 files). This shouldn't matter that much for the test, I would think.

I ran a lazy FileTrees.load with a simple throwaway operation to check the overhead of parallel processing. Now, I would fully expect parallelization not to pay off here given the simplicity of the operation. What is shocking (and quite frankly intolerable), however, is that the overhead adds several orders of magnitude to wall-clock execution time.

julia> @time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
614.403524 seconds (365.52 M allocations: 11.479 GiB, 1.19% gc time, 0.07% compilation time)

julia> @time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
103.727996 seconds (275.26 M allocations: 5.656 GiB, 2.51% gc time, 0.32% compilation time)

julia> @time exec(FileTrees.load(f -> 2+2, tree));
  1.112900 seconds (1.46 M allocations: 94.482 MiB, 3.98% gc time, 53.46% compilation time)

The first attempt took over 10 minutes compared to 1 second for sequential execution. This looks to be precompilation related given the speed-up on the second attempt. However, even then, it is 100x slower than sequential evaluation, which I think far exceeds reasonable overhead even for such a simple operation.

I am running this on a remote machine that uses a networked file system (NFS), so I suspect that this may also be part of the problem.

Am I missing anything or doing something stupid here? Is this a known issue?

EDIT: I updated the title to better reflect the suspected issue, per the feedback of @DrChainsaw .

DrChainsaw commented 2 years ago

I can reproduce (although I get about half those times). It seems like something is cursed with the Thunks from FileTrees.load:

tree = maketree("root" => string.(1:3600));

addprocs(exeflags=["--threads=1", "--project"])
4-element Vector{Int64}:
 2
 3
 4
 5

@everywhere using FileTrees, Distributed

using FileTrees.Dagger

function dtest(thunks)
    @time chunks = compute(delayed((xs...) -> [xs...]; meta=true)(thunks...))
    @time asyncmap(collect, chunks)
end

# First runs excluded as I moved things around and can't be bothered to restart. The first run took about 300 seconds for me.

# What we want to compute
thunks_okish = [delayed(+)(2,2) for _ in 1:3600];

# Not great, but better than below
@time dtest(thunks_okish);
 #12.019530 seconds (2.65 M allocations: 187.081 MiB, 0.79% gc time, 1.34% compilation time)
  #0.174506 seconds (294.31 k allocations: 12.355 MiB)
 #12.221474 seconds (3.01 M allocations: 203.461 MiB, 0.77% gc time, 1.54% compilation time)

# Same thing from the FileTree
thunks_notok = values(FileTrees.load(f -> 2+2, tree, lazy=true));

@time dtest(thunks_notok);
 #47.549520 seconds (80.41 M allocations: 2.002 GiB, 1.34% gc time, 0.05% compilation time)
  #0.215092 seconds (267.86 k allocations: 11.501 MiB)
 #47.764741 seconds (80.68 M allocations: 2.013 GiB, 1.33% gc time, 0.05% compilation time)

# Maybe the anonymous function or some type instability with File is the problem?
thunks_file = [delayed(f -> 2+2)(File(nothing, "root/$i")) for i in 1:3600];

# Does not seem like it
@time dtest(thunks_file);
 #12.157954 seconds (2.57 M allocations: 182.432 MiB, 0.94% gc time, 0.34% compilation time)
  #0.170773 seconds (293.11 k allocations: 12.300 MiB)
 #12.328912 seconds (2.87 M allocations: 194.741 MiB, 0.92% gc time, 0.33% compilation time)

# What if we add the whole tree as the parent
thunks_filetree = [delayed(f -> 2+2)(File(tree, "root/$i")) for i in 1:3600];

# That makes the difference!
@time dtest(thunks_filetree);
 #48.115307 seconds (80.62 M allocations: 2.012 GiB, 2.14% gc time, 0.04% compilation time)
  #0.190724 seconds (291.17 k allocations: 12.188 MiB)
 #48.306223 seconds (80.92 M allocations: 2.024 GiB, 2.13% gc time, 0.04% compilation time)

Could it be that the time is spent hauling the whole FileTree around to every single worker? The big tree is the parent of each File in both the load case and the thunks_filetree case. If that is indeed the problem, it should probably be improved in FileTrees so that the function given to load only receives the filename.

Here is a workaround that does just that:

tree_names = FileTrees.load(name, tree);

@time exec(mapvalues(filename -> 2+2, tree_names; lazy=true));
 #13.425377 seconds (3.93 M allocations: 260.063 MiB, 0.92% gc time, 3.84% compilation time)

Fwiw, I first accidentally tried to reproduce with Dagger 0.14.1 and then I got much better times for subsequent runs:

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#266.588155 seconds (48.29 M allocations: 3.034 GiB, 0.78% gc time, 0.29% compilation time)

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#2.548957 seconds (3.03 M allocations: 184.613 MiB, 2.99% gc time, 8.65% compilation time)

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#2.948732 seconds (3.65 M allocations: 220.772 MiB, 2.44% gc time, 18.68% compilation time)

@time exec(FileTrees.load(f -> 2+2, tree));
#0.776767 seconds (1.06 M allocations: 63.422 MiB, 63.10% compilation time)

@jpsamaroo: Any hints here? Is there a performance regression in Dagger between 0.14.1 and 0.14.3?

As a side note, is there a way to have Dagger compute multiple Thunks in parallel without having to resort to the splatting and vcat-ing done above (same as FileTrees does)? Could using the dynamic API be a better approach?
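
For reference, here is a rough sketch of what the eager/dynamic API alternative might look like (assuming Dagger.@spawn and fetch behave the same in this Dagger version; I have not benchmarked this):

using Dagger

# Each task is submitted independently to the eager scheduler, so there is no
# need to splat all thunks into a single delayed call just to compute them.
tasks = [Dagger.@spawn 2 + 2 for _ in 1:3600];
results = fetch.(tasks);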

bgroenks96 commented 2 years ago

I didn't realize the whole tree was being passed around between the workers! That would probably explain it.

But does this mean it has nothing to do with the NFS?

DrChainsaw commented 2 years ago

I didn't realize the whole tree was being passed around between the workers! That would probably explain it.

I didn't realize that either at first :) I think this is a bit of an oopsie. Note that it only seems to happen for load and save. The other "main workhorse" functions (map, mapvalues, reducevalues) don't pass the File.

But does this mean it has nothing to do with the NFS?

The expression you timed does not touch the file system. The only interaction was when you created the tree (which traversed the directory structure without looking at the file contents). After that, the whole structure is an in-memory copy, and you would only access the file system if you actually used the name of a File to read the contents of the file with the same name inside the load function.

With the disclaimer that my familiarity with the design of FileTrees mostly comes from reverse engineering: it is not tightly coupled to files at all. It is more of a generic in-memory tree structure which 1) has built-ins for creating a tree from a file structure (and vice versa) and 2) happens to borrow some names for manipulating the structure (e.g. mv, cp). The discussion in #9 seems to agree with this.
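
For illustration, a hedged sketch of that view (assuming the maketree and mv(t, ::Regex, ::SubstitutionString) methods behave as documented; the exact path-matching semantics may differ):

using FileTrees

# A tree built purely in memory; nothing here touches the filesystem.
t = maketree("root" => ["a" => ["1.csv", "2.csv"], "b" => ["3.csv"]])

# mv only rearranges the in-memory structure; no file on disk is moved.
t2 = FileTrees.mv(t, r"a/(.*)", s"moved/\1")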

bgroenks96 commented 2 years ago

The expression you timed does not touch the file system.

Right, that makes sense, but surely it has to query the file system at least to build the FileTree, right? I am just wondering if, in the process of traversing the tree or communicating/transferring the tree to the workers, it does this repeatedly or recursively. This could cause a slowdown on NFS because filesystem operations like querying file lists tend to carry a lot more overhead than on a normal, local filesystem.

DrChainsaw commented 2 years ago

Right, that makes sense, but surely it has to query the file system at least to build the FileTree right?

Yes, that happens locally in this call:

tree = FileTree("my/path/here")

After that you have an in-memory tree structure which just happens to match the file structure (as it was created from it), but is in no way connected to it. For example, you can remove/add Files and FileTrees (directories) to the tree and it has no effect at all on the filesystem.

Didn't mean to imply that there can be no issues with NFS, just that the poor performance in the OP does not seem to be related to NFS.

I am just wondering if, in the process of traversing the tree or communicating/transferring the tree to the workers, it does this repeatedly or recursively.

Nothing unexpected should be happening here. The function supplied to load is called once per File in the tree, and each call is scheduled as a separate Thunk by Dagger. Any recursion/traversal of the tree structure happens locally and only on the in-memory tree (with the exception of the initial creation above, and save). You would only access the filesystem if you actually used the path of the File to access it (which you typically do), and then only that access is made.
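
To make that concrete, here is a hedged sketch (not one of the benchmarks above) of what a more realistic load looks like; the only filesystem access is the explicit one inside the supplied function:

# The supplied function runs once per File as its own Thunk, on whichever
# worker Dagger schedules it; only this explicit access touches the filesystem.
lazy_tree = FileTrees.load(tree; lazy=true) do file
    open(io -> countlines(io), string(path(file)))
end
exec(lazy_tree)   # schedules one Thunk per File via Dagger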

The "passing the whole tree" concern is just that the whole tree might be big and transferring it when the Thunk should only work on a tiny part of it is not very elegant.

When loading, there is of course not a lot of data attached to each file, but there could be some cost to serializing/deserializing a complex data structure even if it is not very large (I have no idea, to be honest). As seen in the Dagger 0.14.1 benchmarks, it does not seem to be a disaster, so maybe there is something w.r.t. Dagger one needs to understand better here.

bgroenks96 commented 2 years ago

Ok, I see. What about the values associated with each node? Would they also be transferred for every Thunk?

Edit:

When loading, there is of course not a lot of data attached to each file

I guess that actually answers this question :sweat_smile:

DrChainsaw commented 2 years ago

Ok, I see. What about the values associated with each node? Would they also be transferred for every Thunk?

Values might be moved as a result of applying further transformations to the data (e.g. through mapvalues and reducevalues), since each call to such functions creates new Thunks which depend on the previous Thunks.

FileTrees delegates the trade-off between moving the values around and not using all available compute resources to Dagger's scheduler at execution time.
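
For example (a hedged sketch, not benchmarked here), a chain like this creates new Thunks at every step, and Dagger decides where each one runs:

# Each step creates new Thunks that depend on the previous ones; whether the
# intermediate values move between workers is up to Dagger's scheduler.
loaded = FileTrees.load(f -> read(string(path(f))), tree; lazy=true)
sizes  = mapvalues(length, loaded)   # one Thunk per file, depends on `loaded`
total  = reducevalues(+, sizes)      # a single Thunk depending on all `sizes`
exec(total)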

jpsamaroo commented 2 years ago

I'm on vacation right now and don't have time to deeply investigate, but I will make a note to look into this when I get back on Monday!

jpsamaroo commented 2 years ago

I can reproduce this, even with https://github.com/JuliaParallel/Dagger.jl/pull/334. Performance is only bad with workers; using threads works fine, although there is no significant speedup. The reasons: compile time and GC time. It seems like we're spending 50%+ of the time compiling code, and 10-30% doing GC. (Note that this isn't easy to see with workers, which is why I often double-check with threads-only.)

I've verified with SnoopCompile that most of the inference time is in Dagger, so I'll see how much of it I can fix, before looking at GC overhead.
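
For reference, a hedged sketch of that kind of SnoopCompile measurement (not necessarily the exact invocation used):

# Measures inference on the local process only, so it is easiest to interpret
# with threads rather than workers, as noted above.
using SnoopCompileCore
tinf = @snoopi_deep exec(FileTrees.load(f -> 2+2, tree, lazy=true))

using SnoopCompile
tinf                     # summary: inference time vs. total time
flatten(tinf)            # per-method inference timings, sorted
inference_triggers(tinf) # calls that forced fresh inference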

DrChainsaw commented 2 years ago

I've verified with SnoopCompile that most of the inference time is in Dagger, so I'll see how much of it I can fix, before looking at GC overhead.

Awesome, thanks a lot!

Here are numbers I get with Dagger 0.14.3 and current master:

# Dagger 0.14.3
using FileTrees, Distributed

addprocs(exeflags=["--threads=1", "--project"]);

# Note: different number of files (and also a different machine, iirc) compared to above.
tree = maketree("root" => string.(1:2000));

@everywhere using FileTrees, Distributed

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
 #34.666125 seconds (47.18 M allocations: 1.800 GiB, 2.27% gc time, 36.52% compilation time)

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
 #13.053516 seconds (26.61 M allocations: 726.402 MiB, 2.46% gc time, 2.00% compilation time)

# Dagger#master
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
 #34.271587 seconds (45.26 M allocations: 1.691 GiB, 2.09% gc time, 35.65% compilation time)

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
 #12.972358 seconds (26.36 M allocations: 705.501 MiB, 2.63% gc time, 3.00% compilation time)

Do you have any thoughts on the performance implications of passing the whole FileTree to the functions given to load and save (which each only load or save a single file)? Timing just the serialize and deserialize does not seem to show any large overhead, but perhaps things get worse when deserialization happens on another worker:

julia> using Serialization

julia> iob = PipeBuffer();

julia> @time serialize(iob, tree);
  0.669635 seconds (1.32 M allocations: 76.045 MiB, 5.99% gc time, 92.14% compilation time)

julia> @time deserialize(iob);
  0.375035 seconds (614.09 k allocations: 32.761 MiB, 3.98% gc time, 99.05% compilation time)
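
A hedged way to also include the worker-side deserialization (not measured above) would be to time shipping the whole tree to a worker:

julia> using Distributed

julia> w = first(workers());

# remotecall_fetch serializes `tree` locally and deserializes it on the worker,
# so this roughly measures the per-Thunk cost of shipping the whole tree
# (minus scheduling overhead).
julia> @time remotecall_fetch(t -> nothing, w, tree);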