shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl

Add cluster + distributed fs documentation #29

Open xiaodaigh opened 4 years ago

xiaodaigh commented 4 years ago

I see in the docs that FileTree can work on a computer with multiple processes. But can it be used on a cluster? The file/folder structure doesn't seem to lend itself well to distributed processing.

shashi commented 4 years ago

You would need a distributed file system. I don't think I have plans to make it work without that, because it feels like an orthogonal abstraction to FileTrees.
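For example, something like this (a rough sketch with a hypothetical path) checks that every worker actually sees the same storage before you hand the tree to FileTrees:

using Distributed

# Hypothetical shared path; it must live on storage mounted on every node
# (NFS, GPFS, Lustre, ...), otherwise remote workers cannot load/save the files.
const shared_root = "/gpfs/scratch/myuser/mydata"

# Sanity check: every worker should see the directory.
for p in workers()
    @assert remotecall_fetch(isdir, p, shared_root) "worker $p cannot see $shared_root"
end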

shashi commented 4 years ago

That said I should add this to the docs.

shashi commented 4 years ago

If anybody tries this, I'd love to get a concise list of the commands used. If not, I'll try it myself soon.

ym-han commented 4 years ago

I plan to write a tutorial sometime in the next few weeks. But I thought it might be helpful to put the code I used out there first; there probably aren't any surprises, honestly:

SLURM SCRIPT

Basically what you'd expect; the following code is for an embarrassingly parallel problem; I didn't use any multithreading:

#!/bin/bash
#SBATCH --job-name="slurm_1987"
#SBATCH --output="slurm_1987.%j.%N.out"
#SBATCH --error="slurm_1987.%j.err"
#SBATCH -n 20
#SBATCH --export=ALL
#SBATCH -c 1
#SBATCH -t 28:00:00 
#SBATCH --mem-per-cpu=12G

module load julia/1.5.0
cd /users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker
julia --project=@. /users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker/main_slurm_87.jl

THE JULIA SCRIPT

# Make sure ClusterManagers is available in the environment
try
    using ClusterManagers
catch
    import Pkg
    Pkg.add("ClusterManagers")
end

using ClusterManagers
addprocs_slurm(20, exeflags="--project=$(Base.active_project())")
# no need to add time or mem here

using Distributed

@everywhere begin
    import Pkg; Pkg.instantiate()
    include("/users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker/more_jl_code.jl")    
    const year = 1987
    const final_data_dir = "/users/yh31/scratch/datasets/entity_linking/final_data/prototype_aug_2020"
    const nyt_data_dir = "/gpfs/scratch/yh31/datasets/entity_linking/raw_data/nyt_corpus/data"
end

data_tree = FileTree(joinpath(nyt_data_dir, string(year)))  # year is an Int, so convert it for joinpath
linked_dfs = FileTrees.load(data_tree, lazy = true) do f
        @_ f |> string(FileTrees.path(__))|> 
                   get_data_from_file(__) |> 
                               clean!(__) |> 
                          entity_link(__) 
end

# Add corefs
coref_dfs = mapvalues(linked_dfs) do df
    @_ df |> coref_chains!(__)  |> 
        corefs_per_entity!(__)  |> 
        coref_chain_cleanup!(__)
end

out_tree = FileTrees.rename(coref_dfs, joinpath(final_data_dir, string(year)))

FileTrees.save(out_tree) do f
    savejdf(string(FileTrees.path(f)) * ".jdf", FileTrees.get(f))
end

Basically the only Slurm-related things I had to do were the ones you'd have to do for any distributed computing in Julia on Slurm (though if you were doing multithreading, you'd need to add the one or two lines of code you posted in the ultimate guide to distributed computing thread).
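For completeness, a rough sketch of what that multithreaded variant would look like (the flag values here are placeholders, not something I actually ran):

using Distributed, ClusterManagers

# Sketch only: pair `#SBATCH -c 4` in the batch script with `--threads=4`
# on each worker so Julia (1.5+) actually starts 4 threads per worker.
addprocs_slurm(20, exeflags=`--project=$(Base.active_project()) --threads=4`)

# Confirm the workers picked up the thread count.
@everywhere @show Threads.nthreads()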

ym-han commented 4 years ago

Oh, and I ended up just using blank DataFrames instead of FileTrees.NoValue(), even after you added support for that to the master branch; I can't remember now why I thought blank DataFrames were still better than NoValue.

DrChainsaw commented 3 years ago

Hi,

I just tried out FileTrees with LSF and ClusterManagers and I got things to work pretty much out of the box.
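For reference, the setup was essentially the same as the Slurm example above, just with the LSF manager (the worker count is a placeholder):

using Distributed, ClusterManagers

# Same pattern as the Slurm script above, just via LSF/bsub.
addprocs_lsf(20, exeflags="--project=$(Base.active_project())")

@everywhere using FileTrees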

There are a few gotchas, though, which maybe can be smoothed out (or maybe are impossible to fix, I dunno).

  1. From reading the Dagger documentation I get the impression that the scheduler is initialized with the workers it can use. Does that mean one needs to wait for addprocs_xxx to initialize all workers before FileTrees can start processing, or else new workers will not be used? I guess I could run some experiments to verify whether this is the case (rough sketch of one after this list), but I'm hoping @shashi knows the answer already :)

  2. Also somewhat unverified: if one worker crashes, it seems like the whole result is lost. Is it possible to obtain a partial result in this case? For simple cases (pure mappings and reductions to a single value) it seems like it should be possible to just resume from where things stopped working.
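For point 1, the experiment I have in mind is roughly this (placeholder path and worker counts; LSF because that's what I have access to):

using Distributed, ClusterManagers, FileTrees

addprocs_lsf(2)
@everywhere using FileTrees

t = FileTree("some/shared/dir")              # placeholder path
lazy = FileTrees.load(t, lazy = true) do f
    myid()                                   # record which worker loaded this file
end

addprocs_lsf(2)                              # workers added after the lazy graph was built
@everywhere using FileTrees

# If the ids of the late-added workers show up here, the scheduler picked them up.
@show unique(reducevalues(vcat, exec(lazy)))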