guicho271828 / latplan

LatPlan : A domain-independent, image-based classical planner

Questions about train_propositional.sh #19

Closed aymeric75 closed 2 years ago

aymeric75 commented 2 years ago

Hello,

It seems that jbsub is a custom scheduler that we don't have access to. My cluster uses srun instead.

So I tried replacing the first line that calls jbsub (l.47 in train_propositional.sh) with an srun call. Here is my file so far:


#!/bin/bash

set -e

trap exit SIGINT

ulimit -v 16000000000

export PYTHONUNBUFFERED=1
# sokoban problem 2 has the same small screen size as problem 0, and has more than 20000 states unlike problem 0.
# ('sokoban_image-20000-global-global-0-train.npz', array([56, 56,  3]), (3613, 1, 9408)) --- problem 0 has only 3613 states!
# ('sokoban_image-20000-global-global-2-train.npz', array([56, 56,  3]), (19999, 1, 9408))
export skb_train=sokoban_image-20000-global-global-2-train
export SHELL=/bin/bash
export common

task (){
    script=$1 ; shift
    mode=$1
    # main training experiments. results are used for planning experiments

    $common $script $mode hanoi     4 4           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode hanoi     3 9           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode hanoi     4 9           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode hanoi     5 9           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode puzzle    mnist    3 3  {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode lightsout digital    5  {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode lightsout twisted    5  {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common -queue x86_12h $script $mode puzzle    mandrill 4 4  {} $comment ::: 20000 ::: CubeSpaceAE_AMA3Conv
    $common -queue x86_24h $script $mode puzzle    mandrill 4 4  {} $comment ::: 20000 ::: CubeSpaceAE_AMA4Conv
    $common -queue x86_6h  $script $mode sokoban   $skb_train    {} $comment ::: 20000 ::: CubeSpaceAE_AMA3Conv
    $common -queue x86_12h $script $mode sokoban   $skb_train    {} $comment ::: 20000 ::: CubeSpaceAE_AMA4Conv
    $common -queue x86_12h $script $mode blocks    cylinders-4-flat {} $comment ::: 20000 ::: CubeSpaceAE_AMA3Conv
    $common -queue x86_24h $script $mode blocks    cylinders-4-flat {} $comment ::: 20000 ::: CubeSpaceAE_AMA4Conv
}

export -f task

proj=$(date +%Y%m%d%H%M)sae-planning
number=2

################################################################
## Train the network, and run plot, summary, dump as the job finishes
#common="parallel -j 1 --keep-order jbsub -mem 16g -cores 1+1 -queue x86_6h -proj $proj -require 'v100||a100'"
common="parallel -j 1 --keep-order srun -N 1 -p g100_usr_interactive --gres=gpu:1 -proj $proj -require 'v100||a100'"

export comment=kltune$number
parallel -j 1 --keep-order task ./train_kltune.py learn_summary_plot_dump ::: {1..30}

exit

Which creates the error:

srun: fatal: Can not execute 202205230755sae-planning

I have a hard time understanding what the "202205230755sae-planning" executable corresponds to, and what the "-proj" argument of jbsub does.

Best regards

Aymeric

guicho271828 commented 2 years ago

-proj is just a tag to assign to jobs. In my experience both LSF and Torque had this feature, so surely Slurm has one too.
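For reference, here is a minimal sketch of how the srun line might be adapted, under some assumptions. The error most likely arises because srun does not know jbsub's -proj flag: srun's short option -p takes a value, so -proj is probably read as -p roj, and the value of $proj then becomes the first positional argument, which srun treats as the command to execute. In Slurm, --job-name is the usual way to label a job and --constraint filters nodes by feature. The feature names v100 and a100 are assumptions here, since they only work if the cluster administrators defined them, and the partition name is copied from the post above.

```bash
#!/bin/bash
# Sketch only: a Slurm-flavoured replacement for the jbsub-based `common` line.
proj=$(date +%Y%m%d%H%M)sae-planning

# --job-name stands in for jbsub's -proj (a label attached to the job).
# --constraint stands in for jbsub's -require; in Slurm, '|' means OR.
# The single quotes stay inside the string because GNU parallel hands the
# command line to a shell, where a bare '|' would be taken as a pipe.
# ASSUMPTION: the cluster defines node features named v100 and a100.
common="parallel -j 1 --keep-order srun -N 1 -p g100_usr_interactive --gres=gpu:1 --job-name=$proj --constraint='v100|a100'"

echo "$common"
```

The `-queue x86_12h` style arguments inside task() are also jbsub-specific and would need to be dropped or replaced, for example with a time limit such as `--time=12:00:00` if the cluster enforces one.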