mt2017-tartu-shared-task / nmt-system-E


Queue time #6

Closed dimatkachuk closed 7 years ago

dimatkachuk commented 7 years ago

Hello.

Our team has a problem: we haven't submitted our model for training yet because we are still running the SLURM job for preprocessing, but to find out that we made a mistake or exceeded the memory limit at some step, we have to wait in the queue for a very long time. Can you advise something?

MaksymDel commented 7 years ago

You can still have your baseline ready in time. For this, I would use a small BPE vocabulary (30,000) and test and dev sets that are not too big (50,000 and 25,000 tokens). You can also look at other teams' scripts for reference; they are open source.
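A minimal sketch of the BPE step with those numbers, using the subword-nmt scripts that sit under OpenNMT-py/tools in the lab setup (the file names here are just placeholders for your tokenized, truecased data):

cat train.et train.en | ../OpenNMT-py/tools/subword-nmt/learn_bpe.py -s 30000 > eten.bpe
../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten.bpe < train.et > bpe.train.et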

Regarding your queue issue, the rocket CPU nodes are probably heavily loaded right now.

So try to allocate GPU resources and run your preprocessing scripts that way. Just tell SLURM that you want a GPU for your job (an example script is in the materials repo) and execute the same command.
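A minimal sketch of the extra lines to put in your batch script for this (they mirror the example in the materials repo and the partition/GPU names used on rocket):

#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1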

How is it going with your other issues: #5 #4 #3 #2 #1 ? Can you close some of them?

dimatkachuk commented 7 years ago

Okay, thank you. And can you tell me whether it is possible to cancel jobs that are already in the queue on rocket?

MaksymDel commented 7 years ago

Try scancel JOB_ID.

You can also google simple commands like this yourself. SLURM is not local software developed at UT; it is a commonly used workload manager, and there are plenty of docs and info on the web.
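For example (JOB_ID below is a placeholder for the ID that squeue shows you):

squeue -u $USER     # list your own queued and running jobs with their IDs
scancel JOB_ID      # cancel the job with that ID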

dimatkachuk commented 7 years ago

OK, got it. Thank you.

MaksymDel commented 7 years ago

Please tell me whether the GPU allocation worked for you. Are you still stuck in the queue?

dimatkachuk commented 7 years ago

I will try it now. But can you tell me whether it is okay that preprocess.py does not accept the -gpuid 0 parameter?

MaksymDel commented 7 years ago

To answer this question, you can do the following:

1) Consult the OpenNMT-py docs on command-line parameters
2) Look at the corresponding OpenNMT-py source files (explained today in the practice session)
3) Look at lab1.pdf

Any of the options above will lead to the conclusion that preprocess.py is not supposed to take this parameter.
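A quick way to see this yourself from the command line (run from the OpenNMT-py directory; -h is the standard help flag of these scripts):

python preprocess.py -h    # lists the options preprocess.py accepts; -gpuid is not among them
python train.py -h         # compare with the options of the training script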

dimatkachuk commented 7 years ago

Hi. I still have not succeeded with the preprocessing script because it was sitting in the queue for nearly 20 hours. Then I cancelled it and started a new one on another Tesla. Unfortunately, time keeps passing and the new job is still in the queue.

MaksymDel commented 7 years ago

Do you know the reason why your script is in the queue?

dimatkachuk commented 7 years ago

Unfortunately, I haven't found any reason why it can't run, other than the HPC simply being overloaded, because I have already tried to run it on different GPUs and it still just stays in the queue.

MaksymDel commented 7 years ago

The HPC is not overloaded.

Please study the output of the squeue command and read the SLURM docs.

What does squeue output for your process?
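For example (either form shows the NODELIST(REASON) column for pending jobs):

squeue -u $USER        # all of your jobs
squeue -l -u $USER     # long listing with state, time limit and NODELIST(REASON)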

dimatkachuk commented 7 years ago

Then I don't know. My output is:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2547986 gpu nmt_powe dmytro95 PD 0:00 1 (Resources)

MaksymDel commented 7 years ago

The last column of this output, NODELIST(REASON), shows the reason your job is in the queue, and it says (Resources), which means you asked for too many resources.

Can you please paste here the SLURM script you used to submit the job that is currently in the queue?
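For reference, you can also check exactly what a pending job asked for (the job ID here is the one from your squeue output):

scontrol show job 2547986    # prints the job's requested memory, nodes, partition and generic resources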

dimatkachuk commented 7 years ago

Okay, I see my mistake. I requested too many resources: it was 100G of memory. I put 50G and now it is working.

MaksymDel commented 7 years ago

That is not the only problem with your job; you need to improve your understanding of SLURM in general. Please paste your SLURM script here (I would have checked it on GitHub if you had pushed it regularly, as stated in the lab text).

dimatkachuk commented 7 years ago

I also ran into the same problem as my teammates at some point:

slurmstepd: error: Job 2548697 exceeded memory limit (52445460 > 52428800), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: JOB 2548697 ON falcon2 CANCELLED AT 2017-10-05T13:22:02
slurmstepd: error: Exceeded step memory limit
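I guess the fix on my side is just to raise the memory line in that script slightly above what the job actually used (the numbers above look like kilobytes, so it needed just over the 50G I requested), for example:

#SBATCH --mem=60G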

dimatkachuk commented 7 years ago

I can try to push all the .sh files, but the problem was that I couldn't push from rocket. I will try again.

MaksymDel commented 7 years ago

As I remember, the problem was that you had cloned another team's repo. You do have push privileges for your own repo; you even pushed something 5 days ago :)

For now, you can just copy and paste the script you used to run the job that was in the queue, so I can see it here in this thread.

dimatkachuk commented 7 years ago

################################ Concatenation ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

cat raw-all/*.en > demo-all.en
cat raw-all/*.et > demo-all.et

echo 'finished'

################################ Splitting Data ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

paste demo-all.{et,en} | shuf > mixed-data.both

echo 'mixed-data generated'

sed -n 1,50000p mixed-data.both | cut -f 1 > test.et
sed -n 1,50000p mixed-data.both | cut -f 2 > test.en
sed -n 50001,75000p mixed-data.both | cut -f 1 > dev.et
sed -n 50001,75000p mixed-data.both | cut -f 2 > dev.en
sed -n 75001,19051439p mixed-data.both | cut -f 1 > train.et
sed -n 75001,19051439p mixed-data.both | cut -f 2 > train.en

echo 'finished'

################################ Tokenization ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

for f in {test,dev,train}.{en,et}
do
    ../OpenNMT-py/tools/tokenizer.perl < $f > tok-$f
done

echo 'finished'

################################ True casing ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

#SBATCH --mail-type=ALL
#SBATCH --mail-user=dimatkachukgf@gmail.com

../OpenNMT-py/tools/train-truecaser.perl --model en-truecase.mdl --corpus tok-train.en
../OpenNMT-py/tools/train-truecaser.perl --model et-truecase.mdl --corpus tok-train.et

for lang in en et
do
    for f in {test,dev,train}.$lang
    do
        ../OpenNMT-py/tools/truecase.perl --model $lang-truecase.mdl < tok-$f > tc-tok-$f
    done
done

echo 'finished'

################################ Filtering of training data ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

../OpenNMT-py/tools/clean-corpus-n.perl tc-tok-train en et cleaned-tc-tok-train 1 100

echo 'finished'

################################ Filtering of dev data ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

../OpenNMT-py/tools/clean-corpus-n.perl tc-tok-dev en et cleaned-tc-tok-dev 1 100

echo 'finished'

################################ Applying BPE ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mem=50G

#SBATCH --mail-type=ALL
#SBATCH --mail-user=dimatkachukgf@gmail.com

cat cleaned-tc-tok-train.et cleaned-tc-tok-train.en | ../OpenNMT-py/tools/subword-nmt/learn_bpe.py -s 20000 > eten_sm.bpe

for lang in et en
do
    ../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten_sm.bpe < cleaned-tc-tok-train.$lang > bpe_sm.cleaned-tc-tok-train.$lang
    ../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten_sm.bpe < cleaned-tc-tok-dev.$lang > bpe_sm.cleaned-tc-tok-dev.$lang
    ../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten_sm.bpe < tc-tok-test.$lang > bpe_sm.tc-tok-test.$lang
done

echo 'finished'

################################ Preprocessing ################################

#!/bin/bash

# The name of the job is
#SBATCH -J nmt_power_rangers_task

# The job requires 1 compute node
#SBATCH -N 1

# The job requires 1 task per node
#SBATCH --ntasks-per-node=1

# The maximum walltime of the job is 8 days
#SBATCH -t 192:00:00

#SBATCH --mail-type=ALL
#SBATCH --mail-user=dimatkachukgf@gmail.com

#SBATCH --mem=50G

# Leave this here if you need a GPU for your job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1

module load python-2.7.13

python preprocess.py -train_src ../data/bpe_sm.cleaned-tc-tok-train.et -train_tgt ../data/bpe_sm.cleaned-tc-tok-train.en -valid_src ../data/bpe_sm.cleaned-tc-tok-dev.et -valid_tgt ../data/bpe_sm.cleaned-tc-tok-dev.en -save_data ../data/r$

echo 'finished'

MaksymDel commented 7 years ago

What do you think the line #SBATCH --gres=gpu:tesla:1 means?

dimatkachuk commented 7 years ago

This means we will get 1 gpu of type tesla for computations.

MaksymDel commented 7 years ago

One GPU or first GPU?

BreakINbaDs commented 7 years ago

one

dimatkachuk commented 7 years ago

As I understand from https://slurm.schedmd.com/sbatch.html, it is one.

MaksymDel commented 7 years ago

That is correct, and it is nice that you got it now.

I asked because you said:

Cause I tried to run already on different GPUs and it still just stays in the queue.

and you were running a job requesting 5 GPUs, with #SBATCH --gres=gpu:tesla:5. I guess that was the main reason why that job was stuck in the queue.
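In other words, the difference between the two requests was:

#SBATCH --gres=gpu:tesla:1    # one Tesla GPU on the node, easy to schedule
#SBATCH --gres=gpu:tesla:5    # five Tesla GPUs on one node, scarce, so the job kept waiting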

You could also use the link from lab1.pdf.

If you have other problems (e.g. with memory or something else), please create a separate issue if you need help.

dimatkachuk commented 7 years ago

Yes, I know; I was trying that in order to get more memory, but now I am running it with only 1 GPU and 50G of memory, as in the sample batch script.

Thanks.