You can still have your baseline ready in time. For this, I would use a small BPE vocabulary (30,000) and test and dev sets that are not too big (50,000 and 25,000 tokens). You can also look at other teams' scripts for reference; they are open source.
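For example, with the subword-nmt scripts you already use, the vocabulary size is just the -s value (a sketch only; the paths and file names are placeholders):
# learn a joint BPE model with ~30k merge operations on the training data
cat train.src train.tgt | subword-nmt/learn_bpe.py -s 30000 > small.bpe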
Regarding your queue issue, the rocket CPU nodes are probably heavily loaded right now.
So try to allocate GPU resources and run your preprocessing scripts that way. Just tell SLURM that you want a GPU for your job (an example script is in the materials repo) and execute the same command.
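For reference, such a GPU job script could look roughly like this (only a sketch: the partition name, time limit, and the preprocess_data.sh name are placeholders; the real example is in the materials repo):
#!/bin/bash
#SBATCH --job-name=nmt_preprocess    # name shown in squeue
#SBATCH --partition=gpu              # assumed GPU partition name
#SBATCH --gres=gpu:tesla:1           # request one Tesla GPU
#SBATCH --mem=50G                    # memory for the job
#SBATCH --time=08:00:00              # assumed wall-clock limit

module load python-2.7.13            # load the Python module used on rocket
bash preprocess_data.sh              # placeholder for your own preprocessing script
You submit it with sbatch exactly as before.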
How is it going with your other issues: #5 #4 #3 #2 #1 ? Can you close some of them?
Okay, thank you. And can you tell me whether it is possible to delete jobs that are already in the queue on rocket?
Try scancel JOB_ID.
You can also google simple commands like this yourself. SLURM is not local software developed at UT; it is a commonly used workload manager, and there are a lot of docs and info on the web.
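For example (standard SLURM commands; the job ID below is just a placeholder you read off from squeue):
squeue -u $USER     # list your own jobs with their JOBIDs
scancel 1234567     # cancel the job with that JOBID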
OK, got it. Thank you.
Please tell me whether GPU allocation worked for you. Do you still get this queue issue?
Will try it now. But can you tell me whether it is okay that preprocess.py doesn't want to take the -gpuid 0 parameter?
To answer this question, you can do the following:
1) Consult the OpenNMT-py docs on command parameters
2) Look at the corresponding OpenNMT-py source files (explained today in the practice session)
3) Look at lab1.pdf
Any of the options above will lead to the conclusion that preprocess.py is not supposed to take this parameter.
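One quick way to check yourself (a sketch, assuming you are in the OpenNMT-py directory) is to print the script's own option list and look for the flag:
# argparse prints every supported option; grep finds no gpuid, so the flag does not exist here
python preprocess.py -h | grep -i gpuid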
Hi. I still have not succeeded with the preprocessing script, because it was sitting in the queue for nearly 20 hours. Then I cancelled it and started a new one on another Tesla. Unfortunately, time passes and the new job is still in the queue.
Do you know a reason why your script is in the queue?
Unfortunately, I haven't found any reason why it can't run, other than that the HPC is just overloaded, because I have already tried to run it on different GPUs and it still just stays in the queue.
HPC is not overloaded.
Please study the output of the squeue command and read the SLURM docs. What does squeue output for your process?
Then I don't know. My output is:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2547986 gpu nmt_powe dmytro95 PD 0:00 1 (Resources)
The last column of this output, NODELIST(REASON), shows the reason your process is in the queue, and it says (Resources), which means you asked for too many resources.
Can you please paste here the SLURM script you use to run the job that is in the queue now?
Okay, I see my mistake. I requested too many resources: it was 100G of memory. I put 50G and it is working.
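Concretely, in my batch script that was just the memory line, roughly:
# before: #SBATCH --mem=100G   (too much, so the job waited with reason (Resources))
#SBATCH --mem=50G              # request 50 GB of RAM instead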
That is not the only problem with your job. You need to improve your understanding of SLURM in general. Please paste your SLURM script here (I would have checked it on GitHub if you had pushed it regularly, as stated in the lab text).
I also ran into the same problem as my teammates at some point:
slurmstepd: error: Job 2548697 exceeded memory limit (52445460 > 52428800), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: JOB 2548697 ON falcon2 CANCELLED AT 2017-10-05T13:22:02
slurmstepd: error: Exceeded step memory limit
I can try to push all the .sh files, but there was a problem: I couldn't push from rocket. I will try again.
As I remember, the problem was that you had cloned another team's repo. You do have push privileges for your own repo. You even pushed something 5 days ago :)
For now, you can just copy & paste the script you used to run the job that was in the queue, so I can see it here in this thread.
################################ Concatenation ################################
cat raw-all/*.en > demo-all.en
cat raw-all/*.et > demo-all.et
echo 'finished'
################################ Splitting Data ################################
paste demo-all.{et,en} | shuf > mixed-data.both
echo 'mixed-data generated'
sed -n 1,50000p mixed-data.both | cut -f 1 > test.et
sed -n 1,50000p mixed-data.both | cut -f 2 > test.en
sed -n 50001,75000p mixed-data.both | cut -f 1 > dev.et
sed -n 50001,75000p mixed-data.both | cut -f 2 > dev.en
sed -n 75001,19051439p mixed-data.both | cut -f 1 > train.et
sed -n 75001,19051439p mixed-data.both | cut -f 2 > train.en
echo 'finished'
################################ Tokenization ################################
for f in {test,dev,train}.{en,et}
do
    ../OpenNMT-py/tools/tokenizer.perl < $f > tok-$f
done
echo 'finished'
################################ True casing ################################
../OpenNMT-py/tools/train-truecaser.perl --model en-truecase.mdl --corpus tok-train.en
../OpenNMT-py/tools/train-truecaser.perl --model et-truecase.mdl --corpus tok-train.et

for lang in en et
do
    for f in {test,dev,train}.$lang
    do
        ../OpenNMT-py/tools/truecase.perl --model $lang-truecase.mdl < tok-$f > tc-tok-$f
    done
done
echo 'finished'
################################ Filtering of training data ################################
../OpenNMT-py/tools/clean-corpus-n.perl tc-tok-train en et cleaned-tc-tok-train 1 100
echo 'finished'
################################ Filtering of dev data ################################
../OpenNMT-py/tools/clean-corpus-n.perl tc-tok-dev en et cleaned-tc-tok-dev 1 100
echo 'finished'
################################ Applying BPE ################################
cat cleaned-tc-tok-train.et cleaned-tc-tok-train.en | ../OpenNMT-py/tools/subword-nmt/learn_bpe.py -s 20000 > eten_sm.bpe
for lang in et en
do
    ../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten_sm.bpe < cleaned-tc-tok-train.$lang > bpe_sm.cleaned-tc-tok-train.$lang
    ../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten_sm.bpe < cleaned-tc-tok-dev.$lang > bpe_sm.cleaned-tc-tok-dev.$lang
    ../OpenNMT-py/tools/subword-nmt/apply_bpe.py -c eten_sm.bpe < tc-tok-test.$lang > bpe_sm.tc-tok-test.$lang
done
echo 'finished'
################################ Preprocessing ################################
module load python-2.7.13
python preprocess.py -train_src ../data/bpe_sm.cleaned-tc-tok-train.et -train_tgt ../data/bpe_sm.cleaned-tc-tok-train.en -valid_src ../data/bpe_sm.cleaned-tc-tok-dev.et -valid_tgt ../data/bpe_sm.cleaned-tc-tok-dev.en -save_data ../data/r$
echo 'finished'
What do you think the line #SBATCH --gres=gpu:tesla:1 means?
This means we will get 1 gpu of type tesla for computations.
One GPU or first GPU?
one
As I understand from here: https://slurm.schedmd.com/sbatch.html, it is one.
That is correct, and it is nice that you got it now.
I asked because you said:
Cause I tried to run already on different GPUs and it still just stays in the queue.
and were running a job requesting 5 GPUs, with #SBATCH --gres=gpu:tesla:5. I guess that was the basic reason why that job was stuck in the queue.
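To make the difference explicit, it comes down to this one line in the job script:
#SBATCH --gres=gpu:tesla:5   # asks for five Tesla GPUs on one node, hard to schedule, reason (Resources)
#SBATCH --gres=gpu:tesla:1   # asks for a single Tesla GPU, gets scheduled much sooner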
You could also use a link from the lab1.pdf: link.
If you have other problems (e.g. with memory or something else), please create a separate issue if you need help.
Yes, I know; I was trying that to get more memory, but now I run it with only 1 GPU and 50G of memory, as in the sample batch script.
Thanks.
Hello.
Our team has a problem: we haven't started training our model yet because we are running the SLURM file for preprocessing, but to find out that we made some mistake or exceeded memory at some step we have to wait in the queue for too long. Can you advise something?