metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

How to submit atlas jobs on a cluster? #272

Closed SilasK closed 4 years ago

SilasK commented 4 years ago

Continue the discussion with @sofie8 from #258

At the moment I was running atlas2 like this:

qsub -l nodes=1:ppn=36 -l pmem=5gb -l walltime=72:00:00 -A myproject myjobscript.pbs

with the following config parameters:

# max cores per process
threads: 36
# Memory for most jobs especially from BBtools, which are memory demanding
mem: 160

This command runs atlas on one node for 72 h, with 36 processors. However, each step uses the full 36 processors. This means one step is executed after the other.

Thanks to the underlying Snakemake, Atlas can submit each step as a separate job on your PBS cluster! In other words, Atlas runs the qsub command for you.

You can set up cluster support by following the updated documentation: https://metagenome-atlas.readthedocs.io/en/latest/usage/cluster.html
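For reference: the cluster profile used throughout this thread is installed with cookiecutter; a minimal sketch, assuming you want it in the default profile location:

cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git
# creates ~/.config/snakemake/cluster/ with config.yaml, cluster_config.yaml,
# key_mapping.yaml, scheduler.py and the *_status scripts discussed below

Atlas then picks it up via --profile cluster.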

But keep in mind to set the threads to something like

# max cores per process
threads: 8
# Memory for most jobs especially from BBtools, which are memory demanding
mem: 90
assembly_memory: 160 #max capacity of your bigmem queue
Sofie8 commented 4 years ago

Hi Silas, thanks for the extra explanation! I made a new conda env with 2.3b and found the cluster config file. I am trying some settings. The genecatalog step in particular was slow with the CAT database. Can I just rerun this last part, or which modifications are in the atlas 2.3 version to predict gene functions either on the assembly or on the MAGs? So far I did it on the MAGs, but I have some shallow metagenomes for which the MAGs are rather incomplete, so I would prefer to run the genecatalog on the assembly. With this cluster setup, could I maybe also try InterProScan annotation?

SilasK commented 4 years ago

Hey Sofie,

All changes in Atlas 2.3 should be after the genome dereplication step, so running atlas 2.3 in the same working directory as before should not rerun assembly for example.

Important: double-check your config file against the config file for atlas 2.3. There are some small changes.

Especially in Execution parameters and annotations.

But to be sure, you can always do a dry-run first. I also suggest renaming the genome and Genecatalog subdirectories.

Changes in 2.2 and 2.3 include dropping CAT in favour of GTDB-Tk, and annotating genes with eggNOG-mapper 2.

If you set:

genecatalog:
  source: contigs               
  clustermethod: linclust      
  minlength_nt: 100
  minid: 0.95                   
  coverage: 0.9
  extra: ""
  SubsetSize: 500000

the gene annotation is done on the contigs. All genes are clustered into a gene catalog, and the representatives are annotated with eggNOG-mapper.

Yes, you can then run InterProScan on the gene catalog.

There is also a command-line executable for KOfamscan.
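For illustration only, a hedged sketch of how one might run InterProScan on the gene catalog representatives afterwards; the path to the representative proteins is hypothetical and the InterProScan options may differ on your installation:

# hypothetical path to the representative proteins of the gene catalog
interproscan.sh -i Genecatalog/gene_catalog.faa -f tsv -o Genecatalog/interproscan.tsv --cpu 8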

fconstancias commented 4 years ago

Hi Silas, thanks a lot for all your effort developing this very useful pipeline.

I am trying to generate a gene catalog from wastewater metagenomes, but I am having difficulties running metagenome-atlas on a PBS cluster. I followed the instructions to set up the cluster mode and ran atlas init:

atlas init --db-dir /gpfs1/scratch/db/atlas --working-dir EZ --data-type metagenome --assembler megahit --threads=10 --skip-qc 01_QC

Then everything worked fine running atlas run in dry mode:

atlas run -w EZ -c /gpfs1/scratch/Experiment2/test_gene_catalog/EZ/config.yaml --profile cluster --jobs 4 genecatalog -n

but when I tried to actually run the pipeline I got the following error for every job:


submit command: qsub -N init_pre_assembly_processing -l nodes=1 ppn=4 mem=10gb walltime=3000 /gpfs1/scratch/EZ/Experiment2/test_gene_catalog/EZ/.snakemake/tmp.5_kxsb9d/snakejob.init_pre_assembly_processing.86.sh
Traceback (most recent call last):
  File "/home/.config/snakemake/cluster/scheduler.py", line 66, in <module>
    raise Exception("Job can't be submitted\n"+output.decode("utf-8")+error.decode("utf-8"))
Exception: Job can't be submitted
usage: qsub [-a date_time] [-A account_string] [-c interval]
        [-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
        [-k keep] [-l resource_list] [-m mail_options] [-M user_list]
        [-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
        [-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value...]
        [-S path] [-u user_list] [-W otherattributes=value...]
        [-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
       qsub --version

Error submitting jobscript (exit code 1):

Thanks in advance for your help.

SilasK commented 4 years ago

Well done, I think you did everything right.

Now, each cluster is a bit different. Making a tool that can submit jobs to all clusters without problems is probably impossible, but with some adjustment it should work.

Are you aware of any constraints of your cluster? Different queues, maximum memory, maximum walltime? What is the default queue of your cluster system? Do you have a special queue for high-memory jobs?

Unfortunately the log is not very helpful, but if you run the following command you may get more information about why the qsub command failed.

qsub -N init_pre_assembly_processing -l nodes=1 ppn=4 mem=10gb walltime=3000 /gpfs1/scratch/EZ/Experiment2/test_gene_catalog/EZ/.snakemake/tmp.5_kxsb9d/snakejob.init_pre_assembly_processing.86.sh

This command should submit a job with 4 threads and 10 GB of memory for 3000 s.
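As a side note on the error above: the submit command in the log passes only the first resource with -l (qsub ... -l nodes=1 ppn=4 mem=10gb walltime=3000 ...). On the PBS variants seen later in this thread, each resource needs its own -l flag (or one comma-separated -l list), and ppn is attached to nodes with a colon, e.g.:

qsub -N init_pre_assembly_processing -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /gpfs1/scratch/EZ/Experiment2/test_gene_catalog/EZ/.snakemake/tmp.5_kxsb9d/snakejob.init_pre_assembly_processing.86.sh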

fconstancias commented 4 years ago

Thanks for your quick reply.

There are indeed different queues with some limitations, but the std queue seems unrestricted.

Compute (AMD) nodes (190)
normal priority 
max 7000 cores
resource allocation parity is determined by group/project

Therefore, I set the ~/.config/snakemake/cluster/cluster_config.yaml specifying the std queue for all the steps.

__default__:
# default parameter for all rules
  #queue: std
  nodes: 1

# The following rules in atlas need more time/memory.
# If you need to submit them to different queues you can configure this as outlined.

# run_megahit:
#   queue: std
# run_spades:
#   queue: std

#gtdb-tk classify uses 'large_mem' and log time
# classify:
#   queue: std

# run_checkm_lineage_wf:
#   queue: std

# run_all_checkm_lineage_wf:
#   queue: std

#You can overwrite values for specific rules

  #account: "florentin"
  #time:  # h
  #threads:

Unfortunately, I couldn't find the bash script under .snakemake/tmp*/snakejob.init_pre_assembly_processing.86.sh; I guess it is removed soon afterwards.

SilasK commented 4 years ago

~/.config/snakemake/cluster/cluster_config.yaml is a YAML file; the # marks the beginning of a comment.

__default__:
# default parameter for all rules
  queue: std
  nodes: 1
  account: "florentin"

This should be everything you need to specify. Probably queue: std is not even necessary.

Tell me how it goes.

fconstancias commented 4 years ago

Thanks. I set the ~/.config/snakemake/cluster/cluster_config.yaml to

__default__:
# default parameter for all rules
  queue: std
  nodes: 1
  account: "florentin"

I am facing a different error now.

Traceback (most recent call last):
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/__init__.py", line 627, in snakemake
    batch=batch,
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/workflow.py", line 844, in execute
    success = scheduler.schedule()
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/scheduler.py", line 364, in schedule
    self.run(job)
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/scheduler.py", line 383, in run
    error_callback=self._error,
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 813, in run
    jobscript = self.get_jobscript(job)
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 629, in get_jobscript
    f = job.format_wildcards(self.jobname, cluster=self.cluster_wildcards(job))
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 701, in cluster_wildcards
    return Wildcards(fromdict=self.cluster_params(job))
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 676, in cluster_params
    cluster = self.cluster_config.get("__default__", dict()).copy()
AttributeError: 'NoneType' object has no attribute 'copy'
[2020-01-13 18:55 CRITICAL] Command 'snakemake --snakefile /home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/atlas/Snakefile --directory /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas --jobs 4 --rerun-incomplete --configfile '/gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml' --nolock  --profile cluster --use-conda --conda-prefix /gpfs1/scratch/florentin/db/atlas/conda_envs   genecatalog   ' returned non-zero exit status 1.
SilasK commented 4 years ago

Can you try to remove the comment line? And if it still fails, send me the ~/.config/snakemake/cluster/cluster_config.yaml as a file in a comment.

fconstancias commented 4 years ago

Same error again :

atlas run -w EZ_atlas -c /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml --profile cluster --jobs 1 genecatalog
[2020-01-13 20:57 INFO] Executing: snakemake --snakefile /home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/atlas/Snakefile --directory /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas --jobs 1 --rerun-incomplete --configfile '/gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml' --nolock  --profile cluster --use-conda --conda-prefix /gpfs1/scratch/florentin/db/atlas/conda_envs   genecatalog   

Didn't find raw reads in sampleTable - skip QC
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
        count   jobs
        1       add_eggNOG_header
        7       align_reads_to_prefilter_contigs
        1       cluster_genes
        1       combine_egg_nogg_annotations
        1       concat_genes
        7       error_correction
        7       filter_by_coverage
        1       filter_genes
        7       finalize_contigs
        1       gene_subsets
        1       genecatalog
        1       get_rep_proteins
        7       init_pre_assembly_processing
        7       merge_pairs
        7       pileup_prefilter
        7       predict_genes
        7       rename_contigs
        1       rename_gene_catalog
        7       rename_megahit_output
        1       rename_protein_catalog
        7       run_megahit
        87

[Mon Jan 13 20:57:15 2020]
rule init_pre_assembly_processing:
    input: /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/01_QC/ESMetFM09_R1_01M.fastq.gz, /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/01_QC/ESMetFM09_R2_01M.fastq.gz
    output: ESMetFM09-01M/assembly/reads/QC_R1.fastq.gz, ESMetFM09-01M/assembly/reads/QC_R2.fastq.gz
    log: ESMetFM09-01M/logs/assembly/init.log
    jobid: 82
    wildcards: sample=ESMetFM09-01M
    threads: 4
    resources: mem=10, java_mem=8, time=0.5

Traceback (most recent call last):
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/__init__.py", line 627, in snakemake
    batch=batch,
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/workflow.py", line 844, in execute
    success = scheduler.schedule()
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/scheduler.py", line 364, in schedule
    self.run(job)
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/scheduler.py", line 383, in run
    error_callback=self._error,
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 813, in run
    jobscript = self.get_jobscript(job)
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 629, in get_jobscript
    f = job.format_wildcards(self.jobname, cluster=self.cluster_wildcards(job))
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 701, in cluster_wildcards
    return Wildcards(fromdict=self.cluster_params(job))
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/snakemake/executors.py", line 676, in cluster_params
    cluster = self.cluster_config.get("__default__", dict()).copy()
AttributeError: 'NoneType' object has no attribute 'copy'
[2020-01-13 20:57 CRITICAL] Command 'snakemake --snakefile /home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/atlas/Snakefile --directory /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas --jobs 1 --rerun-incomplete --configfile '/gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml' --nolock  --profile cluster --use-conda --conda-prefix /gpfs1/scratch/florentin/db/atlas/conda_envs   genecatalog   ' returned non-zero exit status 1.

Please find enclosed the yaml file - had to use .txt extension otherwise github was unhappy.

config.yaml.txt

Thanks a ton for your help.

SilasK commented 4 years ago

Sorry, could you send me the ~/.config/snakemake/cluster/cluster_config.yaml

fconstancias commented 4 years ago

my bad, here it is : cluster_config.yaml.txt

SilasK commented 4 years ago

The two spaces at the beginning are important:

__default__:
  queue: std
  nodes: 1
  account: "florentin"
SilasK commented 4 years ago

Mail clients don't show the code correctly, but you can see the correct version on GitHub.

fconstancias commented 4 years ago

Properly formatting the ~/.config/snakemake/cluster/cluster_config.yaml file did not solve the issue yet.

However, I noticed that the way CPU resources are requested on the PBS cluster we are using here differs from what atlas configures for PBS clusters.

qsub -N init_pre_assembly_processing -q std **-l nodes=1 ppn=4** /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.85m__d6v/snakejob.init_pre_assembly_processing.83.sh

This command gave me the following error, suggesting that I am not giving qsub the proper options:

usage: qsub [-a date_time] [-A account_string] [-c interval]
        [-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
        [-k keep] [-l resource_list] [-m mail_options] [-M user_list]
        [-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
        [-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value...]
        [-S path] [-u user_list] [-W otherattributes=value...]
        [-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
       qsub --version

while qsub -N init_pre_assembly_processing -q std **-l select=1:ncpus=1** /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.85m__d6v/snakejob.init_pre_assembly_processing.83.sh seems to work! It returned 10444.pbs01

ls
01_QC  EZ_atlas  init_pre_assembly_processing.e10443  init_pre_assembly_processing.o10443

cat *43

Didn't find raw reads in sampleTable - skip QC
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       init_pre_assembly_processing
        1

[Mon Jan 13 23:36:12 2020]
rule init_pre_assembly_processing:
    input: /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/01_QC/ESMetFM37_R1_01M.fastq.gz, /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/01_QC/ESMetFM37_R2_01M.fastq.gz
    output: ESMetFM37-01M/assembly/reads/QC_R1.fastq.gz, ESMetFM37-01M/assembly/reads/QC_R2.fastq.gz
    log: ESMetFM37-01M/logs/assembly/init.log
    jobid: 0
    wildcards: sample=ESMetFM37-01M
    threads: 4
    resources: mem=10, java_mem=8, time=0.5

Activating conda environment: /gpfs1/scratch/florentin/db/atlas/conda_envs/70a94580
[Mon Jan 13 23:36:16 2020]
Finished job 0.
1 of 1 steps (100%) done

 Resource Usage on 2020-01-13 23:36:16.281939:
 JobId: 10443.pbs01                                  Project: _pbs_project_default 
 Submission Host: ln-0001.scelse.sg 
 Exit Status: 0
 NCPUs Requested: 1                                  NCPUs Used: 1
 Memory Requested: None                              Memory Used: 0kb 
 Vmem Used: 0kb
 CPU Time Used: 00:00:13 
 Walltime requested: None                    Walltime Used: 00:00:06
 Start Time: Mon Jan 13 23:36:09 2020 
 End Time: Mon Jan 13 23:36:16 2020 
 Execution Nodes Used: (ca-0003:ncpus=1)

Is there an easy way to modify atlas so it generates qsub commands with -l select=x:ncpus=x instead of -l nodes=x ppn=x?

Sofie8 commented 4 years ago

Hi Florentin,

I had the same. To resolve the first error (AttributeError: 'NoneType' object has no attribute 'copy') I removed spaces in the formatting of cluster_config.yaml: cluster_config.zip

And then, Silas, I made some modifications in key_mapping.yaml (key_mapping.zip): -l before mem and -l before walltime. In my submit command below, I get a space between nodes= and :ppn=; how can I change key_mapping.yaml so that it doesn't need a space? I see Florentin has the same issue, -l nodes=1 ppn=4

submit command: qsub -N initialize_qc -l nodes=2 :ppn=4 -l mem=10gb -l walltime =3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.9yoj5h7 5/snakejob.initialize_qc.186.sh

pbs:
  command: "qsub"
  key_mapping:
    name: "-N {}"
    account: "-A {}"
    queue: #"-q {}"
    nodes: "-l nodes={}"
    threads: ":ppn={}"
    mem: "-l mem={}gb"
    time: "-l walltime={}00" #min= seconds x 100

I think if this issue nodes=:ppn= is solved, we should be able to submit the jobs.

SilasK commented 4 years ago

Is there an easy way to modify atlas so it generates qsub commands with -l select=x:ncpus=x instead of -l nodes=x ppn=x?

Yes, as Sofie points out the key_mapping.yaml is there for this reason.

I incorporated @Sofie8 changes in the profile. Have a look here and copy-paste the updated part into your key_mapping.yaml
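For a PBS Pro scheduler that expects -l select=...:ncpus=... (as in Florentin's case), a hedged sketch of what the pbs section of key_mapping.yaml could look like, using the same placeholder convention as the version quoted further down in this thread:

pbs:
  command: "qsub"
  key_mapping:
    name: "-N {}"
    account: "-A {}"
    queue: "-q {}"
    threads: "-l select=1:ncpus={}"  # PBS Pro syntax; always request a single node
    mem: "-l mem={}gb"
    time: "-l walltime={}00"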

I think if this issue nodes=:ppn= is solved, we should be able to submit the jobs.

I solved the problem by forcing to take always 1 node. I don't know how the tools in atlas would span multiple nodes anyway.

@Sofie8 I'm however not sure if you can supply multiple -l arguments to the qsub command. As I understand the logs of qsub it should be only one. But test it out.

@Sofie8 and @fconstancias Could I ask you to give me your version of qsub or pbs?

fconstancias commented 4 years ago

qsub --version

pbs_version = 18.2.3.20181206140456

SilasK commented 4 years ago

Could you make atlas submit jobs by performing these changes?

fconstancias commented 4 years ago

Unfortunately no, I now have the following error:

...
[Tue Jan 14 18:29:57 2020]
rule init_pre_assembly_processing:
    input: /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/01_QC/ESMetFM37_R1_01M.fastq.gz, /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/01_QC/ESMetFM37_R2_01M.fastq.gz
    output: ESMetFM37-01M/assembly/reads/QC_R1.fastq.gz, ESMetFM37-01M/assembly/reads/QC_R2.fastq.gz
    log: ESMetFM37-01M/logs/assembly/init.log
    jobid: 83
    wildcards: sample=ESMetFM37-01M
    threads: 4
    resources: mem=10, java_mem=8, time=0.5

Traceback (most recent call last):
  File "/home/florentin/.config/snakemake/cluster/scheduler.py", line 49, in <module>
    command= command_options[system]['command']
KeyError: '{{cookiecutter.cluster_system}}'
Error submitting jobscript (exit code 1):

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Note the path to the log file for debugging.
Documentation is available at: https://metagenome-atlas.readthedocs.io
Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues
Complete log: /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/log/2020-01-14T182955.862075.snakemake.log
[2020-01-14 18:30 CRITICAL] Command 'snakemake --snakefile /home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/site-packages/atlas/Snakefile --directory /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas --jobs 4 --rerun-incomplete --configfile '/gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml' --nolock --profile cluster --use-conda --conda-prefix /gpfs1/scratch/florentin/db/atlas/conda_envs genecatalog ' returned non-zero exit status 1.

fconstancias commented 4 years ago

Hi @Sofie8, thanks for your input !

@SilasK, please see here the yaml files.

cat ~/.config/snakemake/cluster/cluster_config.yaml

__default__:
  queue: std
  nodes: 1
cat ~/.config/snakemake/cluster/key_mapping.yaml

# only parameters defined in key_mapping (see below) are passed to the command in the order specified.
system: "{{cookiecutter.cluster_system}}" #check if system is defined below

slurm:
  command: "sbatch --parsable"
  key_mapping:
    name: "--job-name={}"
    threads: "-n {}"
    mem: "--mem={}g"
    account: "--account={}"
    queue: "--partition={}"
    time: "--time={}"
    nodes: "-N {}"
pbs:
  command: "qsub"
  key_mapping:
    name: "-N {}"
    account: "-A {}"
    queue: "-l partition={}"
    threads: "-l nodes=1:ppn={}" # always use 1 node
    mem: "-l mem={}gb"
    time: "-l walltime={}00" #min= seconds x 100
lsf:
  command: "bsub"
  key_mapping:
    name: "-J {}"
    threads: "-n {}"
    mem: "-M {}000000"
    account: "-P {}"
    queue: "-q {}"
    time: "-W {}"
    nodes: "-C {}"

# for other cluster systems see: https://slurm.schedmd.com/rosetta.pdf
SilasK commented 4 years ago

Replace the line system: "{{cookiecutter.cluster_system}}" in ~/.config/snakemake/cluster/key_mapping.yaml

with system: "pbs"
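A one-liner that does this, assuming the profile sits in the default location:

sed -i 's/{{cookiecutter.cluster_system}}/pbs/' ~/.config/snakemake/cluster/key_mapping.yaml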

fconstancias commented 4 years ago

Thanks for your suggestion. Unfortunately, it did not solve the issue:

submit command: qsub -N init_pre_assembly_processing -l partition=std -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.uzzu21do/snakejob.init_pre_assembly_processing.84.sh
Traceback (most recent call last):
  File "/home/florentin/.config/snakemake/cluster/scheduler.py", line 66, in <module>
    raise Exception("Job can't be submitted\n"+output.decode("utf-8")+error.decode("utf-8"))
Exception: Job can't be submitted
qsub: Cannot set attribute, read only or insufficient permission  Resource_List.partition

Error submitting jobscript (exit code 1):

SilasK commented 4 years ago

OK. If I understand the error qsub: Cannot set attribute, read only or insufficient permission Resource_List.partition correctly,

you should simply not define any partition. Remove the line from the ~/.config/snakemake/cluster/cluster_config.yaml.

fconstancias commented 4 years ago

Well, I think there is a parameter in ~/.config/snakemake/cluster/key_mapping.yaml that is incompatible with my qsub version. In my case, the queue is specified by the -q argument. Hence, I have replaced queue: "-l partition={}" with queue: "-q {}" in ~/.config/snakemake/cluster/key_mapping.yaml.

Well, it didn't solve everything but I think I am closer than ever :

...
submit command: qsub -N error_correction -q std -l nodes=1:ppn=10 -l mem=60gb -l walltime=30000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.ne1r33ru/snakejob.error_correction.79.sh
Traceback (most recent call last):
  File "/home/florentin/.config/snakemake/cluster/scheduler.py", line 75, in <module>
    jobid= int(res.strip().split()[-1])
ValueError: invalid literal for int() with base 10: '10487.pbs01'

Some error related to the jobid.

And when I manually submit one of the jobs created by atlas, it works:

qsub -N error_correction -q std -l nodes=1:ppn=10 -l mem=60gb -l walltime=30000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.ne1r33ru/snakejob.error_correction.79.sh
10488.pbs01

Thanks a ton for your help @SilasK

SilasK commented 4 years ago

OK, it seems your job ids are not plain numbers but something like 10488.pbs01.

If you have a running job, e.g. if you submit a job outside of atlas.

Can you try: qstat -f -x 10488.pbs01

and

qstat -f -x 10488

I also updated the cluster profile. You can remove the cluster folder and try to make a new one. It's the cookiecutter ... instruction.

Sofie8 commented 4 years ago

qsub -- version: Version: 6.1.3

@SilasK I have made the changes in key_mapping.yaml; now the manual job runs, but not when I submit via the pbs script.

[Tue Jan 14 22:14:20 2020]
rule initialize_qc:
    input: /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/raw/X5A2_R1.fastq.gz, /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/raw/X5A2_R2.fastq.gz
    output: X5A2/sequence_quality_control/X5A2_raw_R1.fastq.gz, X5A2/sequence_quality_control/X5A2_raw_R2.fastq.gz
    log: X5A2/logs/QC/init.log
    jobid: 198
    wildcards: sample=X5A2
    priority: 80
    threads: 4
    resources: mem=10, java_mem=8, time=0.5

submit command: qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.8olj0aea/snakejob.initialize_qc.198.sh
Traceback (most recent call last):
  File "/ddn1/vol1/site_scratch/leuven/314/vsc31426/newatlas23beta/cluster/scheduler.py", line 66, in <module>
    raise Exception("Job can't be submitted\n"+output.decode("utf-8")+error.decode("utf-8"))
Exception: Job can't be submitted

usage: qsub [-a date_time] [-A account_string] [-b secs] [-c [ none | { enabled | periodic | shutdown | depth= | dir= | interval=}... ] [-C directive_prefix] [-d path] [-D path] [-e path] [-h] [-I] [-j oe|eo|n] [-k {oe}] [-K ] [-l resource_list] [-m n|{abe}] [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user [-J <jobid]] [-q queue] [-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path [-W additional_attributes] [-v variable_list] [-V ] [-x] [-X] [-z] [script]

Error submitting jobscript (exit code 1):

SilasK commented 4 years ago

now the manual job runs

What do you mean by this exactly?

Is there no error message explaining why the qsub script fails?

Assuming the jobscript /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.8olj0aea/snakejob.initialize_qc.198.sh still exists, can you submit the following variations and see if they work or throw an understandable error message?

qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.8olj0aea/snakejob.initialize_qc.198.sh

qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 -w e /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.8olj0aea/snakejob.initialize_qc.198.sh

Sofie8 commented 4 years ago

Yes, by manual I meant submitting the single qsub job myself.

So, the first works, second not.

✘ [Jan/15 10:20] vsc31426@tier2-p-login-4 /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/atlas
$ qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.1yu6gl4z/snakejob.initialize_qc.186.sh
50160613.tier2-p-moab-2.tier2.hpc.kuleuven.be

✔ [Jan/15 10:21] vsc31426@tier2-p-login-4 /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/atlas
$ qstat
Job ID                         Name           User      Time Use  S  Queue
50160613.tier2-p-moab-2.tier2  initialize_qc  vsc31426  0         Q  q1h

✔ [Jan/15 10:21] vsc31426@tier2-p-login-4 /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/atlas
$ qstat
Job ID                         Name           User      Time Use  S  Queue
50160613.tier2-p-moab-2.tier2  initialize_qc  vsc31426  0         R  q1h

✔ [Jan/15 10:22] vsc31426@tier2-p-login-4 /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/atlas
$ qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 -w e /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.9yoj5h75/snakejob.initialize_qc.186.sh
qsub: Requested working directory 'e' is not a valid directory
Please specify a valid working directory.

full error log: atlas23beta_setupGM.pbs.zip

But I specified to submit only 5 jobs at the same time; in the log I still see it submitting more?

restart-times: 0
cluster-config: "/ddn1/vol1/site_scratch/leuven/314/vsc31426/newatlas23beta/cluster/cluster_config.yaml" #abs path
cluster: "scheduler.py"
# cluster-status: "pbs_status.py"
# max-jobs-per-second: 1
max-status-checks-per-second: 1
cores: 5 # how many jobs you want to submit to your cluster queue
local-cores: 1
rerun-incomplete: true # recomended for cluster submissions
keep-going: false

The lines in the scheduler.py where the error occurs:

# construct command:
for key in key_mapping:
    if key in cluster_param:
        command+=" "
        command+=key_mapping[key].format(cluster_param[key])

command+=' {}'.format(jobscript)

eprint("submit command: "+command)

p = Popen(command.split(' '), stdout=PIPE, stderr=PIPE)
output, error = p.communicate()
if p.returncode != 0:
    raise Exception("Job can't be submitted\n"+output.decode("utf-8")+error.decode("utf-8"))
else:
    res= output.decode("utf-8")

My pbs script: atlas23beta_setupGM.zip

SilasK commented 4 years ago

submit commands

OK, maybe the command splitting is the problem. Can you replace the line p = Popen(command.split(' '), stdout=PIPE, stderr=PIPE) with p = Popen(command, stdout=PIPE, stderr=PIPE)?

status checks with qstat

50160613.tier2-p-moab-2.tier2.hpc.kuleuven.be is the jobid?

After submitting the command, can you run with the jobid: qstat -f -x 50160613.tier2-p-moab-2.tier2.hpc.kuleuven.be and qstat -f -x 50160613

How many jobs should be run at the same time.

Apparently the 5 cores you defined get overwritten by 36. I will try to fix that in 3143127, but for now you can run atlas with --jobs 5 so that only 5 jobs get submitted at the same time.

Running atlas in general.

I suggest setting 8 threads in your atlas config.yaml and adapting your pbs script as follows:

#!/bin/bash -l
#PBS -A lp_h_microbe
#PBS -l nodes=1:ppn=5 # less than 5 threads are needed. 
#PBS -l walltime=24:00:00 # or longer
#PBS -l pmem=20gb
#PBS -l partition=std # no need for bigmem; is std the standard partition?
#PBS -m ae #what does this stand for?
#PBS -M sofie.thijs@uhasselt.be

module purge
#module load Java/1.8.0_171 # I don't think it's used by Atlas

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

source activate atlas23beta
cd /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/atlas

# to run:
atlas run all -w /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria \
--profile /ddn1/vol1/site_scratch/leuven/314/vsc31426/newatlas23beta/cluster \
--jobs 5

You can also run atlas in a screen session.
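For example, a minimal sketch (the session name and paths are placeholders):

screen -S atlas          # start a detachable session on the login node
atlas run all -w /path/to/working_dir --profile /path/to/cluster --jobs 5
# detach with Ctrl-a d, reattach later with: screen -r atlas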

fconstancias commented 4 years ago

if you have a running job, e.g. if you submit a job outside of atlas.

Can you try: qstat -f -x 10488.pbs01

and

qstat -f -x 10488

Following your suggestion, I submitted a job generated by atlas, but outside of it:

qsub -N merge_pairs -q std -l nodes=1:ppn=10 -l mem=60gb -l walltime=30000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.6aq5vwnd/snakejob.merge_pairs.70.sh

Both qstat -f -x with "jobid".pbs01 and with just "jobid" gave me the same output:

qstat -f -x 10528.pbs01

Job Id: 10528.pbs01 Job_Name = merge_pairs Job_Owner = florentin@ln-0001.scelse.sg resources_used.cpupercent = 672 resources_used.cput = 00:01:22 resources_used.mem = 2908512kb resources_used.ncpus = 10 resources_used.vmem = 62745940kb resources_used.walltime = 00:00:11 job_state = F queue = std server = pbs01 Checkpoint = u ctime = Wed Jan 15 21:13:16 2020 Error_Path = ln-0001.scelse.sg:/gpfs1/scratch/florentin/EZ/Experiment2/test _gene_catalog/merge_pairs.e10528 exec_host = ca-0011/0*10 exec_vnode = (ca-0011:ncpus=10:mem=62914560kb) Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Wed Jan 15 21:13:27 2020 Output_Path = ln-0001.scelse.sg:/gpfs1/scratch/florentin/EZ/Experiment2/tes t_gene_catalog/merge_pairs.o10528 Priority = 0 qtime = Wed Jan 15 21:13:16 2020 Rerunable = True Resource_List.mem = 62914560kb Resource_List.mpiprocs = 10 Resource_List.ncpus = 10 Resource_List.nodect = 1 Resource_List.nodes = 1:ppn=10 Resource_List.place = scatter Resource_List.select = 1:ncpus=10:mem=62914560KB:mpiprocs=10 Resource_List.walltime = 08:20:00 stime = Wed Jan 15 21:13:16 2020 session_id = 125506 jobdir = /home/florentin substate = 92 Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash, PBS_O_HOME=/home/florentin,PBS_O_LOGNAME=florentin, PBS_O_WORKDIR=/gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalo g,PBS_O_LANG=en_US.utf-8, PBS_O_PATH=/home/florentin/miniconda3/envs/metagenome-atlas/bin:/home/ florentin/miniconda3/condabin:/opt/gcc/6.1.0/bin:/cm/local/apps/environ ment-modules/4.0.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbi n:/sbin:/usr/sbin:/cm/local/apps/environment-modules/4.0.0/bin:/usr/lpp /mmfs/bin:/opt/ibutils/bin:/opt/pbs/bin:/home/florentin/.local/bin:/hom e/florentin/bin:/usr/lpp/mmfs/bin:/opt/pbs/bin:/home/florentin/.local/b in:/home/florentin/bin,PBS_O_MAIL=/var/spool/mail/florentin, PBS_O_QUEUE=std,PBS_O_HOST=ln-0001.scelse.sg comment = Job run at Wed Jan 15 at 21:13 on (ca-0011:ncpus=10:mem=62914560k b) and finished etime = Wed Jan 15 21:13:16 2020 run_count = 1 Stageout_status = 1 Exit_status = 0 Submit_arguments = -N merge_pairs -q std -l nodes=1:ppn=10 -l mem=60gb -l w alltime=30000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog /EZ_atlas/.snakemake/tmp.6aq5vwnd/snakejob.merge_pairs.70.sh history_timestamp = 1579094007 project = _pbs_project_default

qstat -f -x 10528

Job Id: 10528.pbs01 Job_Name = merge_pairs Job_Owner = florentin@ln-0001.scelse.sg resources_used.cpupercent = 672 resources_used.cput = 00:01:22 resources_used.mem = 2908512kb resources_used.ncpus = 10 resources_used.vmem = 62745940kb resources_used.walltime = 00:00:11 job_state = F queue = std server = pbs01 Checkpoint = u ctime = Wed Jan 15 21:13:16 2020 Error_Path = ln-0001.scelse.sg:/gpfs1/scratch/florentin/EZ/Experiment2/test _gene_catalog/merge_pairs.e10528 exec_host = ca-0011/0*10 exec_vnode = (ca-0011:ncpus=10:mem=62914560kb) Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Wed Jan 15 21:13:27 2020 Output_Path = ln-0001.scelse.sg:/gpfs1/scratch/florentin/EZ/Experiment2/tes t_gene_catalog/merge_pairs.o10528 Priority = 0 qtime = Wed Jan 15 21:13:16 2020 Rerunable = True Resource_List.mem = 62914560kb Resource_List.mpiprocs = 10 Resource_List.ncpus = 10 Resource_List.nodect = 1 Resource_List.nodes = 1:ppn=10 Resource_List.place = scatter Resource_List.select = 1:ncpus=10:mem=62914560KB:mpiprocs=10 Resource_List.walltime = 08:20:00 stime = Wed Jan 15 21:13:16 2020 session_id = 125506 jobdir = /home/florentin substate = 92 Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash, PBS_O_HOME=/home/florentin,PBS_O_LOGNAME=florentin, PBS_O_WORKDIR=/gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalo g,PBS_O_LANG=en_US.utf-8, PBS_O_PATH=/home/florentin/miniconda3/envs/metagenome-atlas/bin:/home/ florentin/miniconda3/condabin:/opt/gcc/6.1.0/bin:/cm/local/apps/environ ment-modules/4.0.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbi n:/sbin:/usr/sbin:/cm/local/apps/environment-modules/4.0.0/bin:/usr/lpp /mmfs/bin:/opt/ibutils/bin:/opt/pbs/bin:/home/florentin/.local/bin:/hom e/florentin/bin:/usr/lpp/mmfs/bin:/opt/pbs/bin:/home/florentin/.local/b in:/home/florentin/bin,PBS_O_MAIL=/var/spool/mail/florentin, PBS_O_QUEUE=std,PBS_O_HOST=ln-0001.scelse.sg comment = Job run at Wed Jan 15 at 21:13 on (ca-0011:ncpus=10:mem=62914560k b) and finished etime = Wed Jan 15 21:13:16 2020 run_count = 1 Stageout_status = 1 Exit_status = 0 Submit_arguments = -N merge_pairs -q std -l nodes=1:ppn=10 -l mem=60gb -l w alltime=30000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog /EZ_atlas/.snakemake/tmp.6aq5vwnd/snakejob.merge_pairs.70.sh history_timestamp = 1579094007 project = _pbs_project_default

Sofie8 commented 4 years ago

Hi Silas,

No, it looks like I made it worse; now it cannot make the temp files. Full error log: atlas23beta_setupGM.pbs.zip

submit command: qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.11aobbon/snakejob.initialize_qc.190.sh
Traceback (most recent call last):
  File "/ddn1/vol1/site_scratch/leuven/314/vsc31426/newatlas23beta/cluster/scheduler.py", line 63, in <module>
    p = Popen(command, stdout=PIPE, stderr=PIPE)
  File "/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas23beta/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.11aobbon/snakejob.initialize_qc.190.sh': 'qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.11aobbon/snakejob.initialize_qc.190.sh'
Error submitting jobscript (exit code 1):

qstat -f -x 50160970

<?xml version="1.0"?>

50160970.tier2-p-moab-2.tier2.hpc.kuleuven.beatlas23beta_setupGM.pbsvsc31426@tier2-p-login-4.genius.hpc.kuleuven.be00:00:081239212kb00:00:30110368kb0Cq24htier2-p-moab-2.tier2.hpc.kuleuven.belp_h_microbeu1579134844tier2-p-login-4.genius.hpc.kuleuven.be:/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline/atlas23beta_setupGM.pbs.e50160970r23i27n16/0-4nnnaesofie.thijs@uhasselt.be1579134895tier2-p-login-4.genius.hpc.kuleuven.be:/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline/atlas23beta_setupGM.pbs.o5016097001579134844True1:ppn=524:00:0020gbbigmem1genius1036PBS_O_QUEUE=qdef,PBS_O_HOME=/user/leuven/314/vsc31426,PBS_O_LOGNAME=vsc31426,PBS_O_PATH=/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/bin:/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/singleM/fxtract:/user/leuven/314/vsc31426/.cargo/bin:/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/metannotate/metannotate/included_software:/vsc-hard-mounts/leuven-user/314/vsc31426/.local/bin:/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas34binner/DAS_Tool:/apps/leuven/bin:/usr/local/bin:/usr/lpp/mmfs/bin/:.:/usr/bin:/usr/sbin:/usr/lib64/qt-3.3/bin:/opt/moab/bin:/opt/mam/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/usr/NX/bin:/opt/puppetlabs/bin,PBS_O_MAIL=/user/leuven/314/vsc31426/inbox,PBS_O_SHELL=/bin/bash,PBS_O_SUBMIT_FILTER=/apps/leuven/bin/qsub_filter,PBS_O_WORKDIR=/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline,PBS_O_HOST=tier2-p-login-4.genius.hpc.kuleuven.be,PBS_O_SERVER=tier2-p-moab-2.tier2.hpc.kuleuven.bevsc31426vsc31426E15791348441-A lp_h_microbe atlas23beta_setupGM.pbs15791348651False1579134895031.175807tier2-p-login-4.genius.hpc.kuleuven.be/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline115104857600kb104857600kballowthreadsr23i27n16:ppn=50-4005r23i27n16

✔ [Jan/16 01:38] vsc31426@tier2-p-login-4 /vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline $ qstat -f -x 50160970.tier2-p-moab-2.tier2.hpc.kuleuven.be <?xml version="1.0"?>

50160970.tier2-p-moab-2.tier2.hpc.kuleuven.beatlas23beta_setupGM.pbsvsc31426@tier2-p-login-4.genius.hpc.kuleuven.be00:00:081239212kb00:00:30110368kb0Cq24htier2-p-moab-2.tier2.hpc.kuleuven.belp_h_microbeu1579134844tier2-p-login-4.genius.hpc.kuleuven.be:/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline/atlas23beta_setupGM.pbs.e50160970r23i27n16/0-4nnnaesofie.thijs@uhasselt.be1579134895tier2-p-login-4.genius.hpc.kuleuven.be:/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline/atlas23beta_setupGM.pbs.o5016097001579134844True1:ppn=524:00:0020gbbigmem1genius1036PBS_O_QUEUE=qdef,PBS_O_HOME=/user/leuven/314/vsc31426,PBS_O_LOGNAME=vsc31426,PBS_O_PATH=/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/bin:/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/singleM/fxtract:/user/leuven/314/vsc31426/.cargo/bin:/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/metannotate/metannotate/included_software:/vsc-hard-mounts/leuven-user/314/vsc31426/.local/bin:/vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas34binner/DAS_Tool:/apps/leuven/bin:/usr/local/bin:/usr/lpp/mmfs/bin/:.:/usr/bin:/usr/sbin:/usr/lib64/qt-3.3/bin:/opt/moab/bin:/opt/mam/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/usr/NX/bin:/opt/puppetlabs/bin,PBS_O_MAIL=/user/leuven/314/vsc31426/inbox,PBS_O_SHELL=/bin/bash,PBS_O_SUBMIT_FILTER=/apps/leuven/bin/qsub_filter,PBS_O_WORKDIR=/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline,PBS_O_HOST=tier2-p-login-4.genius.hpc.kuleuven.be,PBS_O_SERVER=tier2-p-moab-2.tier2.hpc.kuleuven.bevsc31426vsc31426E15791348441-A lp_h_microbe atlas23beta_setupGM.pbs15791348651False1579134895031.175807tier2-p-login-4.genius.hpc.kuleuven.be/vsc-hard-mounts/leuven-data/314/vsc31426/scripts/3.atlas_pipeline115104857600kb104857600kballowthreadsr23i27n16:ppn=50-4005r23i27n16
SilasK commented 4 years ago

@fconstancias Thank you for your information.

I updated the clusterprofile template accordingly. The partition is now defined with -q, and I take the number in front of the dot as the job id.
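A hedged sketch of the idea (not necessarily the exact code in the updated scheduler.py): take the digits before the first dot of whatever qsub prints:

# qsub on this cluster prints e.g. "10487.pbs01"
res = "10487.pbs01\n"
jobid = int(res.strip().split(".")[0])  # -> 10487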

Can you submit jobs now? What if you re-download the clusterprofile?

SilasK commented 4 years ago

@Sofie8

Ok, if you change the line to:

p = Popen(command, stdout=PIPE, stderr=PIPE, check=True, shell=True)

This is the same as in the cluster-profile for pbs.

In theory, for testing you can run:

$HOME/.config/snakemake/cluster/scheduler.py /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.11aobbon/snakejob.initialize_qc.190.sh
fconstancias commented 4 years ago

@fconstancias Thank you for your information.

I updated the clusterprofile template accordingly. The partition is now defined with -q, and I take the number in front of the dot as the job id.

Can you submit jobs now? What if you re-download the clusterprofile?

Thanks for the update.

I updated the scheduler.py and key_mapping.yaml accordingly.

head scheduler.py  key_mapping.yaml 
==> scheduler.py <==
#!/usr/bin/env python3

import sys, os
from subprocess import Popen, PIPE
import yaml
import re

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

==> key_mapping.yaml <==
# only parameters defined in key_mapping (see below) are passed to the command in the order specified.
system: "pbs"  #check if system is defined below

slurm:
  command: "sbatch --parsable"
  key_mapping:
    name: "--job-name={}"
    threads: "-n {}"
    mem: "--mem={}g"
    account: "--account={}"

Then, I ran atlas init (from my metagenome-atlas conda environment):

atlas init --db-dir /gpfs1/scratch/florentin/db/atlas --working-dir EZ_atlas --data-type metagenome --assembler megahit --threads=10 --skip-qc 01_QC

dry run was fine :

atlas run -w EZ_atlas -c /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml --profile ~/.config/snakemake/cluster/ --jobs 4 genecatalog -n

so I run atlas run :

atlas run -w EZ_atlas -c /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/config.yaml --profile ~/.config/snakemake/cluster/ --jobs 4 genecatalog

As you can see in the attached log file, atlas is now able to submit jobs! log.txt

submit command: qsub -N init_pre_assembly_processing -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /gpfs1/scratch/florentin/EZ/Experiment2/test_gene_catalog/EZ_atlas/.snakemake/tmp.a7o3xg6m/snakejob.init_pre_assembly_processing.84.sh
Submitted job 84 with external jobid '10614'.

but there is an issue :

snakemake.exceptions.WorkflowError: Failed to obtain job status. See above for error message.

qstat confirmed that jobs are submitted properly

qstat

Job id Name User Time Use S Queue


10600.pbs01 init_pre_assemb florentin 00:00:12 R dev
10601.pbs01 init_pre_assemb florentin 00:00:04 R dev
10602.pbs01 init_pre_assemb florentin 0 Q dev
10603.pbs01 init_pre_assemb florentin 0 Q dev
10604.pbs01 init_pre_assemb florentin 0 Q dev
10605.pbs01 init_pre_assemb florentin 0 Q dev
10606.pbs01 init_pre_assemb florentin 0 Q dev
10607.pbs01 init_pre_assemb florentin 0 Q dev
10608.pbs01 init_pre_assemb florentin 0 Q dev
10609.pbs01 init_pre_assemb florentin 0 Q dev
10610.pbs01 init_pre_assemb florentin 0 Q dev
10611.pbs01 init_pre_assemb florentin 0 Q dev
10612.pbs01 init_pre_assemb florentin 0 Q dev
10613.pbs01 init_pre_assemb florentin 0 Q dev
10614.pbs01 init_pre_assemb florentin 0 Q dev

updating the entire ~/.config/snakemake/cluster/ with cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git gave the same issue.

SilasK commented 4 years ago

@fconstancias Great, submitting works.

Now the cluster status. To begin with, this is not optional. You could comment the line cluster_status in the cluster/config.yaml . Status is important if a job gets killed or finishes early.

On one of your running jobs could you try:

$HOME/.config/snakemake/cluster/cluster_status.py 10600.pbs01

and

$HOME/.config/snakemake/cluster/cluster_status.py 10600

You should get running back. And if you take the id of a finished job, you should get either success or failed.

fconstancias commented 4 years ago

@SilasK yes that's cool, I am closer than ever.

To begin with, this is not optional. You could comment the line cluster_status in the cluster/config.yaml

Are you sure? Running on a toy dataset, I had the feeling that it was stuck because of that error. Or do you mean that if I comment it out, then it is not checking, so it should work?

There is no cluster_status.py, only the following .yaml and .py files:

ls ~/.config/snakemake/cluster/
cluster_config.yaml  config.yaml  key_mapping.yaml  lsf_status.sh  pbs_status.py  scheduler.py  slurm_status.py

pbs_status.py was not executable by default.

~/.config/snakemake/cluster/pbs_status.py 10637.pbs01
-bash: /home/florentin/.config/snakemake/cluster/pbs_status.py: Permission denied
chmod +x ~/.config/snakemake/cluster/pbs_status.py

~/.config/snakemake/cluster/pbs_status.py 10637.pbs01

Traceback (most recent call last):
  File "/home/florentin/.config/snakemake/cluster/pbs_status.py", line 12, in <module>
    xmldoc = ET.ElementTree(ET.fromstring(res.stdout.decode())).getroot()
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

~/.config/snakemake/cluster/pbs_status.py 10637

Traceback (most recent call last):
  File "/home/florentin/.config/snakemake/cluster/pbs_status.py", line 12, in <module>
    xmldoc = ET.ElementTree(ET.fromstring(res.stdout.decode())).getroot()
  File "/home/florentin/miniconda3/envs/metagenome-atlas/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

SilasK commented 4 years ago

@fconstancias
For today, you could just comment the line

cluster-status: "pbs_status.py" #

in the cluster/config.yaml
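i.e. put a # in front of it, so the profile no longer calls the status script:

# cluster-status: "pbs_status.py"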

I will try to find a fix.

Sofie8 commented 4 years ago

@SilasK When changing the line to p = Popen(command, stdout=PIPE, stderr=PIPE, check=True, shell=True), it says that check is not a valid argument, etc. So I changed it to: p = Popen(command, stdout=PIPE, stderr=PIPE, shell=True)

Then it gave an error in line 75 of scheduler.py. I tried to understand what it does: it reads my job id, which is just the ID before the first dot. So I changed jobid= int(res.strip().split()[-1]) into jobid= int(res.strip().split('.')[-6])

Running: atlas run all -w /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria --profile /ddn1/vol1/site_scratch/leuven/314/vsc31426/newatlas23beta/cluster --jobs 5

It finally submits jobs successfully:

rule initialize_qc:
    input: /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/raw/X5A3_R1.fastq.gz, /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/raw/X5A3_R2.fastq.gz
    output: X5A3/sequence_quality_control/X5A3_raw_R1.fastq.gz, X5A3/sequence_quality_control/X5A3_raw_R2.fastq.gz
    log: X5A3/logs/QC/init.log
    jobid: 182
    wildcards: sample=X5A3
    priority: 80
    threads: 4
    resources: mem=10, java_mem=8, time=0.5

submit command: qsub -N initialize_qc -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.dh22p5o1/snakejob.initialize_qc.182.sh
Submitted job 182 with external jobid '50161380'.

It then gave the same error you describe above: pbs_status.py: Permission denied. So I gave it execute permission (chmod +x ~/.config/snakemake/cluster/pbs_status.py) and tested: ~/.config/snakemake/cluster/pbs_status.py 50161380. It said success.

Now, however, it gives an error in the next job:

Error in rule get_read_stats:
    jobid: 0
    output: X5A1/sequence_quality_control/read_stats/raw.zip, X5A1/sequence_quality_control/read_stats/raw_read_counts.tsv
    log: X5A1/logs/QC/read_stats/raw.log (check log file(s) for error message)

The log says: /usr/bin/bash: line 2: reformat.sh: command not found

reformat.sh is correctly installed; it just needs my atlas env to be activated (source activate atlas23beta). In the previous rule, initialize_qc, it successfully activates the environment (Activating conda environment: /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas23beta/conda_envs/b70c4153), but not for get_read_stats. So is it executing outside my atlas23beta env? Do I need to add the atlas23beta env bin to my bashrc profile?

Lastly, when I submit the job as a pbs script, I still get the error:

submit command: qsub -N get_read_stats -A lp_h_microbe -l nodes=1:ppn=4 -l mem=10gb -l walltime=3000 /ddn1/vol1/site_scratch/leuven/314/vsc31426/Valeria/.snakemake/tmp.cfc8ds4n/snakejob.get_read_stats.59.sh
Traceback (most recent call last):
  File "/ddn1/vol1/site_scratch/leuven/314/vsc31426/newatlas23beta/cluster/scheduler.py", line 66, in <module>
    raise Exception("Job can't be submitted\n"+output.decode("utf-8")+error.decode("utf-8"))
Exception: Job can't be submitted
qsub: submit error (Job rejected by all possible destinations (check syntax, queue resources, ...))

Sofie8 commented 4 years ago

@SilasK

OK, I asked my systems admin, and the system doesn't allow snakemake to run qsub from within a pbs script. So that is the reason for the rejection error.

In screen, from my login node, yes it works. So the question is whether snakemake only submits qsub jobs via my login node, or also does actual calculations at some point (small calculations are allowed on the login node, but not extensive ones).

So the only thing I need to fix is that each qsub job knows it has to load the appropriate conda environment?

Well, in the end, I think that for my case, with not too many jobs in parallel but many samples in one job, the pbs script submission also still worked, I guess. I was just trying to see how I can run it most efficiently.

SilasK commented 4 years ago

@Sofie8 Great, you got the submit and status working.

For the problem with the missing reformat.sh

In theory, if you start atlas in the atlas23beta env, then it should find reformat.sh. But you get the error even when submitting the jobscript from inside the atlas23beta environment?

Check if you have initialized conda correctly: conda activate base; conda init bash

@fconstancias It seems the output of qstat -f -x is not the same as Sofie's. That's why the status script doesn't work.