metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

WorkflowError: Failed to obtain job status. #646

Closed: irvinng98 closed this issue 1 year ago

irvinng98 commented 1 year ago

Firstly, thank you for making such a wonderful pipeline.

I am trying to run atlas with

atlas run qc --profile cluster

and after a lot of tweaking of the Snakemake files, I finally got it to run on my PBS cluster. However, I now get this error message:

Submitted job 21 with external jobid '5002595'.

[Wed May 10 16:41:13 2023]
rule initialize_qc:
    input: /scratch/irvinng/metagenome-atlas/test_reads/sample1_R1.fastq.gz, /scratch/irvinng/metagenome-atlas/test_reads/sample1_R2.fastq.gz
    output: sample1/sequence_quality_control/sample1_raw_R1.fastq.gz, sample1/sequence_quality_control/sample1_raw_R2.fastq.gz
    log: sample1/logs/QC/init.log
    jobid: 6
    reason: Missing output files: sample1/sequence_quality_control/sample1_raw_R2.fastq.gz, sample1/sequence_quality_control/sample1_raw_R1.fastq.gz
    wildcards: sample=sample1
    priority: 80
    threads: 4
    resources: mem_mb=10000, mem_mib=9537, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, mem=10, java_mem=8, time_min=60, runtime=3600

CLUSTER: 2023-05-10 16:41:14 Automatically choose best queue to submit
CLUSTER: 2023-05-10 16:41:14 Choose queue: mixed_s
CLUSTER: 2023-05-10 16:41:14 parameter 'mem_mb' not in keymapping! It would be better if you add the key to the file: /home/irvinng/.config/snakemake/cluster/key_mapping.yaml 
 I try without the key!
CLUSTER: 2023-05-10 16:41:14 parameter 'queue' not in keymapping! It would be better if you add the key to the file: /home/irvinng/.config/snakemake/cluster/key_mapping.yaml 
 I try without the key!
CLUSTER: 2023-05-10 16:41:14 submit command: qsub -A <allocation> -N initialize_qc -l select=1:ncpus=4:mem=10000mb -l walltime=6000 /scratch/irvinng/metagenome-atlas/.snakemake/tmp.afu9xkjm/snakejob.initialize_qc.6.sh
Submitted job 6 with external jobid '5002596'.
Traceback (most recent call last):
  File "/home/irvinng/.config/snakemake/cluster/pbs_status.py", line 20, in <module>
    xmldoc = ET.ElementTree(ET.fromstring(res.stdout.decode())).getroot()
  File "/scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/xml/etree/ElementTree.py", line 1342, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
Failed to obtain job status. See above for error message.
WorkflowError:
Failed to obtain job status. See above for error message.
  File "/scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/asyncio/runners.py", line 44, in run
  File "/scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Note the path to the log file for debugging.
Documentation is available at: https://metagenome-atlas.readthedocs.io
Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues
Complete log: .snakemake/log/2023-05-10T164104.540110.snakemake.log
[Atlas] CRITICAL: Command 'snakemake --snakefile /scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/site-packages/atlas/workflow/Snakefile --directory /scratch/irvinng/metagenome-atlas  --rerun-triggers mtime --jobs 16 --rerun-incomplete --configfile '/scratch/irvinng/metagenome-atlas/config.yaml' --nolock  --profile cluster --use-conda --conda-prefix /scratch/irvinng/metagenome-atlas/databases/conda_envs     --scheduler greedy  qc   ' returned non-zero exit status 1.

Here is the log:

Config file /scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/site-packages/atlas/workflow/config/default_config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cluster nodes: 16
Singularity containers: ignored
Job stats:
job                          count    min threads    max threads
-------------------------  -------  -------------  -------------
apply_quality_filter             2              8              8
build_decontamination_db         1              8              8
build_qc_report                  1              1              1
calculate_insert_size            2              4              4
combine_insert_stats             1              1              1
combine_read_counts              1              1              1
combine_read_length_stats        1              1              1
deduplicate_reads                2              8              8
finalize_sample_qc               2              1              1
get_read_stats                  10              4              4
initialize_qc                    2              4              4
qc                               1              1              1
qcreads                          2              1              1
run_decontamination              2              8              8
write_read_counts                2              1              1
total                           32              1              8

Select jobs to execute...

[Wed May 10 16:41:10 2023]
rule build_decontamination_db:
    input: /scratch/irvinng/metagenome-atlas/databases/phiX174_virus.fa
    output: ref/genome/1/summary.txt
    log: logs/QC/build_decontamination_db.log
    jobid: 8
    reason: Missing output files: ref/genome/1/summary.txt
    threads: 8
    resources: mem_mb=60000, mem_mib=57221, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, mem=60, java_mem=51, time_min=300, runtime=18000

Submitted job 8 with external jobid '5002594'.

[Wed May 10 16:41:12 2023]
rule initialize_qc:
    input: /scratch/irvinng/metagenome-atlas/test_reads/sample2_R1.fastq.gz, /scratch/irvinng/metagenome-atlas/test_reads/sample2_R2.fastq.gz
    output: sample2/sequence_quality_control/sample2_raw_R1.fastq.gz, sample2/sequence_quality_control/sample2_raw_R2.fastq.gz
    log: sample2/logs/QC/init.log
    jobid: 21
    reason: Missing output files: sample2/sequence_quality_control/sample2_raw_R1.fastq.gz, sample2/sequence_quality_control/sample2_raw_R2.fastq.gz
    wildcards: sample=sample2
    priority: 80
    threads: 4
    resources: mem_mb=10000, mem_mib=9537, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, mem=10, java_mem=8, time_min=60, runtime=3600

Submitted job 21 with external jobid '5002595'.

[Wed May 10 16:41:13 2023]
rule initialize_qc:
    input: /scratch/irvinng/metagenome-atlas/test_reads/sample1_R1.fastq.gz, /scratch/irvinng/metagenome-atlas/test_reads/sample1_R2.fastq.gz
    output: sample1/sequence_quality_control/sample1_raw_R1.fastq.gz, sample1/sequence_quality_control/sample1_raw_R2.fastq.gz
    log: sample1/logs/QC/init.log
    jobid: 6
    reason: Missing output files: sample1/sequence_quality_control/sample1_raw_R2.fastq.gz, sample1/sequence_quality_control/sample1_raw_R1.fastq.gz
    wildcards: sample=sample1
    priority: 80
    threads: 4
    resources: mem_mb=10000, mem_mib=9537, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, mem=10, java_mem=8, time_min=60, runtime=3600

Submitted job 6 with external jobid '5002596'.
WorkflowError:
Failed to obtain job status. See above for error message.
  File "/scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/asyncio/runners.py", line 44, in run
  File "/scratch/irvinng/mambaforge/envs/atlasenv/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-05-10T164104.540110.snakemake.log

Atlas version 2.15.1

Additional context: Not sure if this is relevant, but when I run qstat on my cluster, the output looks something like this:

(atlasenv) [irvinng@login02 metagenome-atlas]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
4984834.pbsha     DP-KIP           ***          49:55:37 R gpu             
4985555.pbsha     HE_NVPF_DOS      ***          96:32:07 R serial_l        
4986184.pbsha     test-gpu         ***                 0 Q gpu             
4986544.pbsha     test-gpu         ***                 0 Q gpu                        
4989461.pbsha     job35.pbs        ***           96:42:28 R serial_l        
4989470.pbsha     svhn_req_3       ***          00:00:06 R gpu             
4989705[].pbsha   train-models-pr* ***                  0 B gpu             
4989723.pbsha     jax_privacy      ***            89:18:48 R gpu             
4989837.pbsha     HE_NVPF_Final_a* ***          2845:33* R parallel_l      
4989999.pbsha     verific          ***          8603:44* R parallel_l      
4990069[].pbsha   mokge            ***                  0 B gpu             
4990919.pbsha     Zn_HE_NVPF_Fina* ***          1355:31* R parallel_l      
4991166.pbsha     yutong_motion    ***                 0 Q gpu             
4991167.pbsha     yutong_motion    ***                 0 Q gpu             

and the full Job id would be #######.pbsha.ib.sockeye

irvinng98 commented 1 year ago

This is my pbs_status.py:

#!/usr/bin/env python3
# Cluster-status script: Snakemake calls this with the external job id and
# expects exactly one of "success", "failed", or "running" on stdout.
import sys
import subprocess
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

jobid = sys.argv[1]

try:
    res = subprocess.run(
        "qstat {}".format(jobid),
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        shell=True,
    )

    if not res.stdout:
        print("failed")
    else:
        # Parse the qstat output as XML and read the job_state element
        xmldoc = ET.ElementTree(ET.fromstring(res.stdout.decode())).getroot()
        job_state = xmldoc.findall(".//job_state")

        if len(job_state) == 0:
            print("failed")
        else:
            job_state = job_state[0].text

            if job_state == "C":
                # Job completed; report success only on a zero exit status
                exit_status = xmldoc.findall(".//exit_status")[0].text
                if exit_status == "0":
                    print("success")
                else:
                    print("failed")
            else:
                print("running")

except (subprocess.CalledProcessError, IndexError, KeyboardInterrupt):
    print("failed")
SilasK commented 1 year ago

I see the submit command there: qsub -A <allocation> -N initialize_qc -l select=1:ncpus=4:mem=10000mb -l walltime=6000

Do you think the -A <allocation> could cause problems?

I also see these warnings:

CLUSTER: 2023-05-10 16:41:14 parameter 'mem_mb' not in keymapping! It would be better if you add the key to the file: /home/irvinng/.config/snakemake/cluster/key_mapping.yaml I try without the key!
CLUSTER: 2023-05-10 16:41:14 parameter 'queue' not in keymapping! It would be better if you add the key to the file: /home/irvinng/.config/snakemake/cluster/key_mapping.yaml I try without the key!

So I suggest you adapt the key_mapping file for these parameters, for example as sketched below.

Previously I used the mem argument in GB; now mem_mb is in megabytes.
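
For illustration, entries for those two parameters might look like the following. This is only a sketch in the profile's key_mapping.yaml format; the exact PBS resource syntax varies by site, and the queue and mem_mb lines here are assumptions, not tested values:

pbs:
  command: "qsub"
  key_mapping:
    name: "-N {}"
    account: "-A {}"
    queue: "-q {}"          # standard PBS queue flag; omit if the cluster rejects explicit queues
    mem_mb: "-l mem={}mb"   # assumption: some PBS Pro sites only accept mem inside -l select=...
    time_min: "-l walltime={}:00"  # minutes rendered as MM:SS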

irvinng98 commented 1 year ago

Thanks for replying! I don't think it's an issue with the allocation or the queue, because otherwise I wouldn't have been able to submit the job at all (which was an issue for me initially, until I tweaked some things). Normally I wouldn't have to specify the queue, as my cluster has a default queue (in fact, if I try to specify one, I get: "qsub: Access to queue is denied"). As for the mem_mb key, my cluster doesn't accept it as a separate option (e.g. -l mem_mb=10000mb); it only takes memory in the form -l select=1:ncpus=4:mem=10000mb. So I had to tweak key_mapping.yaml to look like this:

pbs:
  command: "qsub"
  key_mapping:
    name: "-N {}"
    account: "-A {}"
    threads: "-l select=1:ncpus={}:mem={}mb"
    time_min: "-l walltime={}00" #min= seconds x 100

and my scheduler.py:

# construct command:
for key in cluster_param:
    if key not in key_mapping:
        logging.warning(
            f"parameter '{key}' not in keymapping! It would be better if you add the key to the file: {key_mapping_file} \n I try without the key!"
        )
    else:
        command += " "
        if key == "threads":
            command += key_mapping[key].format(cluster_param[key], cluster_param["mem_mb"])
        else:
            command += key_mapping[key].format(cluster_param[key])
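
With the initialize_qc resources shown above (threads=4, mem_mb=10000, time_min=60), this loop produces exactly the submit command seen in the log:

qsub -A <allocation> -N initialize_qc -l select=1:ncpus=4:mem=10000mb -l walltime=6000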

I was wondering if my error has to do with how the jobs are named, which could be messing up the syntax.

As mentioned, the job IDs have the format #######.pbsha.ib.sockeye, and because I'm not super familiar with Python, I'm not sure whether pbs_status.py (code in my comment above) is able to interpret the status.

SilasK commented 1 year ago

Ok, you found a hack to submit the script. Great!

Ok, let's look at the job status script.

When you run qstat 4989470, what do you get back? XML? Does it need to be qstat 4989470.pbsha to work?

I don't know how to parse the output correctly. I took the code from another PBS/Torque profile and cannot test it, as I don't have a PBS system.

If you get XML back, we can try to figure out how to adapt the code to make it work.

If you know of an option to get simpler, plain-text output from qstat, that would ease the process.
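
In case it helps, here is a minimal sketch of a status script that parses the default tabular qstat output shown above instead of XML. Everything here is an assumption based on the columns in your qstat listing (Job id, Name, User, Time Use, S, Queue) and on common PBS state codes; in particular, a finished job may disappear from plain qstat entirely, in which case qstat exits non-zero and this would print "failed" even for a successful job:

#!/usr/bin/env python3
# Hypothetical pbs_status.py that reads the 'S' column of plain qstat output.
import subprocess
import sys

jobid = sys.argv[1]

try:
    res = subprocess.run(
        ["qstat", jobid],
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    state = None
    for line in res.stdout.decode().splitlines():
        # Job lines start with the numeric id, e.g. "5002596.pbsha ..."
        if line.startswith(jobid.split(".")[0]):
            state = line.split()[-2]  # 'S' is the second-to-last column
            break
    if state in ("Q", "R", "H", "E", "B"):  # queued/running/held/exiting/array-begun
        print("running")
    elif state == "F":  # assumption: PBS Pro 'finished' state, usually only shown by qstat -x
        print("success")
    else:
        print("failed")
except (subprocess.CalledProcessError, KeyboardInterrupt):
    print("failed")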

In the meantime, I think you can also comment out pbs_status.py in the cluster_profile/config.yaml.
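
If the profile follows the usual Snakemake cluster-profile layout, that might look like this in config.yaml (the cluster-status key name is assumed here, not checked against your profile):

cluster: "scheduler.py"
# cluster-status: "pbs_status.py"  # disabled; Snakemake then falls back to its own job-completion checks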

github-actions[bot] commented 1 year ago

There has been no activity for some time. I hope your issue has been solved in the meantime. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.