ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] Error when running template #233

Closed: cd3050 closed this issue 1 year ago

cd3050 commented 1 year ago

Describe the bug
I installed PGAP and tried running the template first. I used the MG37 genome, and the command I ran is below:

./pgap.py \
    -n -o MG37 \
    --container-path /scratch/cd3050/pytorch-example/pgap-utils_2022-10-03.build6384.sif \
    --docker singularity \
    test_genomes/MG37/input.yaml

However, I got an error message as below:

PGAP version 2022-10-03.build6384 is up to date.
Output will be placed in: /scratch/cd3050/Jonas/PGAP/MG37.3
WARNING: open files is less than the recommended value of 8000
PGAP failed, docker exited with rc = 255
Unable to find error in log file.

Log Files
The log file is attached: cwltool.log

Thanks very much!

azat-badretdin commented 1 year ago

Thank you, Caichen Duan, for your report!

Your log has the following error:

"/export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/serial/objistrjson.cpp", line 214: ncbi::CObjectIStreamJson::UnexpectedMember() --- line 1: "taxon": unexpected member, should be one of: "strain" "genus_species"  ( at JsonValue.organism)

which indicates that you are using an outdated submol.yaml input (from one of the very old installations?). Please use the test_genomes directory from a new installation. By default, a new installation goes under $HOME/.pgap; you may want to check there for the newly installed files.
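A minimal sketch of how you could locate and inspect the freshly installed test input (the exact layout under $HOME/.pgap and the submol.yaml file name are assumptions based on a default installation and may differ on your system):

# List the MG37 test input shipped with the new installation
# (adjust the path if your installation went elsewhere).
ls $HOME/.pgap/test_genomes/MG37/

# Check the organism block of the new submol.yaml: per the error above,
# current inputs accept only "genus_species" and "strain" here, not "taxon".
grep -A 3 'organism' $HOME/.pgap/test_genomes/MG37/submol.yaml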

cd3050 commented 1 year ago

Hi, thanks for your reply. I reinstalled and this is what I used this time:

Original command: ./pgap.py --docker singularity -r -o mg37_results /vast/cd3050/pgap/local/test_genomes/MG37/input.yaml

But I still got an error from running this pipeline. Could you please help me look at the new log file?

cwltool.log

azat-badretdin commented 1 year ago

Snippet:

2042006/000/0000/P  DB032896396D6741 0007/0007 2022-12-12T02:21:24.605072 log-1.hpc.nyu.edu UNK_CLIENT      UNK_SESSION              cluster_blastp_wnode Info: LIB "wn_app.cpp", line 321: ncbi::CGPX_WorkerApp::Run() --- output path: /tmp/xoxa3hqq/output
[2022-12-12 02:21:39] INFO [job cluster_blastp_wnode_2] Max memory used: 212MiB
[2022-12-12 02:21:39] WARNING [job cluster_blastp_wnode_2] was terminated by signal: SIGKILL
[2022-12-12 02:21:51] WARNING [job cluster_blastp_wnode_2] completed permanentFail
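As an aside, a quick way to surface this kind of failure in the log yourself (a sketch assuming a standard grep over the attached cwltool.log):

# Pull out warnings, job failures, and memory high-water marks.
grep -E 'WARNING|permanentFail|SIGKILL|Max memory used' cwltool.log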

The SIGKILL / permanentFail above typically points to the system being tight on resources (even though MG37 is one of the smallest prokaryotic genomes, sequence-wise, so it should not be demanding).

The relevant section is reported earlier in the log:

--- Start Runtime Report ---
{
    "CPU cores": 96,
    "Docker image": "ncbi/pgap:2022-10-03.build6384",
    "cpu flags": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities",
    "cpu model": "Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz",
    "max user processes": 770035,
    "memory (GiB)": 188.0,
    "memory per CPU core (GiB)": 2.0,
    "open files": 1024,
    "tmp disk space (GiB)": 70.1,
    "virtual memory": "unlimited",
    "work disk space (GiB)": 2808598.5
}
--- End Runtime Report ---

Obviously, your system is not formally "tight": you have 96 CPU cores. But with that many cores PGAP launches many workers in parallel, and at only 2 GiB of memory per core they can collectively exceed what is actually available to the job, which is the likely cause of the SIGKILL above.

I would recommend starting by reducing the number of cores, e.g. pgap.py --cpus 8
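A sketch of what the adjusted run could look like, reusing the paths from your earlier command (the new output directory name, and whether your HPC allows raising ulimit, are assumptions):

# Optionally raise the open-files limit first; the earlier warning flagged
# 1024 open files versus the recommended 8000.
ulimit -n 8000

# Re-run the MG37 test with PGAP capped at 8 CPU cores.
./pgap.py --docker singularity -r --cpus 8 \
    -o mg37_results_cpus8 \
    /vast/cd3050/pgap/local/test_genomes/MG37/input.yaml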

cd3050 commented 1 year ago

It works! Thank you very much! I hadn't specified the number of CPUs for the process, so it probably picked up the HPC node's default. Thanks for your explanation.

azat-badretdin commented 1 year ago

You are very welcome, Caichen Duan!