rcedgar / urmap

URMAP ultra-fast read mapper
GNU General Public License v2.0

Multithreading #8

Open ksahlin opened 2 years ago

ksahlin commented 2 years ago

Hi @rcedgar,

I ran URMAP in my benchmark on short-read aligners and noticed that, while it is very fast, it has low CPU usage (between 10% and 90%) regardless of -threads N with N=1, 4, 8, or 16. I ran URMAP as follows:

urmap -map2 reads1.fq -reverse reads2.fq -ufi index.ufi -samout outfile.sam -threads N  2>&1 | tee time_and_mem.xt 

I also copied reads and index to the local compute node before aligning to minimize disk latency.

Some details in this twitter thread: https://twitter.com/krsahlin/status/1469327435103621127

Do you have an idea of why that is?

Also, I noticed that URMAP fails with a segfault on all my instances with 2*300nt paired-end reads.

Best, Kristoffer

rcedgar commented 2 years ago

Thanks for the heads-up. What does /bin/time report for CPU usage? The segfault is most likely a bug; I'm happy to take a look if you can post data which reproduces it. Please try to reduce the data size if you do, e.g. by repeatedly halving the input and checking whether the first or second half causes the segfault.
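
A minimal bisection sketch along those lines (assuming standard 4-line FASTQ records; the reads_L.fq / reads_R.fq names are placeholders for the actual files):

# Split the pair into first/second halves, keeping the mate files in sync.
n_reads=$(( $(wc -l < reads_L.fq) / 4 ))
half=$(( n_reads / 2 ))
head -n $(( half * 4 )) reads_L.fq > first_L.fq
head -n $(( half * 4 )) reads_R.fq > first_R.fq
tail -n +$(( half * 4 + 1 )) reads_L.fq > second_L.fq
tail -n +$(( half * 4 + 1 )) reads_R.fq > second_R.fq
# Re-run the mapping on first_*.fq and on second_*.fq, then repeat the split
# on whichever half still segfaults.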

ksahlin commented 2 years ago

For the jobs that had 10%-90% CPU usage (based on /usr/bin/time -v), I was setting -threads 1. Unfortunately, I overwrote the logfiles of those jobs with my latest attempt, in which I omitted the -threads parameter. For the new jobs without -threads, I observed CPU usage between 92% and 97%. I have pasted an example job below. URMAP is given a node with 20 cores and 128 GB of memory.

I also now noticed the warning WARNING: Max OMP threads 1 below. Probably highly relevant?

urmap v1.0.1480_i86linux64, 131Gb RAM, 20 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap

00:20 30.1Gb 100% Reading index index.ufi 
00:20 30.1Gb CPU has 20 cores, defaulting to 10 threads

WARNING: Max OMP threads 1

36:16 30.1Gb 100% Mapping (paired) reads_L.fq 

              20  Seconds to load index
            35.9  Minutes in mapper
      20,000,000  Reads (20.0M)
          9276.4  Reads/sec. (1 threads)
      19,328,546  Mapped Q>=10 (96.6%)
         592,350  Mapped Q< 10 (3.0%)
          79,104  Unmapped (0.4%)

36:16 30.1Gb Exiting

    Command being timed: "urmap -map2 /scratch/23690412/reads_L.fq -reverse /scratch/23690412/reads_R.fq -ufi /scratch/23690412/index.ufi -samout /scratch/23690412/aligned.sam"
    User time (seconds): 1849.33
    System time (seconds): 280.65
    Percent of CPU this job got: 97%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 36:19.22
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 29427296
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 326673019
    Voluntary context switches: 25204
    Involuntary context switches: 5843
    Swaps: 0
    File system inputs: 6449864
    File system outputs: 21951392
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
ksahlin commented 2 years ago

As for the segfault, I ran URMAP on the first 100 paired reads of an instance where it segfaulted. Output and reads pasted below.
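
(For reference, one way to take the first 100 read pairs, assuming standard 4-line FASTQ records; reads_L.fq and reads_R.fq stand for the full input files, and the output names match the command below:)

head -n 400 reads_L.fq > ten_reads_L.fq
head -n 400 reads_R.fq > ten_reads_R.fq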

Command

urmap -map2 ten_reads_L.fq -reverse  ten_reads_R.fq  -ufi Homo_sapiens.GRCh38.dna.fa.ufi  -samout aligned.sam

Output

urmap v1.0.1480_i86linux64, 131Gb RAM, 20 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap

00:19 30.1Gb 100% Reading index Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa.ufi 
00:19 30.1Gb CPU has 20 cores, defaulting to 10 threads
00:19 30.1Gb 0.00% Mapping (paired) test_reads_L.fq/var/spool/slurmd/job23716540/slurm_script: line 15:  9693 Segmentation fault      (core dumped) urmap -map2 test_reads_L.fq -reverse tenst_reads_R.fq -ufi /proj/snic2020-16-138/ultra_eval/genomes/Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa.ufi -samout aligned.sam

test_reads.zip

rcedgar commented 2 years ago

"noticed the warning WARNING: Max OMP threads 1 below. Probably highly relevant?" Yes :-) This means that there is a maximum of 1 thread imposed by your OMP configuration, usually this happens because you complied without linking to the OMP library. Second possibility is that the OMP_NUM_THREADS environment variable is set to 1, but this would be a strange default setting. You can try setting OMP_NUM_THREADS to 99 or something like that to see if it fixes the warning. This will probably make a huge difference to your benchmark so I'd appreciate it if you posed an update / correction once you figure this out. Thanks for the zip file, I'll take a look.

ksahlin commented 2 years ago

Figured so :) Thanks!

I never compiled URMAP. I downloaded the precompiled Linux binary urmap available here. Do you think the binary needs an update?

I don't think the OMP_NUM_THREADS environment variable is set to 1, although I'm not sure how to verify that. The reason I don't think so is that I can compile my own aligner that uses OMP with the -fopenmp flag in the same environment, and multithreading works well.

I will try installing from source instead, perhaps tomorrow, and I will definitely rerun and update the plot once I update the installation.

rcedgar commented 2 years ago

The binary https://github.com/rcedgar/urmap/raw/master/bin/urmap supports OMP. I'm running test_reads.zip with it right now; so far it has worked several times without segfaulting. Did you build the ufi on the same machine where you ran the mapping test? You can check OMP environment variables by running:

env | grep OMP

from the same shell that you run urmap.
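
(Note that grep OMP also matches any variable whose name merely contains the string OMP; a more targeted check, as a sketch, would be something like:)

env | grep -E '^OMP_'
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS:-<not set>}"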

rcedgar commented 2 years ago

32-bit usearch uses similar OMP code, so a small usearch test could be a quicker and simpler probe of your OMP settings than running urmap (no long load time for the index).
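
For example, something along these lines (a sketch; it assumes usearch accepts a -threads option, and the read file name is a placeholder):

/usr/bin/time -v ./usearch11.0.667_i86linux32 -cluster_fast 50k_reads_R.fq -id 0.9 -centroids nr.fasta -threads 10
# Compare "Percent of CPU this job got" with the requested thread count.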

ksahlin commented 2 years ago

Thanks for this! I will compile from source tomorrow and also make sure the index-building and alignment jobs are run on the same compute node (currently I copy the index to the compute node of the alignment job).

If it is because of copying the index, it is strange that all the 300nt read instances segfault while all the 100, 150, 200, and 250nt instances complete without a segfault. Anyway, I will have a look tomorrow.

rcedgar commented 2 years ago

I don't recommend recompiling from source -- I recommend using the precompiled binary, because I just verified (1) that it supports OMP multi-threading and (2) that if I build the index and then run the mapping test on your reads, it does not segfault. If you get different results with the same binary, then it must be something different about your machine, in which case I would probably need ssh access to troubleshoot. I agree it's puzzling that the shorter reads don't crash, but I need to reproduce a bug to be able to analyze it.

ksahlin commented 2 years ago

I verified that I have the same binary. I do

wget https://github.com/rcedgar/urmap/raw/master/bin/urmap
chmod +x urmap
$ ./urmap 

urmap v1.0.1480_i86linux64, 264Gb RAM, 40 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap
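
(A stricter check that we are running byte-identical binaries would be to compare checksums, e.g.:)

sha256sum urmap
# compare the hash against the one computed for the binary on your side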

OMP environment:

$ env | grep OMP
CONDA_PROMPT_MODIFIER=(base)

I start a job on a node as:

urmap -make_ufi Homo_sapiens.GRCh38.dna.primary_assembly.fa  -output tmp_index.ufi
urmap -map2 300_L.fq -reverse 300_R.fq -ufi tmp_index.ufi -samout aligned.sam

This is the output (I run all jobs through the Snakemake workflow).

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   urmap_align
    1

[Mon Dec 13 09:25:58 2021]
rule urmap_align:
    input: /proj/snic2020-16-138/strobemap_eval/reads_PE/SIM3/300_L.fq, /proj/snic2020-16-138/strobemap_eval/reads_PE/SIM3/300_R.fq, /proj/snic2020-16-138/ultra_eval/genomes/Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa.ufi
    output: /proj/snic2020-16-138/strobemap_eval/time_and_mem/urmap/16/300/aligning_time_and_mem.txt, /proj/snic2020-16-138/strobemap_eval/alignments/urmap/16/300.sam
    jobid: 0
    wildcards: n_threads=16, read_length=300

Copied reads to: 23738026

urmap v1.0.1480_i86linux64, 131Gb RAM, 20 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap

00:00 3.1Mb 
00:00 3.1Mb   Genome size  3.2G
00:00 3.1Mb         Slots  5392814809 (5.4G)
00:00 3.1Mb   Load factor  0.58
00:00 3.1Mb   Word length  24
00:00 3.1Mb    Max abund.  32
00:00 3.1Mb 
00:24 3.5Gb 100% Reading Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa 
00:32 3.4Gb Converting to upper case
00:35 3.4Gb Initialize seqs
00:57 30.4Gb Init vectors
01:00 35.8Gb Init slot counts
03:39 35.8Gb Count slots
03:42 41.2Gb Init minus slot counts
06:23 41.2Gb Count slots minus
15:51 41.2Gb Index
15:51 41.2Gb 0 slots truncated
18:08 41.2Gb Writing /scratch/23738026/urmap_index.ufi
19:14 41.2Gb Exiting

    Command being timed: "urmap -make_ufi /proj/snic2020-16-138/ultra_eval/genomes/Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa -output /scratch/23738026/urmap_index.ufi"
    User time (seconds): 851.02
    System time (seconds): 139.41
    Percent of CPU this job got: 85%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 19:16.73
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 40147516
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 143158456
    Voluntary context switches: 102794
    Involuntary context switches: 1297
    Swaps: 0
    File system inputs: 1053640
    File system outputs: 58719936
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
Created index in: 23738026

urmap v1.0.1480_i86linux64, 131Gb RAM, 20 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap

00:18 30.1Gb 100% Reading index urmap_index.ufi 
00:18 30.1Gb CPU has 20 cores, defaulting to 10 threads

WARNING: Max OMP threads 1

00:18 30.1Gb 0.00% Mapping (paired) reads_L.fqCommand terminated by signal 11
    Command being timed: "urmap -map2 /scratch/23738026/reads_L.fq -reverse /scratch/23738026/reads_R.fq -ufi /scratch/23738026/urmap_index.ufi -samout /scratch/23738026/aligned.sam"
    User time (seconds): 0.02
    System time (seconds): 19.45
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.48
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 29426964
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 7371003
    Voluntary context switches: 20
    Involuntary context switches: 49
    Swaps: 0
    File system inputs: 0
    File system outputs: 104
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
[Mon Dec 13 09:45:49 2021]
Error in rule urmap_align:
    jobid: 0
    output: /proj/snic2020-16-138/strobemap_eval/time_and_mem/urmap/16/300/aligning_time_and_mem.txt, /proj/snic2020-16-138/strobemap_eval/alignments/urmap/16/300.sam

RuleException:
CalledProcessError in line 262 of /domus/h1/kris/source/alignment_evaluation/evaluation_multithread/Snakefile:
Command 'set -o pipefail;  /usr/bin/time -v  urmap -map2 /scratch/$SLURM_JOB_ID/reads_L.fq -reverse /scratch/$SLURM_JOB_ID/reads_R.fq -ufi /scratch/$SLURM_JOB_ID/urmap_index.ufi -samout /scratch/$SLURM_JOB_ID/aligned.sam 2>&1 | tee /scratch/$SLURM_JOB_ID/aligning_time_and_mem.txt' returned non-zero exit status 11.
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2318, in run_wrapper
  File "/domus/h1/kris/source/alignment_evaluation/evaluation_multithread/Snakefile", line 262, in __rule_urmap_align
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in _callback
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 546, in cached_or_run
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
creating /proj/snic2020-16-138/strobemap_eval/alignments/urmap/16/

As for USEARCH32, I also see that it is limited in CPU usage, but I don't get the OMP warning:

$ wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz
$ gunzip usearch11.0.667_i86linux32.gz
$ chmod +x usearch11.0.667_i86linux32 
$ ./usearch11.0.667_i86linux32 
usearch v11.0.667_i86linux32, 4.0Gb RAM (264Gb total), 40 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

Clustering of 50k 300nt (short read) sequences:

usearch v11.0.667_i86linux32, 4.0Gb RAM (264Gb total), 20 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 73Mb    100.0% Reading 50k_reads_R.fq
00:00 40Mb   CPU has 20 cores, defaulting to 10 threads
00:00 76Mb    100.0% DF
00:00 77Mb   50000 seqs, 50000 uniques, 50000 singletons (100.0%)
00:00 77Mb   Min size 1, median 1, max 1, avg 1.00
00:00 75Mb    100.0% DB
00:16 189Mb   100.0% 48364 clusters, max size 32, avg 1.0
00:16 189Mb   100.0% Writing centroids to nr.fasta       

      Seqs  50000 (50.0k)
  Clusters  48364 (48.4k)
  Max size  32
  Avg size  1.0
  Min size  1
Singletons  47850 (47.9k), 95.7% of seqs, 98.9% of clusters
   Max mem  189Mb
      Time  16.0s
Throughput  3125.0 seqs/sec.

    Command being timed: "./usearch11.0.667_i86linux32 -cluster_fast 50k_reads_R.fq -id 0.9 -centroids nr.fasta"
    User time (seconds): 16.01
    System time (seconds): 0.27
    Percent of CPU this job got: 97%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.67
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 145868
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 4
    Minor (reclaiming a frame) page faults: 121336
    Voluntary context switches: 1368
    Involuntary context switches: 51
    Swaps: 0
    File system inputs: 66728
    File system outputs: 30520
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

I will put the full 300nt read dataset on my Dropbox to see if you can reproduce the segfault with urmap. I will contact you over email with the link.

ksahlin commented 2 years ago

You should now have received a dropbox link to a dataset for which URMAP fails on the server I'm running experiments on.

rcedgar commented 2 years ago

Worked ok for me: I re-built the index with your FASTA, then mapping ran to completion with 96.9% of reads mapped at Q>=10. How much RAM does your server have?

ksahlin commented 2 years ago

Very strange. I reran the job, not as part of the Snakemake pipeline but as a single job submitted to a node. The nodes have 20 cores and 131 GB of memory (see the specs below). It again gives a segmentation fault, as shown below. I have pasted both the script and the output.

I'm running the jobs on UPPMAX, a centralized server for academic computing. Getting an account and setup here, and additionally applying for membership in my project, is not a quick process from what I remember of setting up my own account. Up to you. I will send my project number in an email if you decide to go this route.

Perhaps I can try it on some other server through a collaboration (in the future), and in the meantime we can let this rest and see if it comes up for someone else?

As for the results I tweeted about and, perhaps more importantly, the upcoming analysis in a paper: I could keep the accuracy, memory usage, and runtime for one core (an analysis I also have for all aligners) but exclude the multithreading results, with a note referring to this issue, and I would also make a note about the 300nt read-length instances referring to this issue. Or I could leave URMAP out of the analysis completely. Which one do you prefer?

Script:

$ cat test_urmap.sh 
#!/bin/sh
#SBATCH -A snic2020-5-651
#SBATCH --time=02:00:00
#SBATCH -p node
#SBATCH -n 1
#SBATCH -C mem128GB
# Memory per node specification is in MB. It is optional. 
# The default limit is 3000MB per core.
#SBATCH --job-name="test_urmap"
#SBATCH --mail-user=XXX
#SBATCH --mail-type=ALL

#conda init bash
set -o errexit

tmp_index="/scratch/$SLURM_JOB_ID/urmap_index.ufi"
genome="/proj/snic2020-16-138/ultra_eval/genomes/Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa"
tmp_folder="/scratch/$SLURM_JOB_ID/"

urmap -make_ufi $genome  -output $tmp_index
urmap -map2 /proj/snic2020-16-138/strobemap_eval/reads_PE/SIM3/300_L.fq -reverse /proj/snic2020-16-138/strobemap_eval/reads_PE/SIM3/300_R.fq -ufi $tmp_index -samout $tmp_folder/aligned.sam

cp $tmp_folder/aligned.sam aligned_urmap.sam

Output:

$ cat slurm-23745045.out 

urmap v1.0.1480_i86linux64, 131Gb RAM, 20 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap

00:00 3.1Mb 
00:00 3.1Mb   Genome size  3.2G
00:00 3.1Mb         Slots  5392814809 (5.4G)
00:00 3.1Mb   Load factor  0.58
00:00 3.1Mb   Word length  24
00:00 3.1Mb    Max abund.  32
00:00 3.1Mb 
00:24 3.5Gb 100% Reading Homo_sapiens.GRCh38.dna.primary_assembly_modified_headers.fa 
00:31 3.4Gb Converting to upper case
00:33 3.4Gb Initialize seqs
00:51 30.4Gb Init vectors
00:54 35.8Gb Init slot counts
03:27 35.8Gb Count slots
03:30 41.2Gb Init minus slot counts
06:03 41.2Gb Count slots minus
15:09 41.2Gb Index
15:09 41.2Gb 0 slots truncated
16:44 41.2Gb Writing /scratch/23745045/urmap_index.ufi
17:48 41.2Gb Exiting

urmap v1.0.1480_i86linux64, 131Gb RAM, 20 cores
(C) Copyright 2019 Robert C. Edgar
https://drive5.com/urmap

00:17 30.1Gb 100% Reading index urmap_index.ufi 
00:17 30.1Gb CPU has 20 cores, defaulting to 10 threads
00:18 30.8Gb 0.54% Mapping (paired) 300_L.fq/var/spool/slurmd/job23745045/slurm_script: line 22:  5882 Segmentation fault      (core dumped) urmap -map2 /proj/snic2020-16-138/strobemap_eval/reads_PE/SIM3/300_L.fq -reverse /proj/snic2020-16-138/strobemap_eval/reads_PE/SIM3/300_R.fq -ufi $tmp_index -samout $tmp_folder/aligned.sam
ksahlin commented 2 years ago

And here are the details on the computing nodes and processors:

Rackham consists of 334 compute servers (nodes) where each compute server 
consists of two 10-core Xeon E5-2630 V4 processors running at 2.2 GHz (turbo 3.1 GHz). 
We provide 272 nodes with 128 GB memory (r33-r304) and 32 nodes with 256 GB (r1-r32). 
All nodes are interconnected with a 2:1 oversubscribed FDR (56 Gb/s) Infiniband fabric. 
In total Rackham provides 6080 CPU cores in compute nodes.

Some job info for the last job (it seems to have reached a peak memory of 38 GB, probably during indexing):

2021-12-13 22:58:42 jobid=23745045 jobstate=FAILED nodes=r53 procs=20 partition=node 
qos=normal jobname=test_urmap maxmemory_in_GiB=38.3 maxmemory_node=r53
submit_time=2021-12-13T22:27:56 runtime=00:18:12 
jianshu93 commented 2 years ago

Hello all,

I can help to double-check. I do not have any problems with a few datasets, except for the NM tag problem. Maybe you could also help with that (there is no NM tag for identity estimation)? Also, I do not see how many cores you request in the sbatch script. Maybe also try it in an interactive mode?

jianshu.zhao@gatech.edu

thanks,

Jianshu

ksahlin commented 2 years ago

Thanks for helping out @jianshu93! I have now shared the link to the dataset on Google Drive.

I specified

#SBATCH -p node
#SBATCH -n 1

This means the job is given a whole node (the partition is specified with the -p parameter), and the node has 20 cores. To request only one core, one does

#SBATCH -p core
#SBATCH -n 1

details under 'Details about the "core" and "node" partitions' here
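
For completeness, a minimal sbatch sketch combining the points above (core partition, explicit -threads, and an explicit OMP_NUM_THREADS; the project name and file paths are placeholders):

#!/bin/sh
#SBATCH -A <project>
#SBATCH -p core
#SBATCH -n 16

export OMP_NUM_THREADS=16
/usr/bin/time -v urmap -map2 reads_L.fq -reverse reads_R.fq -ufi index.ufi -samout aligned.sam -threads 16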