hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License

Behaviour scaling to more threads #50

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

MapCaller does not scale beyond 16 threads for me on bacterial genomes with ~1 M reads. Is there something special about 16? I don't think the bottleneck is disk I/O, so I am wondering if you have some hard-coded "chunking" sizes or other settings that could affect performance past 16 threads. Maybe `#define BlockSize 100`?

    -t         WALL TIME
     1  real    0m59.786s
     2  real    0m30.103s
     3  real    0m20.521s
     4  real    0m15.688s
     5  real    0m13.048s
     6  real    0m11.066s
     7  real    0m9.626s
     8  real    0m8.573s
     9  real    0m7.863s
    10  real    0m7.260s
    11  real    0m6.828s
    12  real    0m6.384s
    13  real    0m6.202s
    14  real    0m5.990s
    15  real    0m5.708s
    16  real    0m5.771s
    17  real    0m5.823s
    18  real    0m5.901s
    19  real    0m5.820s
    20  real    0m5.895s
    21  real    0m5.906s
    22  real    0m5.880s
    .....
    59  real    0m6.085s
    60  real    0m6.131s
    61  real    0m6.032s
    62  real    0m6.104s
    63  real    0m6.120s
    64  real    0m6.141s
    65  real    0m6.075s
    66  real    0m6.100s
    67  real    0m6.048s
    68  real    0m6.114s
    69  real    0m6.180s
    70  real    0m6.134s
    71  real    0m6.172s
    72  real    0m6.139s
hsinnan75 commented 4 years ago

If the reference genome is larger, it takes more time to map a read since the search space is larger. Therefore, multi-threading performs better if your reference is a large genome such as hg38. Since your reference is a bacterial genome, each thread completes its job in a short time, and threads queue up to get a new job. In such cases, the running time is bounded by reading the FASTQ files. If your reference were hg38, I think using 32 threads would perform much faster than 16.

tseemann commented 4 years ago

I don't think reading the FASTQ from disk is a bottleneck. I can put it in RAM, e.g. /dev/shm, and the problem still exists. Are there any options or constants you can suggest changing to improve the throughput?

hsinnan75 commented 4 years ago

That's very interesting. I'll find out which parts are the bottlenecks.

hsinnan75 commented 4 years ago

I think the bottleneck is the design of reading the input data. Every thread has to line up to read the FASTQ file. Suppose every thread takes 1 second of file access to fetch 4000 reads each time, and 10 seconds to map those reads; then if there are more than 10 threads, threads have to wait their turn for file access. That's why reading the FASTQ is the bottleneck.

I was wondering: if the FASTQ file were divided into N parts for N threads, each thread could read the file at the same time, touching only the part it is assigned, and would not have to wait for file access. However, file seeking also takes time. I'll try to implement this idea and see how it works.

tseemann commented 4 years ago

@hsinnan75 If the FASTQ files were compressed with bgzip or gzip --rsyncable, maybe this seeking behaviour would be quicker? Could you have separate threads for reading and decoding the FASTQ, putting the reads into block queues for processing?

hsinnan75 commented 4 years ago

I partitioned the input data and had every thread read its own partition of the input. However, the performance did not improve. I also changed the block size to 200; the performance looked the same. I'll keep trying to find other ways.