Closed: tseemann closed this issue 4 years ago.
If the reference genome is larger, it takes more time to map a read because the search space is larger, so multi-threading performs better when the reference is a large genome such as hg38. Since your reference is a bacterial genome, each thread finishes its job quickly, and the threads queue up to get new jobs. In that case, the running time is bounded by reading the FASTQ files. If your reference were hg38, I think 32 threads would perform much faster than 16.
I don't think reading the FASTQ from disk is a bottleneck. I can put it in RAM, e.g. /dev/shm, and the problem still exists. Are there any options or constants you can suggest changing to improve the throughput?
That's very interesting. I'll find out which parts are the bottlenecks.
I think the bottleneck is the design of reading the input data: every thread has to line up to read the FASTQ file. Suppose every thread needs 1 second of exclusive file access to fetch a chunk of 4000 reads, and 10 seconds to map those reads. Then with more than 10 threads, threads have to wait their turn for file access, and adding threads no longer helps. That's why reading the FASTQ is the bottleneck.
I was wondering: if the FASTQ file were divided into N parts for N threads, each thread could read only its assigned part of the file, all at the same time, and would not have to wait for file access. However, file seeking also takes time. I'll try to implement this idea and see how it works.
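The tricky part of that partitioning idea is realigning each byte-offset to a FASTQ record boundary, since a quality line may also begin with '@'. A minimal in-memory sketch (hypothetical helper names, not MapCaller code) only accepts an '@' line as a header when the line two below starts with '+':

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: find the offset of the first full FASTQ record at or
// after `start`. An '@' line counts as a header only if the line two lines
// below starts with '+', because quality strings may also begin with '@'.
size_t next_record_offset(const std::string &fq, size_t start) {
    size_t pos = start;
    if (pos != 0) {  // skip the (possibly partial) line we landed in
        pos = fq.find('\n', pos);
        if (pos == std::string::npos) return std::string::npos;
        ++pos;
    }
    while (pos < fq.size()) {
        size_t l2 = fq.find('\n', pos);      // end of candidate header line
        if (l2 == std::string::npos) break;
        size_t l3 = fq.find('\n', l2 + 1);   // end of sequence line
        if (l3 == std::string::npos) break;
        if (fq[pos] == '@' && fq[l3 + 1] == '+') return pos;
        pos = l2 + 1;
    }
    return std::string::npos;
}

// Split the buffer into n roughly equal partitions, aligned to record starts;
// thread i would then process [offs[i], offs[i+1]).
std::vector<size_t> partition_offsets(const std::string &fq, int n) {
    std::vector<size_t> offs;
    for (int i = 0; i < n; ++i)
        offs.push_back(next_record_offset(fq, fq.size() * i / n));
    return offs;
}
```

The same boundary-scan works on a seekable file with fseek/fgets; note it assumes uncompressed input, since plain gzip streams cannot be seeked into at arbitrary offsets.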
@hsinnan75 if the FASTQ files were compressed with bgzip or gzip --rsyncable, maybe this seeking behaviour would be quicker?
Can you have separate threads for reading and decoding the FASTQ, and put the reads into block queues for processing?
I partitioned the input data and had every thread read its own partition. However, the performance did not improve. I also changed the block size to 200; the performance looked the same. I'll keep trying to find other ways.
MapCaller does not scale beyond 16 threads for me on bacterial genomes with ~1 M reads. Is there something special about 16? I don't think the bottleneck is disk I/O, so I am wondering if you have some hard-coded "chunking" constants or other settings that could affect performance past 16 threads. Maybe
#define BlockSize 100
?