bam query slow - GitHub Issues

pontushojer commented 1 year ago

I have indexed a quite large barcoded BAM (~220 Gb) file using LRez and now I want to perform queries for barcodes. I have several lists of barcodes with about 2000 entries in each. Unfortunately it is very slow. If I read the paper correctly queries of about 1000 barcodes took at most 10 min. For me it has been running for almost 3 hours with files of 2000 queries without finishing.

Commands

# Index
LRez index bam -b file.bam -o file.bam.bci -f -t 10

# Query
LRez query bam -b file.bam -i file.bam.bci -l list.bxu -o list.bam -t 10 -H

Below is the memory/CPU usage. I am running two query commands in parallel with 10 threads each.

I guess the initial sharp memory incline is from loading the index (about 55Gb on disk); this seems to take about 10 min or so. After that it is presumably doing index lookups for the list of barcodes, which is taking much longer than I would expect. Any idea why this is so slow?

As a side-note it seems that core utilisation is quite poor with only about 1 core per process being used.

clemaitre commented 1 year ago

Hi,

Thank you for the feedback.

We have already encountered a similar slowness problem with a particular linked-read dataset (from TELL-seq) in which the barcode distribution was excessively skewed: most barcodes appeared on only one or two read pairs, and a very small number of barcodes were shared by hundreds of thousands of read pairs. In this case, if the list contains such a barcode with an excessive number of reads to extract, the query can be very slow and the multi-threading seems not to help, since the parallelization is performed by splitting the barcode list.
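To illustrate why splitting the barcode list can leave most threads idle, here is a toy model in Python. It is not LRez code: it assumes that the list is cut into contiguous chunks of equal length (one per thread) and that the cost of a barcode is proportional to its number of reads. The read counts are made up.

```python
# Toy model, not LRez code. Assumptions: the barcode list is cut into
# contiguous, equally long chunks (one per thread), and extracting a
# barcode costs as much as its number of reads.

def chunk_loads(reads_per_barcode, n_threads):
    """Return the total number of reads each thread has to extract."""
    n = len(reads_per_barcode)
    size = -(-n // n_threads)  # ceiling division
    return [sum(reads_per_barcode[i:i + size]) for i in range(0, n, size)]

# Hypothetical skewed list: 999 barcodes with 2 reads, one with 200,000.
counts = [2] * 999 + [200_000]
loads = chunk_loads(counts, n_threads=10)

print(loads)
print("slowest / mean thread load:", max(loads) / (sum(loads) / len(loads)))
```

In this made-up case nine threads finish almost immediately and one thread does nearly all the work, so the run behaves as if it were single-threaded.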

Could this be the case for your dataset? Which linked-read technology was used to produce it?

Best, Claire

pontushojer commented 1 year ago

Hi Claire,

We are using DBS linked reads as described here. The resulting data is quite similar to TELL-Seq but with 20 bp barcodes. The distribution of reads per barcode is certainly skewed in the way you describe. So you are saying that barcodes with a high number of reads would cause this slowdown? The upper-end barcodes in my dataset have on the order of ~10,000 reads associated with them, so not quite hundreds of thousands. With this number of reads per barcode, do you think this is still a problem?

Regarding the multithreading, you say it's parallelised over the list of barcodes? If that were the case, I would expect an initially high CPU load with a gradual decrease as only barcodes with many reads remain. Here the load is more or less constant.

Another side note: would it not be preferable to parallelise over offsets instead of barcodes? This would be more efficient regardless of the linked-read technology. It would also help when doing single queries.
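A minimal sketch of what parallelising over offsets could look like, in Python. This is not how LRez is implemented: the `index` dictionary (barcode to list of file offsets), `fetch`, and `query_by_offsets` are hypothetical names used only for illustration, and `fetch` is a placeholder for seeking to an offset and reading a record.

```python
from concurrent.futures import ThreadPoolExecutor

def split_evenly(items, n_parts):
    """Split items into n_parts slices whose lengths differ by at most one."""
    q, r = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + q + (1 if i < r else 0)
        parts.append(items[start:end])
        start = end
    return parts

def fetch(offsets):
    # Placeholder (hypothetical): a real version would seek to each offset
    # in the BAM file and read the record stored there.
    return [("record", off) for off in offsets]

def query_by_offsets(index, barcodes, n_threads):
    # Pool the offsets of all requested barcodes, then sort them so that
    # each worker walks through the file in increasing order.
    offsets = sorted(off for bc in barcodes for off in index.get(bc, []))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(fetch, split_evenly(offsets, n_threads))
    return [rec for part in results for rec in part]

# Hypothetical index: one heavy barcode and two light ones.
index = {"AAAA": list(range(0, 1000, 10)), "CCCC": [5], "GGGG": [15, 25]}
records = query_by_offsets(index, ["AAAA", "CCCC", "GGGG"], n_threads=4)
print(len(records))
```

Because the unit of work is an offset rather than a barcode, every worker gets almost the same number of records, whatever the reads-per-barcode distribution looks like, and a query for a single heavy barcode is spread over all workers too. The sketch only shows the work split; Python threads would not by themselves speed up CPU-bound decoding.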

Ps. The queries in question finished after about 5.5 hours.

pontushojer commented 1 year ago

I ran LRez stats on the output BAM to check the reads-per-barcode distribution for one of the query lists that finished. This is the output:

Number of barcodes: 2128
Number of mapped reads: 2328152

Number of reads per barcode:
     1st quantile: 53
     median: 170
     3rd quantile: 591

That's about 1000 reads per barcode on average, but with a median of 170 the distribution is quite skewed.
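The average can be checked directly from the two totals in the output above. The short Python calculation below assumes that every mapped read counted by LRez stats belongs to one of the listed barcodes.

```python
# Values copied from the LRez stats output above.
n_barcodes = 2128
n_reads = 2_328_152
median = 170

mean = n_reads / n_barcodes
print(round(mean))              # 1094 reads per barcode on average
print(round(mean / median, 1))  # 6.4, the mean is several times the median
```

A mean more than six times the median means a small number of barcodes carry a large share of the reads.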

clemaitre commented 1 year ago

Hi,

Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.

About 10,000 reads per barcode does not seem high enough to me to slow the query process that much. If you want to check the maximal number of reads per barcode, I just updated LRez stats so that min and max values are also reported in the output (commit cd56d710954f9ca7eaf5e56ee327208711bad134).

I have checked: the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initial high CPU load and then a decrease; I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was so low. But this is definitely an interesting avenue to explore for future developments.

Thank you for these useful comments, and by the way, thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads !

Best, Claire

pontushojer commented 1 year ago

> Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.

It has not been used outside our lab to my knowledge, so not too many datasets are available at the moment. But aside from this issue and https://github.com/morispi/LRez/issues/8 I have had no major issues using LRez on this data.

> About 10,000 reads per barcode does not seem high enough to me to slow the query process that much. If you want to check the maximal number of reads per barcode, I just updated LRez stats so that min and max values are also reported in the output (commit cd56d71).

I pulled the latest version and ran it on the same output BAM:

min: 1
1st quantile: 53
median: 170
3rd quantile: 591
max: 20736

So the maximum is 20736, again not too many in my opinion.

> I have checked: the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initial high CPU load and then a decrease; I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was so low. But this is definitely an interesting avenue to explore for future developments.

> Thank you for these useful comments, and by the way, thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads !

Thanks for the kind words!

morispi / LRez

bam query slow #11