lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

Low CPU usage for nanopore data #771

Closed: rlorigro closed this issue 2 years ago

rlorigro commented 3 years ago

Because of how minimap2 creates batches, each batch takes as long as its longest read takes to map. This is particularly poorly suited to HMW nanopore sequencing, since its read length distribution has a long right tail. This effectively creates the worst-case scenario for CPU efficiency, where the vast majority of each batch is mapped long before the longest read finishes. My latest (winnowmap) run took 38 hrs to map:

[plot: CPU usage over the 38 hr run]

The average CPU usage was 25% (up to the beginning of sorting), and the cost was $138 on the lowest-memory AWS instance that supports 96 threads.

Can we have an option to sort batches by length, or use dynamic load balancing?

lh3 commented 3 years ago

The minimap2 behavior is the result of requiring reproducible output under limited memory. Better multithreading efficiency would either require unstable memory usage or generate output in random order.

If you have large memory, use a large -K (e.g. -K4g) to increase the batch size. If cost matters, use an instance with ~16 vCPUs. Minimap2 2.20 performs better in highly repetitive regions than older versions and it is actually more accurate than winnowmap on simulated nanopore data. You can try minimap2 to see if it meets your goal. Minimap2 is a few times faster.
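For concreteness, a minimal sketch of that suggestion, assuming reads in reads.fq.gz and a reference ref.fa (both placeholder names):

# hedged sketch: larger batches via -K on a ~16-vCPU instance
minimap2 -ax map-ont -t 16 -K4g ref.fa reads.fq.gz > aln.sam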

rlorigro commented 3 years ago

This was using -K1.9g. I would guess that most users don't care about reproducible ordering because the order is immediately destroyed during sorting and indexing. Memory is also not really a concern if threading works at a high CPU efficiency, because the added cost/hr is countered by the faster run time.

I take your point about winnowmap vs minimap though. Is there an experiment you can share that shows this result?

lh3 commented 3 years ago

The order is not "immediately destroyed" with stable sorting. Exact reproducibility is an important requirement in government-approved pipelines and in many production pipelines. Unstable memory is a big concern in a cluster setting. With maximum thread efficiency and stable output order, minimap2 could use >100GB memory in rare cases, and this is unpredictable. That would cause far more problems. In general, there is no solution perfect for all scenarios. As I said, using ~16 vCPUs will be more cost effective. Minimap2 v2.18+ also supports large -K, which may have a big effect on huge datasets.

On simulated nanopore results, see this comment. On highly repetitive regions, minimap2 and winnowmap have similar evenness of coverage on CHM13 T2T. If winnowmap hasn't incorporated recent changes in minimap2 v2.20, minimap2 will be more sensitive at aligning kb-long indels.

tpesout commented 3 years ago

It's a fair point about determinism being critical in some contexts. That being said, for most of the cases where I use minimap, cost and time efficiency are what I care about. If you would prefer not to add a dynamic batching option, could you recommend a place in the code where we might fork the repo and make this change for our own internal use?

Separately, if the input reads were sorted by length, this would maintain reproducibility/determinism and mitigate the issue where one long read dominates the runtime of a batch. If batches are determined by total read size instead of read count (maybe this is not the case), my instinct is that a memory explosion wouldn't happen. Does that seem right, or am I missing something? I don't know of a tool that does this, but if there were, it might resolve the issue.
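Such a pre-sort could also be done outside the aligner. A minimal sketch, assuming an uncompressed 4-line-per-record FASTQ with no tabs in the headers; this is not something minimap2 does itself:

# flatten records, sort by sequence length (longest first), restore 4-line layout
paste - - - - < reads.fq \
  | awk -F'\t' '{print length($2) "\t" $0}' \
  | sort -k1,1nr \
  | cut -f2- \
  | tr '\t' '\n' > reads.sorted.fq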

rlorigro commented 3 years ago

I'll just add that the smallest RAM size for any AWS instance with 64 or more threads is already 128GB.

For 96 threads, the cost of doubling the memory is only a 30% increase in hourly price:

192GB: $3.70/hr
384GB: $4.60/hr

lh3 commented 3 years ago

I encourage you to try -K10g or -K20g first. This may greatly help multi-threading efficiency. If the larger batch size doesn't help much, come back to me and I will give you more details on how to implement a new multi-threading model on your end.

lh3 commented 3 years ago

PS: if you use -K10g, it may be better to pipe zcat output to minimap2 to parallelize decompression and fastq parsing for the first batch:

zcat reads.fq.gz | minimap2 -cx map-ont ref.fa - > output.paf

tpesout commented 3 years ago

We'll try this out for our next few runs and see how it goes. Thanks!

rlorigro commented 3 years ago

Here are some results from adjusting the -K parameter. The fastqs were already decompressed, so I didn't use zcat.

[plot: minimap2 vs winnowmap HG002 results; x = time in minutes]

These were run on a 64-thread, ~128GB RAM machine, using these commands for minimap2 and winnowmap:

minimap2 \
-x map-ont \
-t 62 \
--secondary=no \
-a \
-K[n]g \
/home/ubuntu/data/human/reference/hifi/hg002/HG002.maternal.f1_assembly_v2.1.fa \
input.fastq \
-o /dev/null

winnowmap \
-W repetitive_k15.txt \
-x map-ont \
-t 62 \
--secondary=no \
--sv-off \
-a \
-K[n]g \
/home/ubuntu/data/human/reference/hifi/hg002/HG002.maternal.f1_assembly_v2.1.fa \
input.fastq \
-o /dev/null

There is definitely an improvement in minimap2 using -K5g compared to the default (168 min vs 393 min); however, -K10g caused the instance to run out of memory. The total run time and CPU usage for Winnowmap don't change, as expected.

I also ran some alignments on a different library, and found that the memory demand was much greater:

[plot: minimap2 cuttlefish results; x = time in minutes]

This instance had 48 threads and 192GB of RAM, so the available memory per thread was much higher, but it still crashed, even with a lower -K value than in the previous experiment. The command looked like this:

minimap2 \
-x map-ont \
-t 46 \
--secondary=no \
-n 10 \
-a \
-K[n]g \
--split-prefix /tmp/cuttlefish_guppy507_reads_vs_shasta_run1 \
ref.fasta \
reads.fastq \
| samtools view -bh - | samtools sort - -o output.sorted.bam

I'm currently rerunning with -K1g to see if I can get it to finish without crashing.

lh3 commented 3 years ago

Thanks. Lowering peak memory might be easier. On the dev branch, which will be merged into master, I added a new option --cap-kalloc=NUM, which resets the thread-local storage when it grows larger than NUM. I wonder if adding --cap-kalloc=500m (or perhaps --cap-kalloc=1g) may reduce the peak memory. On my server, --cap-kalloc=500m actually speeds up alignment, which is kinda counter-intuitive...
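A sketch of how the new cap might be combined with a large batch size, assuming a dev-branch build; the values here are illustrative, not tuned:

# hedged sketch: large -K for thread efficiency, --cap-kalloc to bound per-thread heap
minimap2 -ax map-ont -t 62 -K5g --cap-kalloc=500m ref.fa reads.fq > aln.sam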

lh3 commented 3 years ago

PS: here are the minimap2 logs with and without --cap-kalloc=500m on a small fastq file (PS: added --cap-kalloc=100m):

## without --cap-kalloc
[M::main] Version: 2.21-dev-r1084-dirty
[M::main] CMD: minimap2 -ct8 hs38-ont.mmi ont-real.fq.gz
[M::main] Real time: 48.037 sec; CPU: 266.825 sec; Peak RSS: 12.529 GB
2d9a846c1c5f24344b78b43cf7075f41  -

## with --cap-kalloc=500m
[M::main] Version: 2.21-dev-r1084-dirty
[M::main] CMD: minimap2 --cap-kalloc=500m -ct8 hs38-ont.mmi ont-real.fq.gz
[M::main] Real time: 47.074 sec; CPU: 260.876 sec; Peak RSS: 10.876 GB
2d9a846c1c5f24344b78b43cf7075f41  -

## with --cap-kalloc=100m
[M::main] Version: 2.21-dev-r1084-dirty
[M::main] CMD: /homes6/hli/minimap2-utec/minimap2 --cap-kalloc=100m -ct8 hs38-ont.mmi ont-real.fq.gz
[M::main] Real time: 47.433 sec; CPU: 264.397 sec; Peak RSS: 9.155 GB
2d9a846c1c5f24344b78b43cf7075f41  -

You can see the peak memory is reduced from 12.529 GB to 10.876 GB. The difference will be larger with more threads and a larger batch size. The md5sum shows that the output is identical with or without this option.

rlorigro commented 3 years ago

What does it mean to reset the thread-local storage? It frees whatever memory was used for previous jobs within the batch and begins reallocating?

Could it be that it was faster because there is less memory fragmentation?

lh3 commented 3 years ago

It was not really faster. Probably just run-to-run variation. Sorry for that. But still, on this small example, adding the option reduces memory without affecting performance. I wonder how it works on large real datasets.

What does it mean to reset the thread-local storage?

Minimap2 manages thread-local allocation with kalloc (e.g. the SW matrix and many transient temporary arrays). Kalloc requests memory with malloc() but doesn't automatically give the memory back to glibc or the system. You have to call km_destroy() to free all memory allocated to a kalloc handler. This new option calls km_destroy() if the kalloc memory used in a thread is above a threshold.

It frees whatever memory was used for previous jobs within the batch and begins reallocating?

Kalloc memory is freed at the end of each batch. However, if a batch is too large, kalloc memory in each thread may grow continuously. If you have 64 threads and each thread holds 2GB of thread-local storage, you will have 128GB of kalloc memory alone.

EDIT: more exactly, after finishing the alignment of one query sequence, minimap2 checks if the kalloc memory used in that working thread is above the --cap-kalloc threshold. If it is, minimap2 frees all kalloc memory and initializes a clean kalloc handler, which is 8MB at the beginning. When minimap2 processes the next query in the same thread, kalloc memory will grow again.

rlorigro commented 3 years ago

I see, thanks for the explanation. That sounds like it could help. If --cap-kalloc works, would there be any reason not to just load all the reads in one batch and let it reset whenever the memory threshold is reached? Also, what would happen if the user set the cap to a value below what is needed for one read? (EDIT: I suppose this wouldn't be an issue because the check is performed after a read is finished.)

lh3 commented 3 years ago

would there be any reason not to just load all the reads in one batch and let it reset whenever the memory threshold is reached?

In this case, minimap2 would stage the entire input and output in memory, which would take a lot of RAM.

Also, what would happen if the user set the cap to a value below what is needed for one read? (EDIT: I suppose this wouldn't be an issue because the check is performed after a read is finished.)

I worry there might be too many parallel heap allocations which may affect the multi-threading efficiency. I could be wrong, though.

rlorigro commented 3 years ago

Going back to the run that crashed, I was able to finish it by using -K1g, but there are some really odd details in the CPU/RAM usage plot:

[plot: CPU/RAM usage; x = time in minutes]

If I am interpreting this correctly, the low CPU regions (at around 1700-2000min) seem to imply that a single read was taking hours to map. Is that possible? (1 thread is occupied by the resource monitoring, and presumably some is occupied by samtools sort/view)

EDIT: If it helps to have the IO activity, here is the full plot:

[plot: CPU/RAM/IO activity]

lh3 commented 3 years ago

Thank you. Most of the time when minimap2 spends a long time on a single query, the alignment would look bad anyway, so it is not worth spending the time on it. As such, minimap2 implements a few heuristics (three, IIRC) to jump out of the alignment loop early if it takes too long. Could you identify the batch of reads that takes a long time to finish based on the minimap2 log? If you can send me the reads, I can implement a new heuristic to prevent this from happening again.

lh3 commented 3 years ago

Just FYI. I fixed a memory-related issue in #749. Nonetheless, I don't know how much it helps to reduce memory for human data.

The long running time of some reads is unrelated to this. If you can identify these offending reads, it will be great. I tried minimap2 on NA12878 ultra-long reads but could not find reads taking extra long time. Thanks.

rlorigro commented 3 years ago

Yes, sorry for the lack of response. The data is not officially public, and the library is so large that it is a bit difficult to narrow down. I could try subsampling or a binary-partitioning approach on the fastq file, I suppose.

rlorigro commented 3 years ago

Is there any option for verbose output or logging from minimap that would report the time/RAM per read ID?

lh3 commented 3 years ago

I have just added a few lines to the dev branch to log the wall-clock time for each read. You can enable the log with --print-qname and find the timing on QT lines. This option also prints the state of kalloc for each read to stderr.
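Assuming the QT lines carry the read name and wall-clock time as tab-separated fields (an assumption; inspect the log and adjust the field numbers), the slowest reads could then be pulled out with something like:

# hedged sketch: capture stderr, then rank reads by the timing field
minimap2 --print-qname -ax map-ont ref.fa reads.fq > aln.sam 2> mm2.log
grep '^QT' mm2.log | sort -k3,3gr | head -n 10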

rlorigro commented 3 years ago

Awesome, thanks. I will rerun this and find the offending reads :+1:

rlorigro commented 3 years ago

Alright, here are the results:

Top 10 duration:
('44f0ce23-63e7-4426-ae20-6cbbde1b8b1c', 23958.287681)
('31fbb352-6c30-43eb-a274-023f102cb38e', 9798.81386)
('0bb2c8fd-045a-43d5-8794-2a7bce988e35', 9197.325375)
('44f0ce23-63e7-4426-ae20-6cbbde1b8b1c', 9037.451345)
('73680f95-095a-49ec-9548-97c8fddf0cb7', 6925.653056)
('0628de89-7de6-4d90-9ef3-1d662a2c738f', 5607.447373)
('0506b03e-567a-4ac3-9ea2-c2b6e9cd787f', 5557.707617)
('20822ee5-285a-4a5a-a529-8aa89e7abe7d', 5006.369922)
('f0cd720d-dac9-4557-b859-bd336bedafa8', 4842.659482)
('033cbc18-6c98-49b1-bb00-68030696c367', 4814.012646)

Top 10 memory usage:
('44f0ce23-63e7-4426-ae20-6cbbde1b8b1c', 20325597184)
('31fbb352-6c30-43eb-a274-023f102cb38e', 8564768768)
('0bb2c8fd-045a-43d5-8794-2a7bce988e35', 8027897856)
('44f0ce23-63e7-4426-ae20-6cbbde1b8b1c', 7692353536)
('73680f95-095a-49ec-9548-97c8fddf0cb7', 5544869888)
('0628de89-7de6-4d90-9ef3-1d662a2c738f', 4898947072)
('20822ee5-285a-4a5a-a529-8aa89e7abe7d', 4362076160)
('f0cd720d-dac9-4557-b859-bd336bedafa8', 4152360960)
('0506b03e-567a-4ac3-9ea2-c2b6e9cd787f', 4135583744)
('033cbc18-6c98-49b1-bb00-68030696c367', 3917479936)

It looks like memory and time consumption (in terms of rank) are almost identical, so I have just uploaded the top 10 slowest. I will email the links separately.

The file totals about 10 Mbp across the 10 reads, so length alone is a possible explanation for why these are so slow. However, some datasets have hundreds of reads longer than 1 Mbp, so these may be particularly difficult for some other reason, like chimerism, repeats, or maybe something about the target sequence.

lh3 commented 3 years ago

Thanks a lot for the examples. It is caused by highly repetitive k-mers on a query sequence. Winnowmap2 is less affected because it chooses one minimizer in a long tandem repeat. I believe the github HEAD should have addressed the issue. Please have a try. If it works, I will run more tests and then cut a new release.

By the way, for your reference genome, it is recommended to add the option -I8g to build a single-part index. That will be faster. Furthermore, you may consider adding -k17 to increase the k-mer size. When minimap2 was developed, the Nanopore error rate was ~15%. With the lower error rates these days, we can use less sensitive settings. This will also speed up mapping.
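Folded into one command, those suggestions might look like this sketch (paths are placeholders; other options as in the earlier runs):

minimap2 -ax map-ont -t 62 --secondary=no -K4g -I8g -k17 ref.fa reads.fq > aln.sam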

rlorigro commented 3 years ago

I believe the github HEAD should have addressed the issue.

To clarify, these reads will be aborted if they take too long? or has the minimizer selection changed?

Please have a try. If it works, I will run more tests and then cut a new release.

Rerunning the full experiment without changing any other parameters is still fairly expensive because of the size of the library and the CPU usage, so I am not sure I can justify it to my lab, unfortunately.

for your reference genome, it is recommended to add option -I8g to build a single-part index

I see, thanks for the heads up. I didn't realize there was a default size cutoff.

you may consider to add -k17 to increase the k-mer size

Good point, thanks for the recommendation.

lh3 commented 3 years ago

I am not sure I can justify it to my lab, unfortunately.

I understand. Thank you so much for running the previous experiments. The examples you sent me are very useful. I will try more real data on my end. Although, as I said, the long running time issue is not obvious on my data, I should be able to see some improvement.

To clarify, these reads will be aborted if they take too long? or has the minimizer selection changed?

Minimizer selection has changed, and I think that is the right approach. Older minimap2 set a threshold for k-mer occurrences on the reference genome but not on the query sequence. For the worst sequence in your set, which is probably a sequencing artifact, there are >100,000 poly-G k-mers. This single poly-G k-mer causes all the problems. Such k-mers should be filtered out.

rlorigro commented 3 years ago

No worries, thanks for taking the time to update Minimap.

Minimizer selection is changed ...

Awesome, that sounds good. And that is a very interesting sequencing error; I might look into it more.

rlorigro commented 3 years ago

We will probably be running similar alignments in the future as this project continues, though, so there will be more opportunities to align this data. Has the --cap-kalloc flag made it into master?

lh3 commented 3 years ago

Yes, --cap-kalloc is available in the master branch but it is not enabled by default. I plan to set a default value in the future after more testing.

rlorigro commented 2 years ago

Hi, just a follow-up on the latest dataset: we now have roughly double the coverage and double the N50, and running with -K4g and --cap-kalloc=2000m is working well:

[plot: CPU/RAM usage]

It finishes quite a bit faster than the -K1g run with half the data (earlier in this issue), and the peak RAM usage is lower.

Here is the full command:

minimap2 \
-x map-ont \
-t 62 \
--secondary=no \
-n 10 \
-a \
-K 4g \
-k 17 \
-I 8g \
--cap-kalloc=2000m

Admittedly this is not a well-controlled test compared to the previous run, but I would say it is still possible to conclude that it is working. I think 114GB is a reasonable RAM usage for 62 threads, especially considering that the smallest available EC2 instance with 64 threads is a 128GB machine, and it could probably be reduced by using a smaller kalloc cap.

rlorigro commented 2 years ago

Here is another run that used 128 threads and -K8g on the same dataset:

[plot: CPU/RAM usage]

You can see that fewer total batches are needed, because the CPU usage plot has fewer up/down cycles. I think this means we could go even higher with -K and keep --cap-kalloc the same.

The full command:

minimap2 \
-x map-ont \
-t 126 \
--secondary=no \
-n 10 \
-a \
-K 8g \
-k 17 \
-I 8g \
--cap-kalloc=2000m

lh3 commented 2 years ago

Thank you so much for the confirmation. Minimap2 v2.23 has just been released with the improvement. I will close the issue now.