lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License
312 stars 17 forks source link

Indexing is much slower on fragmented assemblies #10

Closed twrightsman closed 1 year ago

twrightsman commented 1 year ago

I have noticed that indexing takes much longer (>1 hr) on fragmented assemblies that have a few million contigs. Assemblies of similar total assembly length but with hundreds of contigs are much faster, being indexed in minutes. The total assembly sizes are around a gigabase.

This may be expected behavior, and possibly related to https://github.com/lh3/miniprot/issues/3.

lh3 commented 1 year ago

This is not expected. Do you happen to have the link to a fragmented assembly such that I can play with?

lh3 commented 1 year ago

I downloaded a human short-read assembly with 361,157 contigs. It took 2 minutes to index, comparable to indexing the human reference genome.

[M::mp_ntseq_read@27.752*1.00] read 2806031133 bases in 361157 contigs
[M::mp_idx_build@27.756*1.00] 22263202 blocks
[M::mp_idx_build@39.203*5.37] collected syncmers
[M::mp_idx_build@114.259*2.50] 847438891 kmer-block pairs
[M::mp_idx_print_stat] 1129657 distinct k-mers; mean occ of infrequent k-mers: 717.87; 313 frequent k-mers accounting for 36720278 occurrences
[M::main] Version: 0.3-r137

I guess something else is causing the slow indexing on your end. What sequences are you indexing?

twrightsman commented 1 year ago

I am indexing short-read-based assemblies of grass genomes, which are unfortunately not public yet; I will re-run two of the assemblies, one fragmented and one not, to verify this wasn't due to something like random storage latency on the compute cluster.

If you want me to run any diagnostic tools on the assemblies, like k-mer statistics, just let me know!

twrightsman commented 1 year ago

Here is the time output from indexing one of the "problematic" short-read assemblies:

[M::mp_ntseq_read@5.642*1.00] read 2526970524 bases in 5355222 contigs
[M::mp_idx_build@5.682*1.00] 25107504 blocks
[M::mp_idx_build@15819.094*8.00] collected syncmers
[M::mp_idx_build@15839.473*7.99] 772920789 kmer-block pairs
[M::mp_idx_print_stat] 1127614 distinct k-mers; mean occ of infrequent k-mers: 677.41; 120 frequent k-mers accounting for 9148986 occurrences
[M::main] Version: 0.3-r142-dirty
[M::main] CMD: ./miniprot/miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
[M::main] Real time: 15842.331 sec; CPU: 126531.617 sec; Peak RSS: 19.614 GB
        Command being timed: "./miniprot/miniprot -t8 -d short-read-genome.mpi short-read-genome.fa"
        User time (seconds): 126512.43
        System time (seconds): 19.17
        Percent of CPU this job got: 798%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:24:02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20566688
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 997655
        Voluntary context switches: 29
        Involuntary context switches: 217125
        Swaps: 0
        File system inputs: 0
        File system outputs: 8986760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The long-read ones index just fine (<2 mins):

[M::mp_ntseq_read@9.551*1.00] read 4598175493 bases in 535 contigs
[M::mp_idx_build@9.551*1.00] 35923756 blocks
[M::mp_idx_build@37.390*6.15] collected syncmers
[M::mp_idx_build@50.884*4.79] 1479416713 kmer-block pairs
[M::mp_idx_print_stat] 1129322 distinct k-mers; mean occ of infrequent k-mers: 1288.04; 348 frequent k-mers accounting for 25252416 occurrences
[M::main] Version: 0.3-r142-dirty
[M::main] CMD: ./miniprot/miniprot -t8 -d long-read-genome.mpi long-read-genome.fa
[M::main] Real time: 90.484 sec; CPU: 249.604 sec; Peak RSS: 22.869 GB
        Command being timed: "./miniprot/miniprot -t8 -d long-read-genome.mpi long-read-genome.fa"
        User time (seconds): 238.54
        System time (seconds): 11.06
        Percent of CPU this job got: 275%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:30.48
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 23979400
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2047523
        Voluntary context switches: 41
        Involuntary context switches: 728
        Swaps: 0
        File system inputs: 0
        File system outputs: 16310512
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
lh3 commented 1 year ago

There may be something unique to your short-read assembly. In #13, someone indexed a 22.5 Gb short-read assembly in half an hour, which is about the right time.

Could you run indexing under the perf command with:

perf record miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
perf report

And then take a screenshot of perf report? This will help me to pinpoint the exact function that causes the slowness. Thank you. I will also have a look at the source code on my part but without actual data, it may be difficult to identify the problem.

twrightsman commented 1 year ago

Here are the screenshots for the short-read... short-read ...and the long-read... long-read indexing. I let the short-read indexing run for about 15 minutes to collect plenty of samples, while the long-read indexing finished completely during sampling.

lh3 commented 1 year ago

This is very helpful. I will have a look when I get time.

lh3 commented 1 year ago

I believe this has been fixed on the github HEAD. Let me know if you still have the same issue. Thank you for the perf run. Very informative.

twrightsman commented 1 year ago

From 4 hours to 40 seconds! Thank you for your time and glad I could help in some way.