Slow queries for human genome

simonepignotti commented 6 years ago

Querying human reads to the human genome (ref) is very slow (~500 RPM, a 1000x slowdown compared to querying bacterial reads to the same index).

Technical reason

This is due to highly repeted k-mers in the index, causing very large portions of the SA to be checked by the function bwa_sa2pos. Re-assembling with prophyle_assembler before creating the index brings the speed up to 25k RPM.

The number of calls to each function, obtained with gprof while querying 1000 simulated reads with (_compact) and without (_repeat) pre-assembly, are attached. human_compact.pdf human_repeat.pdf

Suggested solutions

Since the main use of this would be read filtering, one possible solution is to implement a new command prophex filter which stops querying k-mers as soon as a given threshold of k-mer hits is reached.

Furthermore, we could only check that the SA interval is non-empty instead of retrieving the position, but this would create some false positives due to k-mers on the border of two contigs. Alternatively, once the interval is computed, we can stop retrieving the positions in the text as soon as one k-mer is verified using bwa2pos.

salikhov-kamil commented 6 years ago

@simonepignotti so the output of prophex filter should be exactly the same as for prophex query, the only difference is that we can stop querying k-mers if threshold is reached?

salikhov-kamil commented 6 years ago

concerning second idea: I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes. It will cover your case, when there is only one node, but possibly can improve speed in other situations too

simonepignotti commented 6 years ago

the output of prophex filter should be exactly the same as for prophex query, the only difference is that we can stop querying k-mers if threshold is reached?

The best would be to output the reads passing the filter, and discard those which don't. It would not make much sense to use the normal format I guess, since we won't have the information about all k-mers and the output would be incomplete and useless. The threshold should be a float 0 <= t <= 1, and all reads with #matches/len > t should appear in the output.

I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes

I think it's a good idea. @karel-brinda , is there any reason why this could go wrong?

EDIT: even if we implement your suggestion, I think we should keep checking for contigs' borders. I imagine the situation of using prophex to index the output unitigs of BCalm or other DBG tools, which can be very fragmented. That would produce a lot of false positives...

karel-brinda commented 6 years ago

The best would be to output the reads passing the filter

We need to keep all reads, even those that don't pass.

It would not make much sense to use the normal format I guess

I think it would be good to keep the normal (Kraken-like) format and just slightly adjust it. For instance, one of the Kraken columns is for classified / unclassified – this would perfectly serve our purposes.

I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes

I think it's a good idea. @karel-brinda , is there any reason why this could go wrong?

I don't completely understand why we can do that. Could you please provide some additional explanation?

simonepignotti commented 6 years ago

I think it would be good to keep the standard format

In this case we also need to provide a script to perform the actual filtering (retrieve the name of "classified" reads in Prophex' output and output the raw reads). What about doing both at the same time, instead? Normal output in one file, filtered reads in another one.

Could you please provide some additional explanation?

If I understood correctly, once every node (or ref) has been matched, it is useless to continue querying the index for the same k-mer (the output list won't change, since it already includes every node). When there is only one node, as in the case of the human genome, this means to stop querying the index for that k-mer after its first occurrence, which is exactly what we want. In the case of an index with multiple nodes, this would not affect the correctness of the results, and it could eventually (but very rarely, and only when there are few nodes) speed up the query. @salikhov-kamil is that correct?

salikhov-kamil commented 6 years ago

yes, you are right, Simone

karel-brinda commented 4 years ago

@simonepignotti Do you remember what was the k-mer size?

karel-brinda commented 4 years ago

A quick update: the problem are not repetitive k-mers, but something else.

karel-brinda commented 4 years ago

@simonepignotti How did you make the gprof plots? I would like to do the same also with bwa fastmap to see what's so different.

karel-brinda commented 4 years ago

I did extensive testing and profiling.

1) Some reads are much worse than others. Some reads take several seconds to be processed by ProPhex (time is in seconds):

hg38@chr1_9000644_9001048_0:0:0_0:0:0_d3f] 6.230
hg38@chr1_144281574_144282209_2:0:0_0:0:0_3c8] 5.491
hg38@chr1_13983825_13984340_0:0:0_1:0:0_177] 5.460
hg38@chr1_109221543_109222069_1:0:0_0:0:0_b92] 5.379
hg38@chr1_67477732_67478302_3:0:0_1:0:0_1262] 4.180
hg38@chr1_10982750_10983187_0:0:0_5:0:0_e17] 4.055
hg38@chr1_202922392_202922823_1:0:0_1:0:0_c19] 3.927
hg38@chr1_168456983_168457470_1:0:0_0:0:0_ae8] 3.920
hg38@chr1_56701931_56702372_0:0:0_2:0:0_332] 3.811
...

Example:

hg38@chr1_45044284_45044790_0:0:0_0:0:0_5] 2.102

>hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTATATATATATATTTATATATATATATATATATATATATATATTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTGGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCGATCTCGGCTCACTGCAAGCTCCGCCTTCCGGGTTCAC

2) Within the bad reads some k-mers are worse than others

Timing

39@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.000 sec
40@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.000 sec
41@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.000 sec
42@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
43@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.002 sec
44@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.048 sec
45@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.056 sec
46@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.065 sec
47@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.075 sec
48@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.075 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.084 sec
50@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.002 sec
51@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
52@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
53@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
54@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
55@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec

K-mers:

>39@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
ATATATTTTTTTTTTTTTTTTTTTTTTTTGA
>40@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TATATTTTTTTTTTTTTTTTTTTTTTTTGAG
>41@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
ATATTTTTTTTTTTTTTTTTTTTTTTTGAGA
>42@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TATTTTTTTTTTTTTTTTTTTTTTTTGAGAC
>43@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
ATTTTTTTTTTTTTTTTTTTTTTTTGAGACG
>44@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTTTTGAGACGG
>45@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTTTGAGACGGA
>46@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTTGAGACGGAG
>47@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTGAGACGGAGT
>48@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTGAGACGGAGTC
>49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTGAGACGGAGTCT
>50@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTGAGACGGAGTCTG
>51@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTGAGACGGAGTCTGG
>52@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTGAGACGGAGTCTGGC
>53@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTGAGACGGAGTCTGGCT
>54@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTGAGACGGAGTCTGGCTC
>55@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTGAGACGGAGTCTGGCTCT

When queried in isolation:

49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.324 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.088 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.083 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.085 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.082 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.081 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.081 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.080 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.083 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.086 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.086 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.087 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.094 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.098 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.109 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.106 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.104 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.112 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.120 sec

3) ProPhex vs BWA fastmap profiling with this k-mer (queried 100x)

ProPhex

 time   seconds   seconds    calls  ms/call  ms/call  name    
 86.58      6.90     6.90  2164600     0.00     0.00  bwt_sa
 12.92      7.93     1.03 67140800     0.00     0.00  bwt_occ
  0.25      7.95     0.02  2164600     0.00     0.00  bns_pos2rid
  0.13      7.96     0.01  2164600     0.00     0.00  bwa_sa2pos
  0.13      7.97     0.01  2164600     0.00     0.00  get_node_from_contig
  0.00      7.97     0.00     3100     0.00     0.00  bwt_2occ
  0.00      7.97     0.00      455     0.00     0.00  add_contig
  0.00      7.97     0.00      294     0.00     0.00  err_fread_noeof
  0.00      7.97     0.00      204     0.00     0.00  realtime
  0.00      7.97     0.00      202     0.00     0.00  ks_getuntil2
  0.00      7.97     0.00      200     0.00     0.00  strncat_with_check
  0.00      7.97     0.00      200     0.00     0.00  wrap_strdup
  0.00      7.97     0.00      116     0.00     0.00  wrap_malloc
  0.00      7.97     0.00      102     0.00     0.00  kseq_read

BWA fastmap

 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00     6000     0.00     0.00  bwt_occ4
  0.00      0.00     0.00     3000     0.00     0.00  bwt_2occ4
  0.00      0.00     0.00     3000     0.00     0.00  bwt_extend
  0.00      0.00     0.00      910     0.00     0.00  wrap_strdup
  0.00      0.00     0.00      294     0.00     0.00  err_fread_noeof
  0.00      0.00     0.00      202     0.00     0.00  fread_fix
  0.00      0.00     0.00      201     0.00     0.00  ks_getuntil2
  0.00      0.00     0.00      200     0.00     0.00  err_fputc
  0.00      0.00     0.00      200     0.00     0.00  err_printf
  0.00      0.00     0.00      200     0.00     0.00  err_puts
  0.00      0.00     0.00      200     0.00     0.00  smem_next
  0.00      0.00     0.00      101     0.00     0.00  kseq_read
  0.00      0.00     0.00      100     0.00     0.00  bwt_smem1a
  0.00      0.00     0.00      100     0.00     0.00  smem_set_query
  0.00      0.00     0.00       15     0.00     0.00  wrap_calloc

Graph for ProPhex:

karel-brinda commented 4 years ago

Profiling results for ProPhex vs. BWA fastmap for the worst k-mer (queried 100x): profile_fastmap_difficult_read_1_worst_kmer.complete.pdf profile_prophex_difficult_read_1_worst_kmer.complete.pdf

karel-brinda commented 4 years ago

Suffix array for the k-mer from above (TTTTTTTTTTTTTTTTTTTGAGACGGAGTCT):

[prophex:calculate_sa_interval .1] i=0 c=3 k=4539333982 l=6418572210 l-k=1879238228
[prophex:calculate_sa_interval .1] i=1 c=1 k=2761983427 l=3209286105 l-k=447302678
[prophex:calculate_sa_interval .1] i=2 c=3 k=5198118155 l=5339020178 l-k=140902023
[prophex:calculate_sa_interval .1] i=3 c=2 k=4323085135 l=4347238055 l-k=24152920
[prophex:calculate_sa_interval .1] i=4 c=0 k=1324382888 l=1331473074 l-k=7090186
[prophex:calculate_sa_interval .1] i=5 c=2 k=3491824866 l=3493578234 l-k=1753368
[prophex:calculate_sa_interval .1] i=6 c=2 k=3947466886 l=3948077429 l-k=610543
[prophex:calculate_sa_interval .1] i=7 c=1 k=2722352611 l=2722484491 l-k=131880
[prophex:calculate_sa_interval .1] i=8 c=0 k=830500363 l=830614701 l-k=114338
[prophex:calculate_sa_interval .1] i=9 c=2 k=3377186124 l=3377290508 l-k=104384
[prophex:calculate_sa_interval .1] i=10 c=0 k=1009703143 l=1009802683 l-k=99540
[prophex:calculate_sa_interval .1] i=11 c=2 k=3415289782 l=3415380212 l-k=90430
[prophex:calculate_sa_interval .1] i=12 c=3 k=5404768974 l=5404853733 l-k=84759
[prophex:calculate_sa_interval .1] i=13 c=3 k=6076954737 l=6077036093 l-k=81356
[prophex:calculate_sa_interval .1] i=14 c=3 k=6284397221 l=6284475869 l-k=78648
[prophex:calculate_sa_interval .1] i=15 c=3 k=6363431038 l=6363506407 l-k=75369
[prophex:calculate_sa_interval .1] i=16 c=3 k=6392481913 l=6392551992 l-k=70079
[prophex:calculate_sa_interval .1] i=17 c=3 k=6403378710 l=6403442947 l-k=64237
[prophex:calculate_sa_interval .1] i=18 c=3 k=6408038539 l=6408097103 l-k=58564
[prophex:calculate_sa_interval .1] i=19 c=3 k=6410415243 l=6410469074 l-k=53831
[prophex:calculate_sa_interval .1] i=20 c=3 k=6411930558 l=6411981473 l-k=50915
[prophex:calculate_sa_interval .1] i=21 c=3 k=6413051658 l=6413100140 l-k=48482
[prophex:calculate_sa_interval .1] i=22 c=3 k=6413932009 l=6413978095 l-k=46086
[prophex:calculate_sa_interval .1] i=23 c=3 k=6414668720 l=6414712340 l-k=43620
[prophex:calculate_sa_interval .1] i=24 c=3 k=6415299831 l=6415340952 l-k=41121
[prophex:calculate_sa_interval .1] i=25 c=3 k=6415840952 l=6415879019 l-k=38067
[prophex:calculate_sa_interval .1] i=26 c=3 k=6416302381 l=6416337123 l-k=34742
[prophex:calculate_sa_interval .1] i=27 c=3 k=6416692460 l=6416723634 l-k=31174
[prophex:calculate_sa_interval .1] i=28 c=3 k=6417019522 l=6417047105 l-k=27583
[prophex:calculate_sa_interval .1] i=29 c=3 k=6417293252 l=6417317753 l-k=24501
[prophex:calculate_sa_interval .1] i=30 c=3 k=6417523807 l=6417545452 l-k=21645

This corresponds to the true number of occurrences of this k-mer:

Karels-iMac:hg_oneline karel$ cat one_line_hg38.fna | grep -o TTTTTTTTTTTTTTTTTTTGAGACGGAGTCT | wcl
   10789
Karels-iMac:hg_oneline karel$ cat one_line_hg38.fna | grep -o AGACTCCGTCTCAAAAAAAAAAAAAAAAAAA | wcl
   10857

Sum = 21646

This means that bwa fastmap must be using some tricks to highly optimize the sa -> pos translation.

simonepignotti commented 4 years ago

@simonepignotti Do you remember what was the k-mer size?

I am pretty sure it was k=25, but the tests for other values of k were comparable

@simonepignotti How did you make the gprof plots? I would like to do the same also with bwa fastmap to see what's so different.

I used https://github.com/jrfonseca/gprof2dot but I see you figured it out, sorry for the late reply!

This means that bwa fastmap must be using some tricks to highly optimize the sa -> pos translation.

I am sure we can implement a similar trick, but if I remember correctly the filter command implemented by @salikhov-kamil in https://github.com/prophyle/prophex/tree/kamil/prophex_filter should skip the sa -> pos translation when the only information needed is whether the index contains a k-mer or not. It's a different problem, but may be useful for some specific use cases like filtering human reads. @karel-brinda, have you tried to compare prophex filter to bwa fastmap?

EDIT: see #16 and #22 for more details about the filter command

karel-brinda commented 4 years ago

Now I've finished a profiling experiment with simplitigs. Unfortunately, I couldn't reproduce the results from the plot, where all 3 modes had almost the same times (here, the rolling window is the fastest as it should be).

I took hg38-simplitigs31.fa.gz from https://zenodo.org/record/3770419, prepended hg38@ to sequence names, took 100,000 Illumina 150 bp read, and profiled fastmap, prophex-rolling, and prophex-restarted.

Fastmap:

[main] Version: 0.7.15-r1142-dirty
[main] CMD: ../prophex/src/bwa/bwa fastmap -l 31 ../hg38-simplitigs31.fa.gz ../reads100x.31mers.fa
[main] Real time: 371.220 sec; CPU: 352.387 sec

Prophex-rolling

[prophex:query] match time: 28.87 sec
[prophex::query] Processed 100000 reads in 28.522 CPU sec, 28.871 real sec

Prophex-rolling

[prophex:query] match time: 15.44 sec
[prophex::query] Processed 100000 reads in 15.227 CPU sec, 15.440 real sec

I'm attaching the profile outputs:

profile_fastmap_reads100x.complete.pdf profile_fastmap_reads100x.partial.pdf profile_prophex-res_reads100x.complete.pdf profile_prophex-res_reads100x.partial.pdf profile_prophex-rol_reads100x.complete.pdf profile_prophex-rol_reads100x.partial.pdf

karel-brinda commented 4 years ago

The most important observation from today:

Even with simplitigs, there are k-mers for which querying is super slow. Again, it's due to very slow SA->master string coordinate conversion, due to a too large SA interval, no idea why. For instance, the k-mer TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT has a suffix array interval of length 12878 (see the attachment).

I have absolutely no idea how this can be possible. The experiment was done with the simplitig index, i.e., this k-mer is present only once in the FASTA. Moreover, even if I merge all the simplitigs into a single one, this k-mer still appears only once (i.e., it's not due to fake k-mers on the borders).

@salikhov-kamil Kamil, do you have any idea about what could be going on? I'm attaching a detailed ProPhex log.

hardest_kmer_SAs.txt

karel-brinda commented 4 years ago

Interestingly, 32mers with all possible extension are matched in the k=32 mode exactly as they should be (31T_ACTG.txt). The only explanation I can come up with is that BWA uses some weird stretches of sequences between contigs. @salikhov-kamil What do you think?

prophyle / prophex

Slow queries for human genome #13

Technical reason

Suggested solutions