kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License
41 stars 9 forks source link

Profiling results #84

Open standage opened 7 years ago

standage commented 7 years ago

At the moment kevlar is painfully slow. We've been philosophizing for a while now whether this was more likely due to the poor cache locality of khmer's Count-Min Sketch implementation or to slow Fastq parsing/handling, or some other factor or combination of factors. It's time to evaluate this empirically.

I profiled both kevlar count and kevlar novel on the new simulated data set I created in #83. It's small enough for kevlar to process in several minutes, but large enough to observe meaningful numbers in the profile. For kevlar novel, I computed k-mer counts from scratch, not using precomputed counts.

Here are the top 25 functions, sorted by time spent, for each.

# kevlar count
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)                                                                         
 51921279   61.111    0.000   61.111    0.000 {method 'get' of '_khmer.KHashtable   ' objects}                                                  
 36053659   46.578    0.000   46.578    0.000 {method 'add' of '_khmer.KHashtable   ' objects}                                                  
        3   38.588   12.863  187.651   62.550 counting.py:22(load_sample_seqfile)                                                               
   887492   29.514    0.000   29.514    0.000 {method 'get_kmers' of '_khmer.KHashtable   ' objects}                                            
   887495    5.401    0.000    5.401    0.000 __init__.py:97(multi_file_iter_khmer)                                                                887492    2.150    0.000    2.150    0.000 {method 'split' of '_sre.SRE_Pattern' objects}                                     
  1774984    1.703    0.000    6.031    0.000 __init__.py:103(clean_subseqs)                                                                    
   887819    1.123    0.000    1.157    0.000 re.py:278(_compile)                                                                
   887492    0.905    0.000    4.177    0.000 re.py:195(split)                                                                                  
        3    0.323    0.108    0.323    0.108 {method 'save' of '_khmer.KHashtable   ' objects}               
    82/54    0.154    0.002    0.209    0.004 {built-in method _imp.create_dynamic}           
896021/895669    0.153    0.000    0.153    0.000 {built-in method builtins.len}
      257    0.101    0.000    0.101    0.000 {built-in method __new__ of type object at 0x10ffa3080}
      522    0.050    0.000    0.050    0.000 {built-in method marshal.loads}
        3    0.027    0.009    0.027    0.009 {method 'read' of '_io.BufferedReader' objects}
    803/1    0.021    0.000  188.297  188.297 {built-in method builtins.exec}
     2761    0.019    0.000    0.019    0.000 {built-in method posix.stat}
  613/609    0.016    0.000    0.028    0.000 {built-in method builtins.__build_class__}
      243    0.015    0.000    0.015    0.000 {built-in method builtins.compile}
     1190    0.015    0.000    0.064    0.000 <frozen importlib._bootstrap_external>:1215(find_spec)
      522    0.012    0.000    0.018    0.000 <frozen importlib._bootstrap_external>:816(get_data)
        2    0.008    0.004    0.021    0.010 machar.py:116(_do_init)
     6226    0.007    0.000    0.019    0.000 <frozen importlib._bootstrap_external>:50(_path_join)
     6226    0.007    0.000    0.010    0.000 <frozen importlib._bootstrap_external>:52(<listcomp>)
      704    0.007    0.000    0.087    0.000 <frozen importlib._bootstrap>:879(_find_spec)

# kevlar novel
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 67446100   80.216    0.000   80.216    0.000 {method 'add' of '_khmer.KHashtable   ' objects}
  1183526   36.393    0.000   36.393    0.000 {method 'get_kmers' of '_khmer.KHashtable   ' objects}
 29633082   34.668    0.000   34.668    0.000 {method 'get' of '_khmer.KHashtable   ' objects}
        3   25.096    8.365  142.916   47.639 ./kevlar/counting.py:22(load_sample_seqfile)
 22496672   18.973    0.000   54.691    0.000 ./kevlar/novel.py:23(kmer_is_interesting)
        1   14.966   14.966  234.252  234.252 ./kevlar/novel.py:42(main)
   887495    5.090    0.000    5.090    0.000 ./kevlar/__init__.py:97(multi_file_iter_khmer)
   296035    3.894    0.000   10.359    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/site-packages/screed/fastq.py:14(fastq_iter)
   887492    1.877    0.000    1.877    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
  1774984    1.387    0.000    4.978    0.000 ./kevlar/__init__.py:103(clean_subseqs)
  1184137    1.336    0.000    1.980    0.000 {method 'readline' of '_io.BufferedReader' objects}
  7784325    1.175    0.000    1.175    0.000 {method 'append' of 'list' objects}
  1183855    1.153    0.000    1.187    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/re.py:278(_compile)
  1184137    0.998    0.000    3.905    0.000 /usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py:370(readline)
   887492    0.753    0.000    3.462    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/re.py:195(split)
  1184138    0.635    0.000    0.927    0.000 /usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_compression.py:12(_check_not_closed)
   296037    0.544    0.000    0.544    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
  1184137    0.544    0.000    0.968    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/site-packages/screed/utils.py:4(to_str)
   296034    0.506    0.000    0.625    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/site-packages/screed/screedRecord.py:23(__init__)
  1184179    0.424    0.000    0.424    0.000 {method 'decode' of 'bytes' objects}
  1186054    0.366    0.000    0.366    0.000 {method 'startswith' of 'str' objects}
2429318/2428966    0.359    0.000    0.359    0.000 {built-in method builtins.len}
     8852    0.317    0.000    0.317    0.000 {method 'decompress' of 'zlib.Decompress' objects}
  1184140    0.292    0.000    0.292    0.000 /usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py:296(closed)
   296037    0.289    0.000    1.154    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/re.py:170(search)

In both cases, the add and get functions for incrementing and querying k-mer counts from the Count-Min sketch dominate the runtime, with the get_kmers and re.split functions as heavy hitters as well. The latter two have to do with the fact that the khmer's bulk loading functions don't support the kind of preprocessing we need for kevlar, so I'm doing it in Python. This incurs overhead in sending the data from C++ to Python objects, and then doing the processing in Python.

As far as priorities, I'm not sure there's much we can do about the Count-Min sketch implementation. We could try to implement buffering (collect N≈1e4 add operations before actually incrementing the tables) but honestly once the CQF-based counter is integrated we may already get much better performance from that. For the sequence loading, I'll continue to lobby for better support for multiple pre-processing strategies with khmer bulk Fastq loading code.

betatim commented 7 years ago

Can you do your preprocessing on a per read basis or is it per kmer? Wondering if you could speed things along by passing a read to consume_string instead of calling add for each kmer. It might reduce the (notoriously large) function call overhead of python and maybe we make some gains by having the loop in c++?

standage commented 7 years ago

Perhaps. I'd need to create a consume_string_banding function, but probably want to do this anyway,

That said, based on my most recent discussion with Fereydoun, we're probably going to settle for discarding reads with non-ACGT. This represents a small portion of data, and if any of our variant calls are coming only from these potentially problematic reads then that's a cause for skepticism anyway. And if I understand correctly, the Cython-based read pre-processing is moving in this direction anyway, so this should make our life much easier!