Closed by tmsincomb 2 years ago
# Program: tests/unit/anarci/test_anarci.py
# Multi ON
5.856 <module> test_anarci.py:1
└─ 5.853 test_anarci_multi_on test_anarci.py:714
└─ 5.849 run_multiple sadie/anarci/anarci.py:461
├─ 5.584 map multiprocessing/pool.py:359
│ └─ 5.584 get multiprocessing/pool.py:764
│ └─ 5.584 wait multiprocessing/pool.py:761
│ └─ 5.584 wait threading.py:556
│ └─ 5.584 wait threading.py:280
│ └─ 5.584 lock.acquire ../<built-in>:0
└─ 0.190 join multiprocessing/pool.py:656
└─ 0.190 join multiprocessing/process.py:142
└─ 0.190 wait multiprocessing/popen_fork.py:36
└─ 0.190 poll multiprocessing/popen_fork.py:24
└─ 0.190 waitpid ../<built-in>:0
# Multi OFF
35.126 <module> test_anarci.py:1
└─ 35.124 test_anarci_multi_off test_anarci.py:728
└─ 35.121 run_multiple sadie/anarci/anarci.py:461
└─ 35.106 _run sadie/anarci/anarci.py:379
├─ 16.705 number_sequences_from_alignment sadie/anarci/aa/_anarci.py:838
│ └─ 16.426 run_germline_assignment sadie/anarci/aa/_anarci.py:943
│ └─ 16.130 get_identity sadie/anarci/aa/_anarci.py:923
│ ├─ 13.402 [self]
│ └─ 2.682 str.upper ../<built-in>:0
├─ 15.403 run_hmmer sadie/anarci/aa/_anarci.py:709
│ ├─ 9.257 communicate subprocess.py:1090
│ │ └─ 9.257 _communicate subprocess.py:1926
│ │ └─ 9.257 select selectors.py:403
│ │ └─ 9.257 poll.poll ../<built-in>:0
│ └─ 6.112 parse_hmmer_output sadie/anarci/aa/_anarci.py:685
│ └─ 5.879 __iter__ Bio/SearchIO/HmmerIO/hmmer3_text.py:44
│ └─ 5.879 _parse_qresult Bio/SearchIO/HmmerIO/hmmer3_text.py:98
│ ├─ 5.331 _parse_hit Bio/SearchIO/HmmerIO/hmmer3_text.py:160
│ │ └─ 5.155 _create_hits Bio/SearchIO/HmmerIO/hmmer3_text.py:217
│ │ ├─ 2.865 _parse_aln_block Bio/SearchIO/HmmerIO/hmmer3_text.py:328
│ │ │ ├─ 1.492 search re.py:198
│ │ │ │ ├─ 0.854 Pattern.search ../<built-in>:0
│ │ │ │ └─ 0.496 _compile re.py:289
│ │ │ │ └─ 0.413 [self]
│ │ │ └─ 0.483 [self]
│ │ ├─ 0.744 __init__ Bio/SearchIO/_model/hsp.py:754
│ │ │ └─ 0.551 [self]
│ │ └─ 0.379 [self]
│ └─ 0.526 __init__ Bio/SearchIO/_model/query.py:182
│ └─ 0.524 append Bio/SearchIO/_model/query.py:449
│ └─ 0.383 __setitem__ Bio/SearchIO/_model/query.py:336
├─ 1.650 _add_segment_regions sadie/anarci/result.py:73
│ └─ 1.642 apply pandas/core/frame.py:8676
│ └─ 1.642 apply pandas/core/apply.py:694
│ └─ 1.636 apply_standard pandas/core/apply.py:850
│ └─ 1.408 apply_series_generator pandas/core/apply.py:856
│ └─ 1.248 <lambda> sadie/anarci/result.py:98
│ └─ 1.246 _get_region sadie/anarci/result.py:45
│ └─ 0.933 __init__ pandas/core/series.py:323
│ └─ 0.850 _init_dict pandas/core/series.py:463
│ └─ 0.843 create_series_with_explicit_dtype pandas/core/construction.py:822
│ └─ 0.826 __init__ pandas/core/series.py:323
│ └─ 0.509 ensure_index pandas/core/indexes/base.py:6987
│ └─ 0.503 _with_infer pandas/core/indexes/base.py:672
│ └─ 0.406 __new__ pandas/core/indexes/base.py:397
└─ 1.284 parsed_output sadie/anarci/aa/_anarci.py:290
└─ 0.948 __setitem__ pandas/core/series.py:1072
└─ 0.883 __setitem__ pandas/core/indexing.py:705
└─ 0.830 _setitem_with_indexer pandas/core/indexing.py:1553
└─ 0.821 _setitem_with_indexer_missing pandas/core/indexing.py:1941
With multiprocessing off, about 16 s of the 35 s run over 1000 sequences is spent in a single function:
get_identity sadie/anarci/aa/_anarci.py:923
The nested loops in _scheme.py are just a catch-all for handling lists and don't affect runtime. In practice, if we want a speedup, the places to focus on are run_germline_assignment and the hmmscan logic. We can remove the hmmscan time by switching to hmmsearch in a later merge.
I tried a few optimizations, but the original get_identity code is already plenty fast (~50 µs per call), so I added a cache to get_identity instead. Because state_sequence and germline_sequence take only a finite number of values, the cache overhead is dwarfed by the 10+ second reduction in runtime with multiprocessing off.
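For readers who want to see the shape of the caching idea, here is a minimal sketch using `functools.lru_cache`. It is not the code from this change: `get_identity_cached` and its body are a hypothetical stand-in (a simple fraction of matching aligned positions), and only the argument names `state_sequence` and `germline_sequence` come from the description above.

```python
from functools import lru_cache


# Hypothetical stand-in for get_identity; the real implementation in
# sadie/anarci/aa/_anarci.py may compute identity differently.
@lru_cache(maxsize=None)
def get_identity_cached(state_sequence: str, germline_sequence: str) -> float:
    """Return the fraction of matching positions, skipping gap characters."""
    if len(state_sequence) != len(germline_sequence):
        raise ValueError("sequences must be aligned to the same length")
    state = state_sequence.upper()
    germline = germline_sequence.upper()
    compared = 0
    matches = 0
    for s, g in zip(state, germline):
        if s == "-" or g == "-":
            continue  # ignore gapped positions (an assumption of this sketch)
        compared += 1
        matches += s == g
    return matches / compared if compared else 0.0


# The second call with the same string pair is served from the cache.
first = get_identity_cached("QVQLVQSG", "QVQLVESG")
second = get_identity_cached("QVQLVQSG", "QVQLVESG")
info = get_identity_cached.cache_info()
print(first, second, info.hits, info.misses)
```

Strings are hashable, so they can be used directly as cache keys. An unbounded cache (`maxsize=None`) only makes sense because the set of distinct sequence pairs is finite; with an open-ended input space a bounded `maxsize` would be the safer choice.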
# Program: tests/unit/anarci/test_anarci.py
# Multi ON
3.899 <module> test_anarci.py:1
└─ 3.897 benchmark_anarci_multi_on test_anarci.py:714
└─ 3.890 run_multiple sadie/anarci/anarci.py:461
├─ 3.666 map multiprocessing/pool.py:359
│ └─ 3.666 get multiprocessing/pool.py:764
│ └─ 3.666 wait multiprocessing/pool.py:761
│ └─ 3.666 wait threading.py:582
│ └─ 3.666 wait threading.py:288
│ └─ 3.666 lock.acquire ../<built-in>:0
├─ 0.141 join multiprocessing/pool.py:656
│ └─ 0.141 join multiprocessing/process.py:142
│ └─ 0.141 wait multiprocessing/popen_fork.py:36
│ └─ 0.141 poll multiprocessing/popen_fork.py:24
│ └─ 0.141 waitpid ../<built-in>:0
└─ 0.043 Pool multiprocessing/context.py:115
# Multi OFF
21.448 <module> test_anarci.py:1
└─ 21.448 benchmark_anarci_multi_off test_anarci.py:757
└─ 21.439 run_multiple sadie/anarci/anarci.py:461
└─ 21.428 _run sadie/anarci/anarci.py:379
├─ 17.977 run_hmmer sadie/anarci/aa/_anarci.py:709
│ ├─ 13.666 communicate subprocess.py:1105
│ │ └─ 13.666 _communicate subprocess.py:1947
│ │ └─ 13.666 select selectors.py:403
│ │ └─ 13.666 poll.poll ../<built-in>:0
│ └─ 4.290 parse_hmmer_output sadie/anarci/aa/_anarci.py:685
│ └─ 4.094 __iter__ Bio/SearchIO/HmmerIO/hmmer3_text.py:44
│ └─ 4.093 _parse_qresult Bio/SearchIO/HmmerIO/hmmer3_text.py:98
│ ├─ 3.765 _parse_hit Bio/SearchIO/HmmerIO/hmmer3_text.py:160
│ │ └─ 3.642 _create_hits Bio/SearchIO/HmmerIO/hmmer3_text.py:217
│ │ ├─ 2.055 _parse_aln_block Bio/SearchIO/HmmerIO/hmmer3_text.py:328
│ │ │ ├─ 0.995 search re.py:197
│ │ │ │ ├─ 0.530 Pattern.search ../<built-in>:0
│ │ │ │ └─ 0.356 _compile re.py:288
│ │ │ │ └─ 0.304 [self]
│ │ │ ├─ 0.404 [self]
│ │ │ └─ 0.225 _query_set Bio/SearchIO/_model/hsp.py:942
│ │ │ └─ 0.216 _set_seq Bio/SearchIO/_model/hsp.py:877
│ │ ├─ 0.479 __init__ Bio/SearchIO/_model/hsp.py:754
│ │ │ └─ 0.351 [self]
│ │ └─ 0.247 [self]
│ └─ 0.316 __init__ Bio/SearchIO/_model/query.py:182
│ └─ 0.313 append Bio/SearchIO/_model/query.py:449
│ └─ 0.227 __setitem__ Bio/SearchIO/_model/query.py:336
├─ 2.148 _add_segment_regions sadie/anarci/result.py:73
│ └─ 2.132 apply pandas/core/frame.py:8583
│ └─ 2.132 apply pandas/core/apply.py:655
│ └─ 2.127 apply_standard pandas/core/apply.py:811
│ ├─ 1.598 apply_series_generator pandas/core/apply.py:817
│ │ └─ 1.371 <lambda> sadie/anarci/result.py:98
│ │ └─ 1.371 _get_region sadie/anarci/result.py:45
│ │ └─ 0.977 __init__ pandas/core/series.py:315
│ │ └─ 0.873 _init_dict pandas/core/series.py:451
│ │ └─ 0.866 create_series_with_explicit_dtype pandas/core/construction.py:800
│ │ └─ 0.845 __init__ pandas/core/series.py:315
│ │ └─ 0.352 ensure_index pandas/core/indexes/base.py:6279
│ │ └─ 0.339 __new__ pandas/core/indexes/base.py:375
│ │ └─ 0.242 __new__ pandas/core/indexes/base.py:375
│ └─ 0.528 wrap_results pandas/core/apply.py:836
│ └─ 0.528 wrap_results_for_axis pandas/core/apply.py:971
│ └─ 0.528 infer_to_same_shape pandas/core/apply.py:992
│ └─ 0.525 __init__ sadie/anarci/result.py:12
│ └─ 0.525 __init__ pandas/core/frame.py:573
│ └─ 0.525 dict_to_mgr pandas/core/internals/construction.py:396
│ └─ 0.441 arrays_to_mgr pandas/core/internals/construction.py:100
│ └─ 0.242 _homogenize pandas/core/internals/construction.py:560
│ └─ 0.227 reindex pandas/core/series.py:4572
│ └─ 0.221 reindex pandas/core/generic.py:4571
├─ 0.886 parsed_output sadie/anarci/aa/_anarci.py:290
│ └─ 0.653 __setitem__ pandas/core/series.py:1054
│ └─ 0.636 __setitem__ pandas/core/indexing.py:713
│ └─ 0.586 _setitem_with_indexer pandas/core/indexing.py:1595
│ └─ 0.578 _setitem_with_indexer_missing pandas/core/indexing.py:1971
│ └─ 0.221 __init__ pandas/core/series.py:315
└─ 0.368 number_sequences_from_alignment sadie/anarci/aa/_anarci.py:838
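As a rough check on the two sets of profiles, the snippet below computes the time saved and the speedup ratio from the top-level totals reported above. The numbers are copied from the profiles; the helper name `summarize` is made up for this illustration.

```python
# Total wall-clock seconds copied from the top line of each profile above.
before = {"multi_on": 5.856, "multi_off": 35.126}
after = {"multi_on": 3.899, "multi_off": 21.448}


def summarize(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (seconds saved, speedup ratio) for one before/after pair."""
    return before_s - after_s, before_s / after_s


for mode in before:
    saved, speedup = summarize(before[mode], after[mode])
    print(f"{mode}: saved {saved:.3f} s, {speedup:.2f}x faster")
```

Note that these are single runs, so the totals include noise: number_sequences_from_alignment fell from 16.705 s to 0.368 s, while run_hmmer rose from 15.403 s to 17.977 s between the two multi-off profiles.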
Since IMGT is the default, no additional tests are needed; the results just have to match the existing ones.
Mark TODOs on what is a numbering scheme vs. a region scheme...
More to be added as I untangle the code.