jwillis0720 / sadie

The Complete Python Antibody Library
MIT License

Anarci IMGT numbering Optimization #61

Closed tmsincomb closed 2 years ago

tmsincomb commented 2 years ago

Since IMGT is the default, no additional tests need to be written; just match the existing ones.

More to be added as I untangle the code.

tmsincomb commented 2 years ago
# Program: tests/unit/anarci/test_anarci.py

# Multi ON

5.856 <module>  test_anarci.py:1
└─ 5.853 test_anarci_multi_on  test_anarci.py:714
   └─ 5.849 run_multiple  sadie/anarci/anarci.py:461
      ├─ 5.584 map  multiprocessing/pool.py:359
      │  └─ 5.584 get  multiprocessing/pool.py:764
      │     └─ 5.584 wait  multiprocessing/pool.py:761
      │        └─ 5.584 wait  threading.py:556
      │           └─ 5.584 wait  threading.py:280
      │              └─ 5.584 lock.acquire  ../<built-in>:0
      └─ 0.190 join  multiprocessing/pool.py:656
         └─ 0.190 join  multiprocessing/process.py:142
            └─ 0.190 wait  multiprocessing/popen_fork.py:36
               └─ 0.190 poll  multiprocessing/popen_fork.py:24
                  └─ 0.190 waitpid  ../<built-in>:0

# Multi OFF

35.126 <module>  test_anarci.py:1
└─ 35.124 test_anarci_multi_off  test_anarci.py:728
   └─ 35.121 run_multiple  sadie/anarci/anarci.py:461
      └─ 35.106 _run  sadie/anarci/anarci.py:379
         ├─ 16.705 number_sequences_from_alignment  sadie/anarci/aa/_anarci.py:838
         │  └─ 16.426 run_germline_assignment  sadie/anarci/aa/_anarci.py:943
         │     └─ 16.130 get_identity  sadie/anarci/aa/_anarci.py:923
         │        ├─ 13.402 [self]  
         │        └─ 2.682 str.upper  ../<built-in>:0
         ├─ 15.403 run_hmmer  sadie/anarci/aa/_anarci.py:709
         │  ├─ 9.257 communicate  subprocess.py:1090
         │  │  └─ 9.257 _communicate  subprocess.py:1926
         │  │     └─ 9.257 select  selectors.py:403
         │  │        └─ 9.257 poll.poll  ../<built-in>:0
         │  └─ 6.112 parse_hmmer_output  sadie/anarci/aa/_anarci.py:685
         │     └─ 5.879 __iter__  Bio/SearchIO/HmmerIO/hmmer3_text.py:44
         │        └─ 5.879 _parse_qresult  Bio/SearchIO/HmmerIO/hmmer3_text.py:98
         │           ├─ 5.331 _parse_hit  Bio/SearchIO/HmmerIO/hmmer3_text.py:160
         │           │  └─ 5.155 _create_hits  Bio/SearchIO/HmmerIO/hmmer3_text.py:217
         │           │     ├─ 2.865 _parse_aln_block  Bio/SearchIO/HmmerIO/hmmer3_text.py:328
         │           │     │  ├─ 1.492 search  re.py:198
         │           │     │  │  ├─ 0.854 Pattern.search  ../<built-in>:0
         │           │     │  │  └─ 0.496 _compile  re.py:289
         │           │     │  │     └─ 0.413 [self]  
         │           │     │  └─ 0.483 [self]  
         │           │     ├─ 0.744 __init__  Bio/SearchIO/_model/hsp.py:754
         │           │     │  └─ 0.551 [self]  
         │           │     └─ 0.379 [self]  
         │           └─ 0.526 __init__  Bio/SearchIO/_model/query.py:182
         │              └─ 0.524 append  Bio/SearchIO/_model/query.py:449
         │                 └─ 0.383 __setitem__  Bio/SearchIO/_model/query.py:336
         ├─ 1.650 _add_segment_regions  sadie/anarci/result.py:73
         │  └─ 1.642 apply  pandas/core/frame.py:8676
         │     └─ 1.642 apply  pandas/core/apply.py:694
         │        └─ 1.636 apply_standard  pandas/core/apply.py:850
         │           └─ 1.408 apply_series_generator  pandas/core/apply.py:856
         │              └─ 1.248 <lambda>  sadie/anarci/result.py:98
         │                 └─ 1.246 _get_region  sadie/anarci/result.py:45
         │                    └─ 0.933 __init__  pandas/core/series.py:323
         │                       └─ 0.850 _init_dict  pandas/core/series.py:463
         │                          └─ 0.843 create_series_with_explicit_dtype  pandas/core/construction.py:822
         │                             └─ 0.826 __init__  pandas/core/series.py:323
         │                                └─ 0.509 ensure_index  pandas/core/indexes/base.py:6987
         │                                   └─ 0.503 _with_infer  pandas/core/indexes/base.py:672
         │                                      └─ 0.406 __new__  pandas/core/indexes/base.py:397
         └─ 1.284 parsed_output  sadie/anarci/aa/_anarci.py:290
            └─ 0.948 __setitem__  pandas/core/series.py:1072
               └─ 0.883 __setitem__  pandas/core/indexing.py:705
                  └─ 0.830 _setitem_with_indexer  pandas/core/indexing.py:1553
                     └─ 0.821 _setitem_with_indexer_missing  pandas/core/indexing.py:1941
tmsincomb commented 2 years ago

With multiprocessing off, about 16 s of the 35 s run over 1000 sequences is spent in a single function:

get_identity  sadie/anarci/aa/_anarci.py:923

The nested loops in _scheme.py are just a catch-all to handle lists and don't affect runtime. Practically, if we want a speedup, we should focus on run_germline_assignment and the hmmscan logic. We can remove the hmmscan time by switching to hmmsearch in a later merge.

tmsincomb commented 2 years ago

I tried a few optimizations, but since the original code for get_identity is already plenty fast (~50 µs per call), I added a cache to get_identity instead. Since state_sequence and germline_sequence can only take a finite number of values, the cache overhead is dwarfed by the 10+ second reduction in runtime with multiprocessing off.
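
For anyone following along, here is a rough sketch of what caching an identity function can look like. It is not the code in sadie/anarci/aa/_anarci.py: the function name get_identity_cached and the identity calculation are hypothetical stand-ins, and only the argument names state_sequence and germline_sequence come from the comment above. It assumes both arguments are plain strings, since functools.lru_cache needs hashable arguments.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache: assumes the set of distinct pairs is small
def get_identity_cached(state_sequence: str, germline_sequence: str) -> float:
    """Fraction of compared positions at which the two aligned sequences match.

    Hypothetical stand-in for get_identity; the real calculation may differ.
    Positions where either sequence has a gap ("-") are skipped.
    """
    if len(state_sequence) != len(germline_sequence):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    for a, b in zip(state_sequence.upper(), germline_sequence.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        matches += a == b
    return matches / compared if compared else 0.0


# The second call repeats the first pair, so it is served from the cache.
pairs = [("QVQLVQ", "QVQLVE"), ("QVQLVQ", "QVQLVE"), ("EVQ-VQ", "EVQLVQ")]
scores = [get_identity_cached(s, g) for s, g in pairs]
info = get_identity_cached.cache_info()
print(scores, info)
```

cache_info() reports hits and misses, which is a quick way to check that repeated pairs really are being reused. Note that with multiprocessing on, each worker process would hold its own copy of such a cache.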

# Program: tests/unit/anarci/test_anarci.py

# Multi ON

3.899 <module>  test_anarci.py:1
└─ 3.897 benchmark_anarci_multi_on  test_anarci.py:714
   └─ 3.890 run_multiple  sadie/anarci/anarci.py:461
      ├─ 3.666 map  multiprocessing/pool.py:359
      │  └─ 3.666 get  multiprocessing/pool.py:764
      │     └─ 3.666 wait  multiprocessing/pool.py:761
      │        └─ 3.666 wait  threading.py:582
      │           └─ 3.666 wait  threading.py:288
      │              └─ 3.666 lock.acquire  ../<built-in>:0
      ├─ 0.141 join  multiprocessing/pool.py:656
      │  └─ 0.141 join  multiprocessing/process.py:142
      │     └─ 0.141 wait  multiprocessing/popen_fork.py:36
      │        └─ 0.141 poll  multiprocessing/popen_fork.py:24
      │           └─ 0.141 waitpid  ../<built-in>:0
      └─ 0.043 Pool  multiprocessing/context.py:115

# Multi OFF

21.448 <module>  test_anarci.py:1
└─ 21.448 benchmark_anarci_multi_off  test_anarci.py:757
   └─ 21.439 run_multiple  sadie/anarci/anarci.py:461
      └─ 21.428 _run  sadie/anarci/anarci.py:379
         ├─ 17.977 run_hmmer  sadie/anarci/aa/_anarci.py:709
         │  ├─ 13.666 communicate  subprocess.py:1105
         │  │  └─ 13.666 _communicate  subprocess.py:1947
         │  │     └─ 13.666 select  selectors.py:403
         │  │        └─ 13.666 poll.poll  ../<built-in>:0
         │  └─ 4.290 parse_hmmer_output  sadie/anarci/aa/_anarci.py:685
         │     └─ 4.094 __iter__  Bio/SearchIO/HmmerIO/hmmer3_text.py:44
         │        └─ 4.093 _parse_qresult  Bio/SearchIO/HmmerIO/hmmer3_text.py:98
         │           ├─ 3.765 _parse_hit  Bio/SearchIO/HmmerIO/hmmer3_text.py:160
         │           │  └─ 3.642 _create_hits  Bio/SearchIO/HmmerIO/hmmer3_text.py:217
         │           │     ├─ 2.055 _parse_aln_block  Bio/SearchIO/HmmerIO/hmmer3_text.py:328
         │           │     │  ├─ 0.995 search  re.py:197
         │           │     │  │  ├─ 0.530 Pattern.search  ../<built-in>:0
         │           │     │  │  └─ 0.356 _compile  re.py:288
         │           │     │  │     └─ 0.304 [self]  
         │           │     │  ├─ 0.404 [self]  
         │           │     │  └─ 0.225 _query_set  Bio/SearchIO/_model/hsp.py:942
         │           │     │     └─ 0.216 _set_seq  Bio/SearchIO/_model/hsp.py:877
         │           │     ├─ 0.479 __init__  Bio/SearchIO/_model/hsp.py:754
         │           │     │  └─ 0.351 [self]  
         │           │     └─ 0.247 [self]  
         │           └─ 0.316 __init__  Bio/SearchIO/_model/query.py:182
         │              └─ 0.313 append  Bio/SearchIO/_model/query.py:449
         │                 └─ 0.227 __setitem__  Bio/SearchIO/_model/query.py:336
         ├─ 2.148 _add_segment_regions  sadie/anarci/result.py:73
         │  └─ 2.132 apply  pandas/core/frame.py:8583
         │     └─ 2.132 apply  pandas/core/apply.py:655
         │        └─ 2.127 apply_standard  pandas/core/apply.py:811
         │           ├─ 1.598 apply_series_generator  pandas/core/apply.py:817
         │           │  └─ 1.371 <lambda>  sadie/anarci/result.py:98
         │           │     └─ 1.371 _get_region  sadie/anarci/result.py:45
         │           │        └─ 0.977 __init__  pandas/core/series.py:315
         │           │           └─ 0.873 _init_dict  pandas/core/series.py:451
         │           │              └─ 0.866 create_series_with_explicit_dtype  pandas/core/construction.py:800
         │           │                 └─ 0.845 __init__  pandas/core/series.py:315
         │           │                    └─ 0.352 ensure_index  pandas/core/indexes/base.py:6279
         │           │                       └─ 0.339 __new__  pandas/core/indexes/base.py:375
         │           │                          └─ 0.242 __new__  pandas/core/indexes/base.py:375
         │           └─ 0.528 wrap_results  pandas/core/apply.py:836
         │              └─ 0.528 wrap_results_for_axis  pandas/core/apply.py:971
         │                 └─ 0.528 infer_to_same_shape  pandas/core/apply.py:992
         │                    └─ 0.525 __init__  sadie/anarci/result.py:12
         │                       └─ 0.525 __init__  pandas/core/frame.py:573
         │                          └─ 0.525 dict_to_mgr  pandas/core/internals/construction.py:396
         │                             └─ 0.441 arrays_to_mgr  pandas/core/internals/construction.py:100
         │                                └─ 0.242 _homogenize  pandas/core/internals/construction.py:560
         │                                   └─ 0.227 reindex  pandas/core/series.py:4572
         │                                      └─ 0.221 reindex  pandas/core/generic.py:4571
         ├─ 0.886 parsed_output  sadie/anarci/aa/_anarci.py:290
         │  └─ 0.653 __setitem__  pandas/core/series.py:1054
         │     └─ 0.636 __setitem__  pandas/core/indexing.py:713
         │        └─ 0.586 _setitem_with_indexer  pandas/core/indexing.py:1595
         │           └─ 0.578 _setitem_with_indexer_missing  pandas/core/indexing.py:1971
         │              └─ 0.221 __init__  pandas/core/series.py:315
         └─ 0.368 number_sequences_from_alignment  sadie/anarci/aa/_anarci.py:838