scverse / scirpy

A scanpy extension to analyse single-cell TCR and BCR data.
https://scirpy.scverse.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Improve parallelism of ir_dist #473

Closed. grst closed this 10 months ago.

grst commented 10 months ago

Closes #468

codecov[bot] commented 10 months ago

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (bd168a9) 80.52% compared to head (5255317) 80.50%.

| Files | Patch % | Lines |
|---|---|---|
| src/scirpy/util/__init__.py | 62.50% | 3 Missing :warning: |
| src/scirpy/ir_dist/metrics.py | 91.66% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #473      +/-   ##
==========================================
- Coverage   80.52%   80.50%   -0.02%
==========================================
  Files          49       49
  Lines        3994     4012      +18
==========================================
+ Hits         3216     3230      +14
- Misses        778      782       +4
```

:umbrella: View full report in Codecov by Sentry.

grst commented 10 months ago

In general, this works well (including multi-machine computing with dask), but:

grst commented 10 months ago

For some reason it takes ~15 seconds to spawn the loky worker processes. When rerunning the function, it is instant. Not sure what the workers are doing initially.

The reason was that the worker processes all import scirpy from scratch. The import turned out to be quite slow because of `import airr`. I now defer importing the airr package until it is needed (only in some IO functions), which brought the import time down to ~1s. That is acceptable and mostly due to scanpy.
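For illustration, here is a minimal sketch of the deferred-import pattern described above; the helper name and body are assumptions, not scirpy's actual code:

```python
# Sketch of a deferred (lazy) import: the expensive `import airr` moves
# from module level into the only functions that need it, so worker
# processes importing scirpy never pay for it.


def read_airr(path):
    """Read an AIRR rearrangement file (hypothetical helper)."""
    import airr  # deferred: only executed when this IO function is called

    return airr.read_rearrangement(path)
```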

grst commented 10 months ago

@felixpetschko fyi

In this PR, I change how the ParallelDistanceCalculator handles parallel processing. It now uses joblib.Parallel, which (among other things) allows using dask as a backend for multi-machine computing.
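As a minimal sketch of that pattern (the chunking and the worker function here are assumptions for illustration, not scirpy's actual code):

```python
import joblib


def compute_block(chunk):
    # Stand-in for one block of the pairwise distance computation.
    return [len(seq) for seq in chunk]


chunks = [["CASSLG", "CASSLD"], ["CAWSVG", "CASSYE"]]

# Local multi-process execution with joblib's default "loky" backend:
results = joblib.Parallel(n_jobs=2)(
    joblib.delayed(compute_block)(chunk) for chunk in chunks
)

# The same call can be dispatched to a dask cluster instead; importing
# dask.distributed registers the "dask" joblib backend:
#
# from dask.distributed import Client
# client = Client("tcp://scheduler:8786")  # hypothetical scheduler address
# with joblib.parallel_backend("dask"):
#     results = joblib.Parallel()(
#         joblib.delayed(compute_block)(chunk) for chunk in chunks
#     )
```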

But I think your reimplementation of the hamming distance calculator doesn't need this anyway.