Closed: CarlinLiao closed this 3 months ago
I think you're getting a lot of overhead in your "for loop" case from the Pandas DataFrame wrapper and from per-row assignment by index label.
I added a comparison against a simple for loop, and also against the variant using the Python builtin map.
The snippet seems to be missing some context (phenotypes, integers, imports), so I altered it slightly:
```python
from typing import cast
from time import time

from pandas import DataFrame
from numpy import all as np_all
from numpy import array

from spatialprofilingtoolbox.ondemand.job_reference import ComputationJobReference
from spatialprofilingtoolbox.ondemand.providers.counts_provider import CountsProvider


def test_timing_counts():
    study = 'Melanoma intralesional IL2'
    sample = 'lesion 3_2'
    j = ComputationJobReference(-1, study, sample)
    p = CountsProvider(j)
    print(f'Pulling data arrays for {sample}.')
    arrays = p.get_cell_data_arrays()
    features = tuple(n.symbol for n in arrays.feature_names.names)
    positives_mask = CountsProvider._compute_signature(('CD3', 'CD4'), features)
    negatives_mask = CountsProvider._compute_signature(('B2M',), features)
    df_sample_for = DataFrame(arrays.location.T, index=arrays.identifiers, columns=['x', 'y'])
    df_sample_vec = df_sample_for.copy()
    print(f'Sample has {df_sample_for.shape[0]} cells.\n')

    # Variant 1: for loop with per-row assignment into the pandas DataFrame by label.
    t0 = time()
    for i, _entry in enumerate(arrays.phenotype):
        entry = int(_entry)
        df_sample_for.loc[arrays.identifiers[i], "anonymous phenotype"] = (entry | positives_mask == entry) and (~entry | negatives_mask == ~entry)
    t1 = time()
    difference = t1 - t0
    print(f'For loop runtime (pandas DataFrame): {difference}s')

    # Variant 2: for loop accumulating results in a plain Python list.
    t0 = time()
    values0 = []
    entries = arrays.phenotype.astype(int)
    for entry in entries:
        values0.append((entry | positives_mask == entry) and (~entry | negatives_mask == ~entry))
    t1 = time()
    difference = t1 - t0
    print(f' For loop runtime (direct): {difference}s')

    # Variant 3: builtin map over Python ints.
    t0 = time()
    values = tuple(
        map(
            lambda entry: (entry | positives_mask == entry) and (~entry | negatives_mask == ~entry),
            map(int, arrays.phenotype)
        )
    )
    t1 = time()
    difference = t1 - t0
    print(f' Loop runtime (direct map): {difference}s')

    # Variant 4: elementwise numpy expression over the whole phenotype array.
    t0 = time()
    entry = arrays.phenotype.astype(int)
    df_sample_vec["anonymous phenotype"] = (entry | positives_mask == entry) & (~entry | negatives_mask == ~entry)
    t1 = time()
    difference = t1 - t0
    print(f' Implicitly vectorized runtime: {difference}s')

    # Check that all four variants produce identical results.
    compare_12 = np_all(df_sample_for == df_sample_vec)
    compare_13 = np_all(df_sample_for["anonymous phenotype"] == array(values0))
    compare_14 = np_all(df_sample_for["anonymous phenotype"] == array(values))
    print(f'Values are identical? {compare_12}; {compare_13}; {compare_14}')


if __name__ == '__main__':
    test_timing_counts()
```
You can run this on the production database by prepending the credentials:
```sh
SINGLE_CELL_DATABASE_HOST=... \
SINGLE_CELL_DATABASE_USER=... \
SINGLE_CELL_DATABASE_PASSWORD=... \
python snippet.py
```
On my machine the output is:
```
Pulling data arrays for lesion 3_2.
Sample has 353809 cells.
For loop runtime (pandas DataFrame): 26.20669460296631s
For loop runtime (direct): 0.0445859432220459s
Loop runtime (direct map): 0.047461509704589844s
Implicitly vectorized runtime: 0.003780841827392578s
Values are identical? True; True; True
```
In my run, "numpy vectorization" was faster than regular "for loop" or "map" by a factor of about 12, I suspect because it is using multiple cores. (My CPU has 24 cores).
In the production setup, we allocate vCPUs directly and use everything we allocate, and so we do not have access to this multi-core speedup trick.
Note that the numpy.vectorize documentation warns that:

> The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
I'm actually not finding much evidence that numpy's vectorization uses multiple cores. Rather, its speed is supposedly due to precompiled C loops for the low-level elementwise operations. If that is the case, the roughly 12x speedup should be reproducible in production. I'll run more tests tomorrow.
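As a reference point, here is a minimal, self-contained benchmark sketch on synthetic data (no database access needed) that isolates the pure-Python loop versus the numpy elementwise expression. The array size, random codes, and masks are made up for illustration and are not taken from the production study; the idea is just to check whether the gap persists independent of our setup.

```python
from time import time

import numpy as np

# Synthetic stand-ins: size, codes, and masks are made up for illustration.
rng = np.random.default_rng(0)
phenotypes = rng.integers(0, 2**16, size=350_000)  # int64 array of phenotype codes
positives_mask = 0b0011   # hypothetical "must be set" bits
negatives_mask = 0b0100   # hypothetical "must be clear" bits

# Plain Python loop over ints.
t0 = time()
loop_result = [
    (e | positives_mask == e) and (~e | negatives_mask == ~e)
    for e in map(int, phenotypes)
]
print(f'Python loop:       {time() - t0:.4f}s')

# Elementwise numpy expression over the whole array.
t0 = time()
numpy_result = (phenotypes | positives_mask == phenotypes) & (~phenotypes | negatives_mask == ~phenotypes)
print(f'numpy elementwise: {time() - t0:.4f}s')

# The two variants should agree on every cell.
print('Identical?', bool(np.all(np.array(loop_result) == numpy_result)))
```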
I tried to track down the source code for the C extension in numpy that performs this "vectorization", but numpy uses a pretty elaborate "Universal Function" abstraction layer (could be here). Rather than hoping for this performance improvement based on a guess at what numpy is doing, if we want to take advantage of this possibility I think we should do so by writing our own C component for these tasks, to keep the runtime operations simple (and debuggable).
Apparently a relatively simple way to do this is using ctypes rather than a C extension per se. Since we can pass a byte array, we can avoid any translation cost from Python objects to C data types:
> None, integers, bytes objects and (unicode) strings are the only native Python objects that can directly be used as parameters in these function calls. None is passed as a C NULL pointer, bytes objects and strings are passed as pointer to the memory block that contains their data (char* or wchar_t*).
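To make that concrete, here is a rough sketch of what the Python side could look like with ctypes. Everything specific in it is hypothetical: the library name libsignature.so, the function name signature_matches, and its argument layout are assumptions about a small C routine we would write ourselves, one that reads the phenotype codes as 64-bit integers from an input buffer and writes one byte per cell into an output buffer.

```python
import ctypes

import numpy as np

# Hypothetical shared library; the C side would implement something like:
#   void signature_matches(const char *codes, size_t n,
#                          uint64_t positives, uint64_t negatives, char *out);
# reading the input buffer as uint64_t codes and writing 0/1 bytes into out.
lib = ctypes.CDLL('./libsignature.so')
lib.signature_matches.argtypes = [
    ctypes.c_char_p,                 # phenotype codes, passed as a raw bytes buffer
    ctypes.c_size_t,                 # number of cells
    ctypes.c_uint64,                 # positives mask
    ctypes.c_uint64,                 # negatives mask
    ctypes.POINTER(ctypes.c_char),   # output buffer, one byte per cell
]
lib.signature_matches.restype = None


def signature_matches(phenotypes: np.ndarray, positives: int, negatives: int) -> np.ndarray:
    codes = np.ascontiguousarray(phenotypes, dtype=np.uint64)
    out = ctypes.create_string_buffer(codes.size)
    # The bytes object is passed as a pointer to its memory block, so there is
    # no per-element conversion from Python objects to C data types.
    lib.signature_matches(codes.tobytes(), codes.size, positives, negatives, out)
    return np.frombuffer(out, dtype=np.uint8, count=codes.size).astype(bool)
```

The C routine itself would then just be a plain loop applying the two mask comparisons, which keeps the runtime operation simple and debuggable, in line with the goal above.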
In the process of adapting the cell data arrays from the Provider to grab phenotypes quickly, I've discovered that operations over CellDataArrays.phenotype can be vectorized using numpy and still yield the same result. Numpy appears to support the same bitwise operations python3.12 does, and can do so in a fraction of the runtime. Consider the following snippet:
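A minimal, standalone illustration of this equivalence, using a made-up code and masks rather than real study data, would look like the following; the point is that the integer bitwise operators behave the same on Python ints and numpy integer arrays.

```python
import numpy as np

e_py = 0b1011                                 # a phenotype code as a Python int
e_np = np.array([0b1011], dtype=np.int64)     # the same code in a numpy array
positives, negatives = 0b0011, 0b0100         # made-up masks

# Python int bitwise logic.
py = (e_py | positives == e_py) and (~e_py | negatives == ~e_py)
# Equivalent elementwise numpy logic.
nump = bool(((e_np | positives == e_np) & (~e_np | negatives == ~e_np))[0])

print(py, nump)  # both True for this code
```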