Operations over cell data arrays can be vectorized

CarlinLiao commented 4 months ago

In the process of adapting the cell data arrays from the Provider to grab phenotypes quickly, I've discovered that operations over CellDataArrays.phenotype can be vectorized using numpy and still yield the same result. Numpy appears to support the same bitwise operations python3.12 does and can do so in a fraction of the runtime.

Consider the following snippet:

j = ComputationJobReference(-1, STUDY, sample)
p = CountsProvider(j)
arrays = p.get_cell_data_arrays()
features = tuple(n.symbol for n in arrays.feature_names.names)
signatures = {phenotype_name: tuple(map(lambda m: cast(int, p._compute_signature(m, features)),
                                        (phenotype.positive_markers, phenotype.negative_markers)))
              for phenotype_name, phenotype in phenotypes.items()}
df_sample_for = pd.DataFrame(arrays.location.T, index=arrays.identifiers, columns=['x', 'y'])
df_sample_vec = df_sample_for.copy()

for phenotype_name, (positives_mask, negatives_mask) in signatures.items():
    t0 = time()
    for i, _entry in enumerate(integers):
        entry = int(_entry)
        df_sample_for.loc[arrays.identifiers[i], phenotype_name] = \
            (entry | positives_mask == entry) and (~entry | negatives_mask == ~entry)
    print(f'  For loop runtime: {int((time() - t0) // 60)}m {int((time() - t0) % 60)}s')

    t0 = time()
    entry = arrays.phenotype.astype(int)
    df_sample_vec[phenotype_name] = (entry | positives_mask == entry) & (
        ~entry | negatives_mask == ~entry)
    print(f'Vectorized runtime: {int((time() - t0) // 60)}m {int((time() - t0) % 60)}s')
    break

print('Values are identical:', np.all(df_sample_for == df_sample_vec))
print(df_sample_for.head())

Output

  For loop runtime: 1m 39s
Vectorized runtime: 0m 0s
Values are identical: True
             x    y Hoechst
9729303  35460  510    True
9729304  35515  514   False
9729305  35440  519    True
9729306  35424  520    True
9729307  35357  529   False

jimmymathews commented 4 months ago

I think you're getting a lot of overhead in your "for loop" case from the Pandas DataFrame wrapper and assignment access by index. I added a comparison against a simple for loop, and also against the variant using python builtin map. The snippet seems to be missing some context (phenotypes, integers, imports), so I altered it slightly:

from typing import cast
from time import time

from pandas import DataFrame
from numpy import all as np_all
from numpy import array

from spatialprofilingtoolbox.ondemand.job_reference import ComputationJobReference
from spatialprofilingtoolbox.ondemand.providers.counts_provider import CountsProvider

def test_timing_counts():
    study = 'Melanoma intralesional IL2'
    sample = 'lesion 3_2'
    j = ComputationJobReference(-1, study, sample)
    p = CountsProvider(j)
    print(f'Pulling data arrays for {sample}.')
    arrays = p.get_cell_data_arrays()
    features = tuple(n.symbol for n in arrays.feature_names.names)
    positives_mask = CountsProvider._compute_signature(('CD3', 'CD4'), features)
    negatives_mask = CountsProvider._compute_signature(('B2M',), features)

    df_sample_for = DataFrame(arrays.location.T, index=arrays.identifiers, columns=['x', 'y'])
    df_sample_vec = df_sample_for.copy()

    print(f'Sample has {df_sample_for.shape[0]} cells.\n')

    t0 = time()
    for i, _entry in enumerate(arrays.phenotype):
        entry = int(_entry)
        df_sample_for.loc[arrays.identifiers[i], "anonymous phenotype"] = (entry | positives_mask == entry) and (~entry | negatives_mask == ~entry)
    t1 = time()
    difference = t1 - t0
    print(f'For loop runtime (pandas DataFrame): {difference}s')

    t0 = time()
    values0 = []
    entries = arrays.phenotype.astype(int)
    for entry in entries:
        values0.append((entry | positives_mask == entry) and (~entry | negatives_mask == ~entry))
    t1 = time()
    difference = t1 - t0
    print(f'          For loop runtime (direct): {difference}s')

    t0 = time()
    values = tuple(
        map(
            lambda entry: (entry | positives_mask == entry) and (~entry | negatives_mask == ~entry),
            map(int, arrays.phenotype)
        )
    )
    t1 = time()
    difference = t1 - t0
    print(f'          Loop runtime (direct map): {difference}s')

    t0 = time()
    entry = arrays.phenotype.astype(int)
    df_sample_vec["anonymous phenotype"] = (entry | positives_mask == entry) & (~entry | negatives_mask == ~entry)
    t1 = time()
    difference = t1 - t0
    print(f'      Implicitly vectorized runtime: {difference}s')

    compare_12 = np_all(df_sample_for == df_sample_vec)
    compare_13 = np_all(df_sample_for["anonymous phenotype"] == array(values0))
    compare_14 = np_all(df_sample_for["anonymous phenotype"] == array(values))
    print(f'Values are identical? {compare_12}; {compare_13}; {compare_14}')

if __name__=='__main__':
    test_timing_counts()

You can run this on the production database by prepending the credentials:

SINGLE_CELL_DATABASE_HOST=... \
SINGLE_CELL_DATABASE_USER=... \
SINGLE_CELL_DATABASE_PASSWORD=... \
python snippet.py

On my machine the output is:

Pulling data arrays for lesion 3_2.
Sample has 353809 cells.

For loop runtime (pandas DataFrame): 26.20669460296631s
          For loop runtime (direct): 0.0445859432220459s
          Loop runtime (direct map): 0.047461509704589844s
      Implicitly vectorized runtime: 0.003780841827392578s
Values are identical? True; True; True

In my run, "numpy vectorization" was faster than regular "for loop" or "map" by a factor of about 12, I suspect because it is using multiple cores. (My CPU has 24 cores).

In the production setup, we allocate vCPUs directly and use everything we allocate, and so we do not have access to this multi-core speedup trick.

Note that the numpy.vectorize documentation warns that:

The vectorize function is provided primarily for convenience, not for performance. The
implementation is essentially a for loop.

jimmymathews commented 4 months ago

I'm actually not finding much evidence that numpy's vectorization uses multiple cores. Rather, its speed is supposedly due to the use of precompiled C code for some low-level operation. If this is the case, this approximately 12 times speedup could be reproducible. I'll run more tests tomorrow.

jimmymathews commented 4 months ago

I tried to track down the source code for the C extension in numpy that performs this "vectorization", but numpy uses a pretty elaborate "Universal Function" abstraction layer (could be here). Rather than hoping for this performance improvement based on a guess at what numpy is doing, if we will take advantage of this possibility I think we should do so by writing our own C component for these tasks, to keep the runtime operations simple (and debuggable). Apparently a relatively simple way to do this is using ctypes rather than a C extension per se. Since we can pass a byte array, we can avoid any translation cost from Python object to C data type:

None, integers, bytes objects and (unicode) strings are the only native Python objects
that can directly be used as parameters in these function calls. None is passed as a C
NULL pointer, bytes objects and strings are passed as pointer to the memory block
that contains their data (char* or wchar_t*).

nadeemlab / SPT

Operations over cell data arrays can be vectorized #337