CarlinLiao closed this 1 year ago.
After doing some digging, at least one major error lies with OnDemandProvider. CompressedMatrixWriter has been refactored to save the compressed expression integers as a dict mapping histological_structure ID ints to those integers, whereas it originally saved just an undecorated list of integers. OnDemandProvider still expects the latter format when reading in the compressed matrices. Fixing this to handle the new format should resolve the issue, although I'm finding it tricky to do so because this bytes expression is difficult for me to understand. @jimmymathews Could you take it from here?
I think it'd be something like replacing file.read(8) with something like file.readline(), assuming the JSON output puts the ID and expression integer on the same line, splitting on b',', and then doing the join/list/reversed/bin/rjust operation on the second entry. (And also saving the structure ID for use as the new index.)
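For reference, the join/list/reversed/bin/rjust step I'm referring to decodes one of those packed integers back into per-channel flags, roughly like this (an illustrative sketch assuming 26 channels as in the test data, not the exact code in OnDemandProvider):

def decode_expression(value: int, number_channels: int = 26) -> list[int]:
    # Unpack a packed phenotype integer into 0/1 flags, least significant bit
    # first, so channel 0 corresponds to bit 0.
    bits = bin(value)[2:].rjust(number_channels, '0')
    return [int(bit) for bit in reversed(bits)]

decode_expression(5)[:3]  # [1, 0, 1], i.e. channels 0 and 2 are expressed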
Thanks for implementing these improvements and doing speed testing. I'm actually a little surprised there is a speedup in moving to dataframe vectorization, because the previous implementation was flat list iteration with no indexing overhead.
But then again, you did not do speed testing for the actual access operations (like count_structures_of_partial_signed_signature), correct? (Since, as you indicated, it does not yet run.) Just the FeatureMatrixExtractor part? This is just data structure creation, so I wonder what makes it 30% faster?
The original purpose of CompressedDataArrays was to reduce the memory footprint of the static file format, i.e. compression. I think saving each int together with a string identifier (they are not ints) will more than double the memory usage. This might be OK because it is not very large in absolute terms.
I'm not completely convinced of the overall advantage of abandoning the flat array of ints as the filesystem data structure. You can accomplish your goal of constructing the indexed Pandas dataframe by saving an additional file artifact, an array of the string identifiers. The performance is probably comparable, so what justifies the refactoring effort? The already-implemented ondemand services do not require the functionality you're proposing, and may even slow down by being forced to use it. (At some point Pandas must iterate over values for all-rows operations so it should be strictly slower to operate on the dataframe unless Pandas silently parallelizes, but I don't think it does this without being requested to do so.)
I think you may be thinking of the performance advantage of Pandas vector operations over Pandas row-wise operations, which is probably high in typical cases. Hence the usual "don't use iterrows" advice. However, there is no reason to expect Pandas vector operations to be faster than primitive data structure operations.
* Unless there is some evidence of actual hardware vectorization (SIMD in NumPy, Stack Overflow)... after poking around it seems that it might not really be implemented in NumPy or Pandas for typical uses.
The snippet below shows Pandas vectorization to be about 10 times faster on my machine than primitive data structure for adding two columns with 1 million rows:
import time
import random
from pandas import DataFrame

def performance_timer(func):
    def wrapper(argument):
        start = time.time()
        func(argument)
        end = time.time()
        print(f'{end-start} seconds for {func.__name__}.')
    return wrapper

@performance_timer
def primitive_adding(_rows):
    C = [row[0] + row[1] for row in _rows]

@performance_timer
def vector_pandas_adding(_df):
    _df['C'] = _df['A'] + _df['B']

size = pow(10, 6)
rows = [(random.choice([1.5, 2, 3]), random.choice([5, 6, 7.5])) for i in range(size)]
df = DataFrame(rows, columns=['A', 'B'])
primitive_adding(rows)
vector_pandas_adding(df)
0.07324385643005371 seconds for primitive_adding.
0.007153034210205078 seconds for vector_pandas_adding.
So I stand corrected: the ondemand operations which can be reduced to arithmetic (I believe the counts service can) should be able to gain a lot from this implementation.
A little more testing shows that this advantage kicks in at around 10k rows.
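If it's useful, the crossover can be scanned for by reusing the two timing functions from the snippet above (results will vary by machine; 10k is just what I observed):

for exponent in range(3, 7):
    size = pow(10, exponent)
    rows = [(random.choice([1.5, 2, 3]), random.choice([5, 6, 7.5])) for i in range(size)]
    df = DataFrame(rows, columns=['A', 'B'])
    print(f'--- {size} rows ---')
    primitive_adding(rows)
    vector_pandas_adding(df)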
It is pretty easy to update the centroid.pickle artifact because the loader recreates it automatically if this file is not present.
Currently I'm addressing an issue where the centroids are all NaN except for the first sample, from the point of view of the Providers.
spt-ondemand-testing | lesion 0_2
spt-ondemand-testing | B2M B7H3 CD68 CD8 DAPI FOXP3 IDO1 KI67 LAG3 MHCI MHCII MRC1 CD14 PD1 PDL1 S100B SOX10 TGM2 TIM3 CD163 CD20 CD25 CD27 CD3 CD4 CD56 integer pixel x pixel y
spt-ondemand-testing | 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 NaN NaN
spt-ondemand-testing | 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 101 NaN NaN
spt-ondemand-testing | 2 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 102 NaN NaN
spt-ondemand-testing | 3 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 103 NaN NaN
...
Also the "integer" column looks like it's just an incrementing row index, not the int representation of the phenotypes data.
I think the binary expression files are written with trivial incrementing index ints. Here are the first few phenotypes expression values after decoding:
spt-ondemand-testing | lesion 0_1
spt-ondemand-testing | B2M B7H3 CD68 CD8 DAPI FOXP3 IDO1 KI67 LAG3 MHCI MHCII MRC1 CD14 PD1 PDL1 S100B SOX10 TGM2 TIM3 CD163 CD20 CD25 CD27 CD3 CD4 CD56 integer pixel x pixel y
spt-ondemand-testing | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4935.0 12.0
spt-ondemand-testing | 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 5290.0 6.0
spt-ondemand-testing | 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 5320.0 59.0
spt-ondemand-testing | 23 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4859.0 109.0
spt-ondemand-testing | 34 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 5135.0 127.0
spt-ondemand-testing | 45 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5142.0 153.0
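For reference, the decoded rows above are consistent with simple least-significant-bit-first packing, along the lines of this sketch (channel order assumed to match the column order shown):

def encode_expression(channel_flags: list[int]) -> int:
    # Pack 0/1 flags into one integer with channel 0 in bit 0, so B2M-only -> 1,
    # B7H3-only -> 2, B2M+B7H3 -> 3, CD68-only -> 4, matching the table above.
    return sum(flag << position for position, flag in enumerate(channel_flags))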
Also, the dataframe in the provider class doesn't seem to actually implement the indexing on the histological_structure identifier that you proposed.
Do you mean the read-in of the artifact in Provider? I haven't implemented that because I wasn't able to understand the functionality, see here.
No, the creation of the artifact, which happens automatically on first use assuming these files are not all already present.
The basic reason the vectorized implementation is faster, as I understand it, is that plain Python, being meant as a glue language that does a lot of convenient things like allocating memory for you, is slow, while pandas, which uses C, the low-level performance language that allows finer control of memory, is fast. (pandas uses numpy for holding data.)
As for artifact creation, that could also be chalked up to me not fully understanding the startup-to-runtime workflow. Do you mean in OnDemandProvider or CompressedMatrixWriter?
Yeah, these remarks are about expression_data_array.0.1.bin etc. These are saved as cache files to speed loading of the services. They are saved by CompressedMatrixWriter. In the snippet you reference, the loop for entry in data_array is no longer the right thing when data_array becomes a dict, because the values are ignored. I'm fixing this up and related issues.
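The shape of the fix, as I understand it (names approximate), is to iterate over items rather than over the dict itself, since iterating a dict yields only its keys:

# Old, when data_array was a flat list of packed ints:
#     for entry in data_array:
#         ...
# New, when data_array maps histological_structure ID to the packed int:
for structure_id, entry in data_array.items():
    ...  # use `entry` as before, and keep `structure_id` for the index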
The lack of indexing of the dataframes on histological structure ID was preventing your new assignment line ... = DataFrame.from_dict(... centroid stuff) from taking effect, causing all those NaNs.
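A toy illustration of the pandas alignment behavior behind those NaNs (made-up values): column assignment aligns on index labels, so a Series carrying a reset 0-based index contributes nothing to a frame indexed by structure IDs.

from pandas import DataFrame, Series

expression = DataFrame({'CD8': [1, 0, 1]}, index=[100, 101, 102])  # structure IDs
centroids = Series([4935.0, 5290.0, 5320.0], index=[0, 1, 2])      # reset index
expression['pixel x'] = centroids
print(expression['pixel x'])  # all NaN: labels 100-102 never match 0-2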
Oh I see. Every 8 bytes is a new entry, because each int is 64 bits so we don't need to think about newlines or anything.
Yes. (Commit message above should say "first 8 bytes" not "first byte")
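Concretely, the read loop would then be something like the following (a sketch; the byte order and exact record layout are my assumptions, not the actual CompressedMatrixWriter format):

values = []
with open('expression_data_array.0.1.bin', 'rb') as file:
    while True:
        chunk = file.read(8)        # one 64-bit integer per entry
        if len(chunk) < 8:
            break
        values.append(int.from_bytes(chunk, 'little'))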
Are all tests passing for you? I'm seeing some errors in fast_cache_assessment that don't affect whether or not the test passes, but I don't think they're intended.
[ ERROR ] ondemand.fast_cache_assessor:214: Study "Breast cancer IMC - measurement" not mentioned in centroids file.
And apiserver proximity module test fails. Investigating...
compute_proximity_metric_for_signature_pair uses BallTree, which looks only at the underlying numpy array (no index use), and sends back results with 0-based integer indices that need to be translated back to our histological structure IDs.
The test you are referring to includes a test of the error-generating capability in case some files are missing. Those error messages are good; they mean that the deficient condition can be detected.
I think it should be as easy as switching it from a bare index reference to iloc, i.e. numerical indexing?
sum(mask2.iloc[index] for index in list(indices))
I don't think so, because the indices are defined in reference to the argument to BallTree.query_radius, which is only a subset of the cells DataFrame. Actually, this is incorrect: the indices are defined in reference to the original data provided to the BallTree constructor, so I think this is right.
I made the iloc change locally and the test passed.
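For anyone following along, a sketch of the indexing subtlety (cells, mask1, mask2, and the radius are illustrative names, not the actual code): BallTree only ever sees the raw coordinate array it was constructed from, so query_radius returns 0-based positions into that array, which iloc interprets correctly and label-based lookup would not.

from sklearn.neighbors import BallTree

tree = BallTree(cells[['pixel x', 'pixel y']].to_numpy())
subset = cells[mask1]  # query with only a subset of the cells
neighbor_lists = tree.query_radius(subset[['pixel x', 'pixel y']].to_numpy(), r=30)
for neighbors in neighbor_lists:
    # positions refer to the constructor data (all of `cells`), not the subset
    count = sum(mask2.iloc[index] for index in list(neighbors))
    structure_ids = cells.index[neighbors]  # translate positions back to IDs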
All tests pass on my end. I think we're ready to merge.
I do have some ideas for more acceleration methods... if we could find some way to vectorize shapefile centroid calculation we'd be in business! Also, I wonder if using DataFrame.from_sql would be faster than using a cursor to make the query and then creating a DataFrame from the output. And you mentioned earlier that there are places to vectorize inside the Providers too?
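On the DataFrame.from_sql idea: the pandas entry point is actually pandas.read_sql, and a minimal sketch of the comparison would look like this (connection, query, and column names are illustrative; pandas prefers a SQLAlchemy engine but accepts a DBAPI connection with a warning):

import pandas as pd

# Current pattern: cursor fetch, then build the DataFrame from the rows.
with connection.cursor() as cursor:
    cursor.execute('SELECT histological_structure, value FROM some_table ;')
    rows = cursor.fetchall()
df_from_cursor = pd.DataFrame(rows, columns=['histological_structure', 'value'])

# Candidate pattern: let pandas build the DataFrame directly from the query.
df_from_read_sql = pd.read_sql(
    'SELECT histological_structure, value FROM some_table ;',
    connection,
    index_col='histological_structure',
)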
My original thought about shapefiles back when I first wrote it was to cache a binary template concocted from the output of the shapefile library, since only a couple of numerical values change from shapefile to shapefile. This could then be used for reading and writing really fast.
I think the more reasonable and easier thing is to add a performance_tweaks.sql table that just lists structure ID, pixel x, and pixel y, populated during data import, leaving the shapefiles around for the future's automated reasoners.
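A rough sketch of what I mean by that table (table and column names are illustrative, and `cursor`/`centroids` stand in for whatever the import workflow has in hand):

cursor.execute('''
    CREATE TABLE IF NOT EXISTS performance_tweaks (
        histological_structure VARCHAR PRIMARY KEY,
        pixel_x NUMERIC,
        pixel_y NUMERIC
    ) ;
''')
cursor.executemany(
    'INSERT INTO performance_tweaks VALUES (%s, %s, %s) ;',
    [(structure_id, x, y) for structure_id, (x, y) in centroids.items()],
)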
Now that the tests are passing I'm going to actually review the changes. We should consider also adding a test that touches upon the exact thing that the changes do.
>>> import numpy as np
>>> import numpy.typing as npt
>>> print(npt.NDArray)
numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]]
>>> print(npt.NDArray[np.float64])
numpy.ndarray[typing.Any, numpy.dtype[numpy.float64]]
>>> NDArrayInt = npt.NDArray[np.int_]
>>> a: NDArrayInt = np.arange(10)
>>> from typing import Any
>>> def func(a: npt.ArrayLike) -> npt.NDArray[Any]:
...     return np.array(a)
I took a crack at implementing a test, but it fails. I'm not sure if the failure is because of the test I wrote or the functionality of FME/SparseMatrixPuller.
My thought process:
* histological_structure IDs are assigned in order, e.g., the last cell in 0.csv has ID 99 or 100, and the first cell in the next specimen, 3.csv, has ID 100 or 101.
* The IDs may be stored as str/VARCHAR instead of int as the Python process handles them, but that doesn't work either.
How are the IDs stored? @jimmymathews
Directly investigating using print statements, '1', '5', '150', and '151' are all valid IDs, so I'm thinking it might be something with the way psycopg translates my tuple list of IDs to check for into the query... it seems to do more than just fill in the %s values. At the very least, it surrounds them with quotation marks.
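One way to see exactly what psycopg2 is sending (query and column names are illustrative): cursor.mogrify returns the query after parameter adaptation, and a Python tuple of strings gets rendered as a quoted, parenthesized list.

ids = ('1', '5', '150', '151')
print(cursor.mogrify(
    'SELECT * FROM histological_structure WHERE identifier IN %s ;',
    (ids,),
))
# b"SELECT * FROM histological_structure WHERE identifier IN ('1', '5', '150', '151') ;"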
By the way, I think I've asked this before but why are we using psycopg2 and not psycopg3?
All tests, including the new one checking if selection by cell ID works, pass.
Looks good, ready to merge.
This PR makes two major changes:
* SparseMatrixPuller is refactored from vanilla Python for-looping, used to create the compressed channel expression, to vectorized pandas DataFrame and numpy array operations, with the goal of speeding up calculation. This is partially pursuant to #175.
* Support is added for selecting specific histological_structure IDs when pulling from SparseMatrixPuller.

In addition to the above, I made a minor change to enable the major ones:
* SparseMatrixPuller and StructureCentroidPuller are refactored to index their output by histological_structure ID instead of a reset ID. This isn't strictly necessary, but I felt that this was a good way to ensure that the centroids match the expression arrays instead of relying solely on the order staying static. This is also used for functionality in #211.

This PR brings something like a 30% speedup to the FME process, at the potential cost of more RAM and disk space used, due to having to sustain DataFrames instead of vanilla Python arrays and indexing by histological_structure ID, respectively.
* feature matrix extraction had similar runtimes, 2s to 2s.
* Melanoma CyTOF ICI improved from 4m21s to 3m25s
* Breast cancer IMC improved from 18m07s to 11m56s

Outstanding issues
* The centroids.pickle artifact doesn't index on histological_structure ID like the changed classes now expect. @jimmymathews please follow the process you used to create the original test/ondemand/test_expression_data/ to create replacements that use the new format.

Some other things I'd like to do if given the time
* Define the CompressedDataArrays format as a NamedTuple instead of a modifiable dict. For compatibility, NamedTuples can be accessed as dicts if you do namedtuple._asdict(), but I don't think this supports fields with spaces in them, as is the case with every single entry in CompressedDataArrays.
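For what it's worth, the field-name restriction is easy to confirm (a standalone illustration, not tied to the actual CompressedDataArrays keys):

from collections import namedtuple

# namedtuple field names must be valid Python identifiers, so keys containing
# spaces are rejected at definition time.
namedtuple('DataArrays', ['pixel x'])
# ValueError: Type names and field names must be valid identifiers: 'pixel x'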