CarlinLiao closed this 1 year ago.
After doing some digging, at least one major error lies with OnDemandProvider. CompressedMatrixWriter has been refactored to save the compressed expression integers as a dict mapping histological_structure ID ints to those integers, whereas it originally saved just an undecorated list of integers. OnDemandProvider still expects the latter format when reading in the compressed matrices. Fixing this to handle the new format should resolve the issue, although I'm finding it tricky to do so because this bytes expression is difficult for me to understand. @jimmymathews Could you take it from here?
I think it'd be something like replacing file.read(8) with something like file.readline(), assuming the JSON output puts the ID and expression integer on the same line, splitting on b',', and then doing the join/list/reversed/bin/rjust operation on the second entry. (And also saving the structure ID for use as the new index.)
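For reference, the join/list/reversed/bin/rjust step I'm referring to decodes one of those packed integers back into per-channel flags, roughly like this (an illustrative sketch assuming 26 channels as in the test data, not the exact code in OnDemandProvider):

def decode_expression(value: int, number_channels: int = 26) -> list[int]:
    # Unpack a packed phenotype integer into 0/1 flags, least significant bit
    # first, so channel 0 corresponds to bit 0.
    bits = bin(value)[2:].rjust(number_channels, '0')
    return [int(bit) for bit in reversed(bits)]

decode_expression(5)[:3]  # [1, 0, 1], i.e. channels 0 and 2 are expressed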
Thanks for implementing these improvements and doing speed testing. I'm actually a little surprised there is a speedup in moving to dataframe vectorization, because the previous implementation was flat list iteration with no indexing overhead.
But then again, you did not do speed testing for the actual access operations (like count_structures_of_partial_signed_signature), correct? (Since, as you indicated, it does not yet run.) Just the FeatureMatrixExtractor part? This is just data structure creation, so I wonder what makes it 30% faster?
The original purpose of CompressedDataArrays was to reduce the memory footprint of the static file format, i.e. compression. I think saving each int together with a string identifier (they are not ints) will more than double the memory usage. This might be OK because it is not very large in absolute terms.
I'm not completely convinced of the overall advantage of abandoning the flat array of ints as the filesystem data structure. You can accomplish your goal of constructing the indexed Pandas dataframe by saving an additional file artifact, an array of the string identifiers. The performance is probably comparable, so what justifies the refactoring effort? The already-implemented ondemand services do not require the functionality you're proposing, and may even slow down by being forced to use it. (At some point Pandas must iterate over values for all-rows operations so it should be strictly slower to operate on the dataframe unless Pandas silently parallelizes, but I don't think it does this without being requested to do so.)
I think you may be thinking of the performance advantage of Pandas vector operations over Pandas row-wise operations, which is probably high in typical cases. Hence the usual "don't use iterrows" advice. However, there is no reason to expect Pandas vector operations to be faster than primitive data structure operations.
* Unless there is some evidence of actual hardware vectorization (SIMD in NumPy, Stack Overflow)... after poking around it seems that it might not really be implemented in NumPy or Pandas for typical uses.
The snippet below shows Pandas vectorization to be about 10 times faster on my machine than primitive data structure for adding two columns with 1 million rows:
import time
import random
from pandas import DataFrame

def performance_timer(func):
    def wrapper(argument):
        start = time.time()
        func(argument)
        end = time.time()
        print(f'{end-start} seconds for {func.__name__}.')
    return wrapper

@performance_timer
def primitive_adding(_rows):
    C = [row[0] + row[1] for row in _rows]

@performance_timer
def vector_pandas_adding(_df):
    _df['C'] = _df['A'] + _df['B']

size = pow(10, 6)
rows = [(random.choice([1.5, 2, 3]), random.choice([5, 6, 7.5])) for i in range(size)]
df = DataFrame(rows, columns=['A', 'B'])
primitive_adding(rows)
vector_pandas_adding(df)
0.07324385643005371 seconds for primitive_adding.
0.007153034210205078 seconds for vector_pandas_adding.
So I stand corrected: the ondemand operations which can be reduced to arithmetic (I believe the counts service can) should be able to gain a lot from this implementation.
A little more testing shows that this advantage kicks in at around 10k rows.
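If it's useful, the crossover can be scanned for by reusing the two timing functions from the snippet above (results will vary by machine; 10k is just what I observed):

for exponent in range(3, 7):
    size = pow(10, exponent)
    rows = [(random.choice([1.5, 2, 3]), random.choice([5, 6, 7.5])) for i in range(size)]
    df = DataFrame(rows, columns=['A', 'B'])
    print(f'--- {size} rows ---')
    primitive_adding(rows)
    vector_pandas_adding(df)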
It is pretty easy to update the centroid.pickle artifact because the loader recreates it automatically if this file is not present.
Currently I'm addressing an issue where the centroids are all NaN except for the first sample, from the point of view of the Providers.
spt-ondemand-testing | lesion 0_2
spt-ondemand-testing | B2M B7H3 CD68 CD8 DAPI FOXP3 IDO1 KI67 LAG3 MHCI MHCII MRC1 CD14 PD1 PDL1 S100B SOX10 TGM2 TIM3 CD163 CD20 CD25 CD27 CD3 CD4 CD56 integer pixel x pixel y
spt-ondemand-testing | 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 NaN NaN
spt-ondemand-testing | 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 101 NaN NaN
spt-ondemand-testing | 2 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 102 NaN NaN
spt-ondemand-testing | 3 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 103 NaN NaN
...
Also the "integer" column looks like it's just an incrementing row index, not the int representation of the phenotypes data.
I think the binary expression files are written with trivial incrementing index ints. Here are the first few phenotypes expression values after decoding:
spt-ondemand-testing | lesion 0_1
spt-ondemand-testing | B2M B7H3 CD68 CD8 DAPI FOXP3 IDO1 KI67 LAG3 MHCI MHCII MRC1 CD14 PD1 PDL1 S100B SOX10 TGM2 TIM3 CD163 CD20 CD25 CD27 CD3 CD4 CD56 integer pixel x pixel y
spt-ondemand-testing | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4935.0 12.0
spt-ondemand-testing | 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 5290.0 6.0
spt-ondemand-testing | 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 5320.0 59.0
spt-ondemand-testing | 23 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4859.0 109.0
spt-ondemand-testing | 34 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 5135.0 127.0
spt-ondemand-testing | 45 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5142.0 153.0
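For reference, the decoded rows above are consistent with simple least-significant-bit-first packing, along the lines of this sketch (channel order assumed to match the column order shown):

def encode_expression(channel_flags: list[int]) -> int:
    # Pack 0/1 flags into one integer with channel 0 in bit 0, so B2M-only -> 1,
    # B7H3-only -> 2, B2M+B7H3 -> 3, CD68-only -> 4, matching the table above.
    return sum(flag << position for position, flag in enumerate(channel_flags))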
Also, the dataframe in the provider class doesn't seem to actually implement the indexing on the histological_structure identifier that you proposed.
Do you mean the read-in of the artifact in Provider? I haven't implemented that because I wasn't able to understand the functionality, see here.
No, the creation of the artifact, which happens automatically on first use assuming these files are not all already present.
The basic reason the vectorized implementation is faster, as I understand it, is that plain Python, being meant as a glue language that does a lot of convenient things like allocating memory for you, is slow, while pandas, which uses C, the low-level performance language that allows finer control of memory, is fast. (pandas uses numpy for holding data.)
As for artifact creation, that could also be chalked up to me not fully understanding the startup-to-runtime workflow. Do you mean in OnDemandProvider or CompressedMatrixWriter?
Yeah, these remarks are about expression_data_array.0.1.bin etc. These are saved as cache files to speed loading of the services. They are saved by CompressedMatrixWriter. In the snippet you reference, the loop for entry in data_array is no longer the right thing when data_array becomes a dict, because the values are ignored. I'm fixing this up and related issues.
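The shape of the fix, as I understand it (names approximate), is to iterate over items rather than over the dict itself, since iterating a dict yields only its keys:

# Old, when data_array was a flat list of packed ints:
#     for entry in data_array:
#         ...
# New, when data_array maps histological_structure ID to the packed int:
for structure_id, entry in data_array.items():
    ...  # use `entry` as before, and keep `structure_id` for the index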
The lack of indexing of the dataframes on histological structure ID was preventing your new assignment line ... = DataFrame.from_dict(... centroid stuff) from taking effect, causing all those NaNs.
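A toy illustration of the pandas alignment behavior behind those NaNs (made-up values): column assignment aligns on index labels, so a Series carrying a reset 0-based index contributes nothing to a frame indexed by structure IDs.

from pandas import DataFrame, Series

expression = DataFrame({'CD8': [1, 0, 1]}, index=[100, 101, 102])  # structure IDs
centroids = Series([4935.0, 5290.0, 5320.0], index=[0, 1, 2])      # reset index
expression['pixel x'] = centroids
print(expression['pixel x'])  # all NaN: labels 100-102 never match 0-2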
Oh I see. Every 8 bytes is a new entry, because each int is 64 bits so we don't need to think about newlines or anything.
Yes. (Commit message above should say "first 8 bytes" not "first byte")
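Concretely, the read loop would then be something like the following (a sketch; the byte order and exact record layout are my assumptions, not the actual CompressedMatrixWriter format):

values = []
with open('expression_data_array.0.1.bin', 'rb') as file:
    while True:
        chunk = file.read(8)        # one 64-bit integer per entry
        if len(chunk) < 8:
            break
        values.append(int.from_bytes(chunk, 'little'))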
Are all tests passing for you? I'm seeing some errors in fast_cache_assessment that don't affect whether or not the test passes, but I don't think they're intended.
[ ERROR ] ondemand.fast_cache_assessor:214: Study "Breast cancer IMC - measurement" not mentioned in centroids file.
And apiserver proximity module test fails. Investigating...
compute_proximity_metric_for_signature_pair uses BallTree, which looks only at the underlying numpy array (no index use), and sends back results with 0-based integer indices that need to be translated back to our histological structure IDs.
The test you are referring to includes a test of the error-generating capability in case some files are missing. Those error messages are good; they mean that the deficient condition can be detected.
I think it should be as easy as switching it from a bare index reference to iloc, i.e. numerical indexing?
sum(mask2.iloc[index] for index in list(indices))
I don't think so, because the indices are defined in reference to the argument to BallTree.query_radius, which is only a subset of the cells DataFrame. Actually, this is incorrect: the indices are defined in reference to the original data provided to the BallTree constructor, so I think this is right.
I made the iloc change locally and the test passed.
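For anyone following along, a sketch of the indexing subtlety (cells, mask1, mask2, and the radius are illustrative names, not the actual code): BallTree only ever sees the raw coordinate array it was constructed from, so query_radius returns 0-based positions into that array, which iloc interprets correctly and label-based lookup would not.

from sklearn.neighbors import BallTree

tree = BallTree(cells[['pixel x', 'pixel y']].to_numpy())
subset = cells[mask1]  # query with only a subset of the cells
neighbor_lists = tree.query_radius(subset[['pixel x', 'pixel y']].to_numpy(), r=30)
for neighbors in neighbor_lists:
    # positions refer to the constructor data (all of `cells`), not the subset
    count = sum(mask2.iloc[index] for index in list(neighbors))
    structure_ids = cells.index[neighbors]  # translate positions back to IDs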
All tests pass on my end. I think we're ready to merge.
I do have some ideas for more acceleration methods... if we could find some way to vectorize shapefile centroid calculation we'd be in business! Also, I wonder if using DataFrame.from_sql would be faster than using a cursor to make the query and then creating a DataFrame from the output. And you mentioned earlier that there are places to vectorize inside the Providers too?
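On the DataFrame.from_sql idea: the pandas entry point is actually pandas.read_sql, and a minimal sketch of the comparison would look like this (connection, query, and column names are illustrative; pandas prefers a SQLAlchemy engine but accepts a DBAPI connection with a warning):

import pandas as pd

# Current pattern: cursor fetch, then build the DataFrame from the rows.
with connection.cursor() as cursor:
    cursor.execute('SELECT histological_structure, value FROM some_table ;')
    rows = cursor.fetchall()
df_from_cursor = pd.DataFrame(rows, columns=['histological_structure', 'value'])

# Candidate pattern: let pandas build the DataFrame directly from the query.
df_from_read_sql = pd.read_sql(
    'SELECT histological_structure, value FROM some_table ;',
    connection,
    index_col='histological_structure',
)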
My original thought about shapefiles back when I first wrote it was to cache a binary template concocted from the output of the shapefile library, since only a couple of numerical values change from shapefile to shapefile. This could then be used for reading and writing really fast.
I think the more reasonable and easier thing is to add a performance_tweaks.sql table that just lists structure ID, pixel x, and pixel y, populated during data import, leaving the shapefiles around for the future's automated reasoners.
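A rough sketch of what I mean by that table (table and column names are illustrative, and `cursor`/`centroids` stand in for whatever the import workflow has in hand):

cursor.execute('''
    CREATE TABLE IF NOT EXISTS performance_tweaks (
        histological_structure VARCHAR PRIMARY KEY,
        pixel_x NUMERIC,
        pixel_y NUMERIC
    ) ;
''')
cursor.executemany(
    'INSERT INTO performance_tweaks VALUES (%s, %s, %s) ;',
    [(structure_id, x, y) for structure_id, (x, y) in centroids.items()],
)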
Now that the tests are passing I'm going to actually review the changes. We should consider also adding a test that touches upon the exact thing that the changes do.
>>> import numpy as np
>>> import numpy.typing as npt
>>> print(npt.NDArray)
numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]]
>>> print(npt.NDArray[np.float64])
numpy.ndarray[typing.Any, numpy.dtype[numpy.float64]]
>>> NDArrayInt = npt.NDArray[np.int_]
>>> a: NDArrayInt = np.arange(10)
>>> from typing import Any
>>> def func(a: npt.ArrayLike) -> npt.NDArray[Any]:
...     return np.array(a)
I took a crack at implementing a test, but it fails. I'm not sure if the failure is because of the test I wrote or the functionality of FME/SparseMatrixPuller.
My thought process:
* histological_structure IDs are assigned in order, e.g., the last cell in 0.csv has ID 99 or 100, and the first cell in the next specimen, 3.csv, has ID 100 or 101.
* The IDs may be stored as str/VARCHAR instead of int as the Python process handles them, but that doesn't work either.
How are the IDs stored? @jimmymathews
Directly investigating using print statements, '1', '5', '150', and '151' are all valid IDs, so I'm thinking it might be something with the way psycopg translates my tuple list of IDs to check for into the query... it seems to do more than just fill in the %s values. At the very least, it surrounds them with quotation marks.
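One way to see exactly what psycopg2 is sending (query and column names are illustrative): cursor.mogrify returns the query after parameter adaptation, and a Python tuple of strings gets rendered as a quoted, parenthesized list.

ids = ('1', '5', '150', '151')
print(cursor.mogrify(
    'SELECT * FROM histological_structure WHERE identifier IN %s ;',
    (ids,),
))
# b"SELECT * FROM histological_structure WHERE identifier IN ('1', '5', '150', '151') ;"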
By the way, I think I've asked this before but why are we using psycopg2 and not psycopg3?
All tests, including the new one checking if selection by cell ID works, pass.
Looks good, ready to merge.
This PR makes two major changes:
* SparseMatrixPuller is refactored from vanilla Python for-looping, used to create the compressed channel expression, to vectorized pandas DataFrame and numpy array operations, with the goal of speeding up calculation. This is partially pursuant to #175.
* Support is added for selecting specific histological_structure IDs when pulling from SparseMatrixPuller.

In addition to the above, I made a minor change to enable the major ones:
* SparseMatrixPuller and StructureCentroidPuller are refactored to index their output by histological_structure ID instead of a reset ID. This isn't strictly necessary, but I felt that this was a good way to ensure that the centroids match the expression arrays instead of relying solely on the order staying static. This is also used for functionality in #211.

This PR brings something like a 30% speedup to the FME process, at the potential cost of more RAM and disk space used, due to having to sustain DataFrames instead of vanilla Python arrays and indexing by histological_structure ID, respectively.
* feature matrix extraction had similar runtimes, 2s to 2s.
* Melanoma CyTOF ICI improved from 4m21s to 3m25s
* Breast cancer IMC improved from 18m07s to 11m56s

Outstanding issues
* The centroids.pickle artifact doesn't index on histological_structure ID like the changed classes now expect. @jimmymathews please follow the process you used to create the original test/ondemand/test_expression_data/ to create replacements that use the new format.

Some other things I'd like to do if given the time
* Define the CompressedDataArrays format as a NamedTuple instead of a modifiable dict. For compatibility, NamedTuples can be accessed as dicts if you do namedtuple._asdict(), but I don't think this supports fields with spaces in them, as is the case with every single entry in CompressedDataArrays.
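For what it's worth, the field-name restriction is easy to confirm (a standalone illustration, not tied to the actual CompressedDataArrays keys):

from collections import namedtuple

# namedtuple field names must be valid Python identifiers, so keys containing
# spaces are rejected at definition time.
namedtuple('DataArrays', ['pixel x'])
# ValueError: Type names and field names must be valid identifiers: 'pixel x'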