Closed · CarlinLiao closed 1 year ago
Stray thoughts not solely related to this PR:
The stray thought about a cell-set-specific version of the FME functionality pertains to this topic, since such functionality is needed to get the most-important-cells' features efficiently. Since the cell sets involved are small, this is on-demand functionality that could effectively be provided directly by the API server (rather than by the specialized ondemand services).
The cell-set-specific version of FME is sufficiently distinct from the sample- or study-specific version currently implemented that it might be worth a separate module, rather than trying to mutate the FME implementation even further to support such fine granularity. Like a `CellSetFeatureExtractor`.
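To make the idea concrete, here is a purely hypothetical stub; no such class exists yet, and the name and interface are just the shape suggested above:

```python
class CellSetFeatureExtractor:
    """Hypothetical companion to FeatureMatrixExtractor, scoped to a small,
    explicit set of cells rather than a whole sample or study."""

    def __init__(self, database_config_file: str) -> None:
        self.database_config_file = database_config_file

    def extract(self, study: str, histological_structure_ids: set[int]) -> dict[int, dict[str, float]]:
        """Return per-channel feature values for just the given structures."""
        raise NotImplementedError
```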
What is the different handling you're referring to for `ValidChannelListPositives`, `ValidChannelListNegatives`, ...?
The implementation here uses `FeatureMatrixExtractor` to pull from the database, and as you've commented in the code, this would be more efficient if it were cell-set-specific.
I think I actually might recommend a different implementation for performance reasons: altering one of the ondemand services. How about updating `count_structures_of_partial_signed_signature` to accept an optional specimen-to-histological-structure-ids argument? Ultimately the change would be in `get_count`:
In case the cell set is provided as a new non-trivial argument, the loop `for entry in integers` would be modified to an indexed loop with an extra conditional, something like `... and entry_index in cell_set`.
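A minimal sketch of that change, with the signature-matching details guessed for illustration (the real `get_count` internals may differ):

```python
def get_count(
    integers: list[int],
    positives_mask: int,
    negatives_mask: int,
    cell_set: set[int] | None = None,  # proposed new optional argument
) -> int:
    count = 0
    for entry_index, entry in enumerate(integers):
        # New conditional: skip cells outside the requested cell set.
        if cell_set is not None and entry_index not in cell_set:
            continue
        # Assumed partial-signed-signature check via bitmasks.
        if (entry & positives_mask) == positives_mask and (entry & negatives_mask) == 0:
            count += 1
    return count
```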
(In this case, we might need to update the TCP client/server pattern here to use a proper JSON payload, rather than the custom string format with record separators etc. Otherwise it would be too difficult to provide the cell sets data from the client call.)
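A plain JSON body would make the cell sets easy to carry; the field names here are illustrative only, not the actual protocol:

```python
import json

# Hypothetical request shape for the counts service, replacing the
# record-separator string format; all field names are made up.
request = json.dumps({
    'study': 'Melanoma intralesional IL2',
    'positive_channels': ['CD3', 'CD8'],
    'negative_channels': ['FOXP3'],
    'cell_sets': {'lesion 0_1': [101, 102, 103]},
})
```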
A thought not directly related to this PR: we advertise the ADI/SPT stack as being database software-agnostic, but a lot of the SPT backend is SQL queries, specifically PostgreSQL queries that might not work in other SQL dialects, never mind another format like MongoDB. Or is it just the single-cell schema that we're saying isn't tied solely to SQL?
The bottleneck here in FME is probably just as much about having to calculate the centroids of each cell from the shapefile as it is about having to aggregate all the different channels' information. That's not necessary for this command, so we could just toggle that off in `extract`.
However, most of the time is spent pulling millions of cell-channel records from the database, so switching to using the ondemand artifacts is probably the way to go.
Another note: why are some API endpoints in `app/main` called "get" and others "requests"? There doesn't seem to be a real difference between the two. Perhaps there could be a refactor where one is meant to be instantaneous and the other is supposed to wait on a Provider response.
`request` endpoints request forking a background computation that typically takes more than a few seconds. These are the ones with `pending` indicators in the response. That was supposed to be the point of this terminology distinction.

A stray note: In all the datasets we have dealt with so far, I have enforced the condition that every sample should have the same set of measured channels available, and for this reason the "negative" or "0" expression value records for these channels are redundant. The `expression_quantification` table could actually be sparse, as opposed to now, where we have tons of 0 values recorded explicitly.
About the `Provider`-route implementation: I took a crack at it and have realized that #212 is a recommended prerequisite. It's difficult to select out a specific cell subset unless the histological structure ID is saved as the index of the expression artifacts.
Why is this a list and not a tuple? I think it might be for JSON reasons, because it gets returned via the web, but I'm not sure. Would a tuple break this functionality?
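A quick check of the JSON round-trip concern, assuming the value passes through the standard `json` module: tuples serialize fine, but JSON only has arrays, so they come back as lists either way:

```python
import json

payload = json.dumps((1, 2, 3))  # tuples are accepted: '[1, 2, 3]'
restored = json.loads(payload)   # ...but deserialize as a list
assert restored == [1, 2, 3]
```

So a tuple shouldn't break serialization, though anything comparing the round-tripped value by type would see a list.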
All tests pass in the latest commit. However, the important cell subselection won't actually work until #212 is merged with this branch, as the artifact that the CountsProvider reads in still needs to be updated to index the compressed expression integers by histological_structure ID.
I think this should work as is, but I need to write a new test for it.
In the meantime, I'm trying to run the existing test suite on the current state. I think they all work, but I can't get them to all work in one `make test` run. The operation hangs on the proximity pipeline `workflow` module test, which I don't think should be affected by the new changes. That, and I ran `make module-test-workflow` with no failures.
This isn't the first time I've had a full test run hang on proximity pipeline, so I think it's an environment thing outside of the changes. Any idea what could be going on?
Are you sure it's hanging? It's long running. Maybe 1.5 minutes or something.
Usually `VERBOSE=1` shows what's happening if there is a real hang.
It was running when I commented an hour ago and it's still running. It's also not cancelable; I have to close the entire terminal.
By the way, sometimes one module or the other will say `env not setup` because it took more than 30 attempts to connect, when in fact the env just took longer than that to come up rather than failing to connect at all.
After restarting my entire computer and updating docker, all currently written tests pass. Next, the new test. How do I upload a dummy importance ranking to the test database? Something in `build/`, I assume?
There are several test database images, with initialization scripts located at `build/build_scripts/import_test_dataset*.sh`.
You could add some lines to the scripts corresponding to the images that are used in your new test. I guess you would run `spt cggnn upload ...` there. BTW, you should update `pyproject.toml` with the new scripts for the cggnn module.
Then you would probably run `make force-rebuild-data-loaded-images` at least once.
This latest commit adds `spt cggnn upload-importance` to the build scripts, and updates an `expected_table_counts.txt` to match the expected output. (I've skipped adding `cggnn upload` to build scripts where both datasets 1 and 2 are uploaded, because in that situation there would need to be some logic to modify the importance score CSV's `histological_structure` ID of the dataset that's second to upload.)

But it fails during the build process. I may have placed the `upload-importance` command in the wrong place, or there could be an issue with the implementation.
```
[ INFO ] db.create_data_analysis_study: Inserted data analysis study: "Melanoma intralesional IL2 - measurement : cell importance : 2023-09-28 20:29:16.166202"
/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/importance_score_transcriber.py:69: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
lookup = read_sql("""
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/upload.py", line 43, in <module>
transcribe_importance(df, connection, cohort_stratifier=args.cohort_stratifier)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/importance_score_transcriber.py", line 38, in transcribe_importance
_add_slide_column(connection, df)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/importance_score_transcriber.py", line 81, in _add_slide_column
df['specimen'] = reindexed.loc[df.index, 'specimen']
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1097, in __getitem__
return self._getitem_tuple(key)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1280, in _getitem_tuple
return self._getitem_lowerdim(tup)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1024, in _getitem_lowerdim
return getattr(section, self.name)[new_key]
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1103, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1332, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1272, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1462, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py", line 5876, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py", line 5935, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 690, 691, 692, 693, 694, 695, 696, 697, 698, 699],\n dtype='int64', name='histological_structure', length=700)] are in the [index]"
```
The error is pretty clear: in `df['specimen'] = reindexed.loc[df.index, 'specimen']`, the indices provided in `df.index` (plus `'specimen'`?) are not found in the index of `reindexed`. Following up on this, I see that `df` came from `read_csv` and pandas assumed that the values were integers, converting them to `int64`, whereas the SQL query forming `reindexed` comes back with `str` values.
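A self-contained reproduction of that mismatch, with the column names taken from the traceback and the fix being a dtype cast on one side or the other:

```python
from io import StringIO
from pandas import DataFrame, read_csv

# read_csv infers int64 for the histological_structure index...
df = read_csv(
    StringIO('histological_structure,importance\n0,0.9\n1,0.4\n'),
    index_col='histological_structure',
)
# ...while the SQL lookup returns identifiers as str.
reindexed = DataFrame({'specimen': ['lesion 0_1', 'lesion 0_1']}, index=['0', '1'])

df.index = df.index.astype(str)  # align dtypes before the .loc lookup
df['specimen'] = reindexed.loc[df.index, 'specimen']
```

Alternatively, `read_csv(..., dtype={'histological_structure': str})` avoids the inference in the first place.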
Whew, I was afraid your last commit did what I already did and was going to push.
My last commits quash every error from the old tests. The new one is still having issues, though. The query I wrote to retrieve the IDs of important histological structures isn't returning anything for some reason. Using `self.cursor.mogrify` I can confirm that the query appears to be set up correctly, but it isn't finding anything. As far as I can tell, this should be hitting `test-data-loaded1`, which has importances uploaded, right? But there's nothing in the feature table.
Example query I inserted to see what the tables had in them. No rows were printed in the output:

```sql
SELECT
    *
FROM quantitative_feature_value qfv
JOIN feature_specification fs
    ON fs.identifier=qfv.feature
WHERE fs.derivation_method='For a given cohort stratification variable (the specifier), the integer rank of each cell (the subjects of the feature) with respect to the importance scores derived from a GNN trained on this variable.'
    AND fs.study='Melanoma intralesional IL2 - data analysis'
    -- AND qfv.value < 50 -- Importance is 0-indexed
;
```
`test_transcribe_importance` passes for me. Perhaps you need to do:

`make force-rebuild-data-loaded-images`
That test works; I mean the new module test in apiserver to retrieve importance cell counts for a given phenotype, `test_count_importance`.
I changed the two total expected counts values hard coded in the test, since the test seemed to be pulling non-trivial values correctly. I didn't manually verify that the changed values are correct (i.e. by looking at the original CSV files).
Are the feature values actually integer ranks? It seems the uploader uploads values straight from the CSV, and the examples committed are floats. I think this could cause the selection limitation with `cell_limit=...` to fail.

Edit: I see that `_group_and_filter` is supposed to create the importance-order ints.
I think your SQL query's restriction to the `components.analysis` data analysis study is not what is intended. The `get_study_components` convenience function gets "the" primary data analysis study, whereas the importance score upload explicitly used its own custom data analysis study.
What's the best way to retrieve the custom CGGNN data analysis study? Could this be added to `StudyAccess`?
And yes, the upload importance score functions translate float importance scores into int rankings.
I think I spotted the issue. Here is `study_component`:
```
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - measurement')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - specimen collection')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - data analysis')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 : phenotype fractions : 2023-09-29 18:04:52.839308')
('Melanoma intralesional IL2 - measurement', 'Melanoma intralesional IL2 - measurement : cell importance : 2023-09-29 18:05:29.221342')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - ondemand computed features')
```
The function that infers the study from a dataframe doesn't infer all the way up to the main study. Working on it.
Since the API endpoint doesn't support selection of which iteration of GNN runs to retrieve importance scores for, I think it's safe for now to assume there will be just one CGGNN substudy (even though the DB disambiguates in theory, as there is one specifier available for identifying the cohort stratification scheme used before training the GNN). So it's fine to just select by `derivation_method` like we are doing now.
The update to the main counts-retrieval function to accept cell sets breaks the simple caching functionality that saved a lot of time. This caching is implemented as a decorator that checks whether the arguments have been seen before, but if the arguments include a `set`, the arguments tuple is not hashable. For this reason I'm rolling this back to a list or tuple.
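A minimal illustration of the hashability constraint, using `functools.lru_cache` as a stand-in for the caching decorator (note that a list argument would hit the same `TypeError` as a set, so a tuple or frozenset is the safe choice):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_count(study: str, cells) -> int:
    return len(cells)

cached_count('study A', (1, 2, 3))      # OK: tuples are hashable
try:
    cached_count('study A', {1, 2, 3})  # sets are not
except TypeError as error:
    print(error)                        # "unhashable type: 'set'"
```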
Doesn't this section set the importance rank integer globally, not per specimen?
It seems that the TCP transmission is getting arbitrarily truncated. In the query to the counts service I see all 50 * 7 cell IDs in the payload, but in the query received there is only a portion of them.
Well darn, there was a hardcoded byte limit for these TCP requests. I guess I was trying to limit the risk of a security issue. I raised the limit.
Now everything is pretty close, I'm just seeing slight discrepancies in the counts for each specimen. Plus or minus 2 or 3 cells. Will follow up next week.
Good catch on the `importance_order` calculation. Not only did it set the rank globally instead of per specimen, it also gave the smallest scores the highest rank instead of the largest. I fixed it based on this StackOverflow post, and did a little code golf to avoid grouping twice: once to get the higher-ranked values and again to calculate the exact ranks. It does mean we're calculating ranks for cells we're not returning, but since it's simple addition I doubt it's much of a drag. The CSVs are so simple the whole thing probably runs in trivial time anyway, even if we did group twice.
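The per-specimen, descending ranking can be had in one grouped pass; this is a sketch of that shape with assumed column names, not the committed code:

```python
from pandas import DataFrame

df = DataFrame({
    'specimen': ['s1', 's1', 's2', 's2'],
    'importance_score': [0.9, 0.1, 0.5, 0.7],
})
# Rank within each specimen, largest score first; subtract 1 so the
# top-ranked cell gets 0 (importance is 0-indexed).
df['importance_order'] = (
    df.groupby('specimen')['importance_score']
    .rank(ascending=False, method='first')
    .astype(int)
    - 1
)
```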
All tests pass, including a test for each specimen's count for those top 50.
Ready to merge?
Pursuant to #209.