Closed · CarlinLiao closed 1 year ago
Stray thoughts not solely related to this PR:
The stray thought about a cell-set-specific version of the FME functionality pertains to this topic, since such functionality is needed to get the most-important-cells' features efficiently. Since the cell sets involved are small, this is on-demand functionality that could effectively be provided directly by the API server (rather than by the specialized ondemand services).
The cell-set-specific version of FME is sufficiently distinct from the sample- or study-specific version currently implemented that it might be worth a separate module, rather than trying to mutate the FME implementation even further to support such fine granularity. Like a `CellSetFeatureExtractor`.
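To make the idea concrete, here is a purely hypothetical stub; no such class exists yet, and the name and interface are just the shape suggested above:

```python
class CellSetFeatureExtractor:
    """Hypothetical companion to FeatureMatrixExtractor, scoped to a small,
    explicit set of cells rather than a whole sample or study."""

    def __init__(self, database_config_file: str) -> None:
        self.database_config_file = database_config_file

    def extract(self, study: str, histological_structure_ids: set[int]) -> dict[int, dict[str, float]]:
        """Return per-channel feature values for just the given structures."""
        raise NotImplementedError
```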
What is the different handling you're referring to for `ValidChannelListPositives`, `ValidChannelListNegatives`, ...?
The implementation here uses `FeatureMatrixExtractor` to pull from the database, and as you've commented in the code, this would be more efficient if it were cell-set-specific.
I think I actually might recommend a different implementation for performance reasons: altering one of the ondemand services. How about updating `count_structures_of_partial_signed_signature` to accept an optional specimen-to-histological-structure-ids argument? Ultimately the change would be in `get_count`:
In case the cell set is provided as a new non-trivial argument, the loop `for entry in integers` would be modified to an indexed loop with an extra conditional, something like `... and entry_index in cell_set`.
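A minimal sketch of that change, with the signature-matching details guessed for illustration (the real `get_count` internals may differ):

```python
def get_count(
    integers: list[int],
    positives_mask: int,
    negatives_mask: int,
    cell_set: set[int] | None = None,  # proposed new optional argument
) -> int:
    count = 0
    for entry_index, entry in enumerate(integers):
        # New conditional: skip cells outside the requested cell set.
        if cell_set is not None and entry_index not in cell_set:
            continue
        # Assumed partial-signed-signature check via bitmasks.
        if (entry & positives_mask) == positives_mask and (entry & negatives_mask) == 0:
            count += 1
    return count
```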
(In this case, we might need to update the TCP client/server pattern here to use a proper JSON payload, rather than the custom string format with record separators etc. Otherwise it would be too difficult to provide the cell sets data from the client call.)
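A plain JSON body would make the cell sets easy to carry; the field names here are illustrative only, not the actual protocol:

```python
import json

# Hypothetical request shape for the counts service, replacing the
# record-separator string format; all field names are made up.
request = json.dumps({
    'study': 'Melanoma intralesional IL2',
    'positive_channels': ['CD3', 'CD8'],
    'negative_channels': ['FOXP3'],
    'cell_sets': {'lesion 0_1': [101, 102, 103]},
})
```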
A thought not directly related to this PR: we advertise the ADI/SPT stack as being database software-agnostic, but a lot of the SPT backend is SQL queries, specifically PostgreSQL queries that might not work in other SQL dialects, never mind another format like MongoDB. Or is it just the single-cell schema that we're saying isn't tied solely to SQL?
The bottleneck here in FME is probably just as much about having to calculate the centroids of each cell from the shapefile as it is about having to aggregate all the different channels' information. That's not necessary for this command, so we could just toggle that off in `extract`.
However, most of the time is spent pulling millions of cell-channel records from the database, so switching to using the ondemand artifacts is probably the way to go.
Another note: why are some API endpoints in `app/main` called "get" and others "requests"? There doesn't seem to be a real difference between the two. Perhaps there could be a refactor where one is meant to be instantaneous and the other is supposed to wait on a Provider response.
`request` endpoints request forking a background computation that typically takes more than a few seconds. These are the ones with `pending` indicators in the response. That was supposed to be the point of this terminology distinction.

A stray note: In all the datasets we have dealt with so far, I have enforced the condition that every sample should have the same set of measured channels available, and for this reason the "negative" or "0" expression value records for these channels are redundant. The `expression_quantification` table could actually be sparse, as opposed to now, where we have tons of 0 values recorded explicitly.
About the `Provider`-route implementation: I took a crack at it and have realized that #212 is a recommended prerequisite. It's difficult to select out a specific cell subset unless the histological structure ID is saved as the index of the expression artifacts.
Why is this a list and not a tuple? I think it might be for JSON reasons, because it gets returned via the web, but I'm not sure. Would a tuple break this functionality?
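A quick check of the JSON round-trip concern, assuming the value passes through the standard `json` module: tuples serialize fine, but JSON only has arrays, so they come back as lists either way:

```python
import json

payload = json.dumps((1, 2, 3))  # tuples are accepted: '[1, 2, 3]'
restored = json.loads(payload)   # ...but deserialize as a list
assert restored == [1, 2, 3]
```

So a tuple shouldn't break serialization, though anything comparing the round-tripped value by type would see a list.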
All tests pass in the latest commit. However, the important cell subselection won't actually work until #212 is merged with this branch, as the artifact that the CountsProvider reads in still needs to be updated to index the compressed expression integers by histological_structure ID.
I think this should work as is, but I need to write a new test for it.
In the meantime, I'm trying to run the existing test suite on the current state. I think they all work, but I can't get them to all work in one `make test` run. The operation hangs on the proximity pipeline `workflow` module test, which I don't think should be affected by the new changes. That, and I ran `make module-test-workflow` with no failures.
This isn't the first time I've had a full test run hang on proximity pipeline, so I think it's an environment thing outside of the changes. Any idea what could be going on?
Are you sure it's hanging? It's long running. Maybe 1.5 minutes or something.
Usually `VERBOSE=1` shows what's happening if there is a real hang.
It was running when I commented an hour ago and it's still running. It's also not cancelable; I have to close the entire terminal.
By the way, sometimes one module or the other will say `env not setup` because it took more than 30 attempts to connect, when in fact the env just took longer than that to come up rather than failing to connect at all.
After restarting my entire computer and updating docker, all currently written tests pass. Next, the new test. How do I upload a dummy importance ranking to the test database? Something in `build/`, I assume?
There are several test database images, with initialization scripts located at `build/build_scripts/import_test_dataset*.sh`.
You could add some lines to the scripts corresponding to the images that are used in your new test. I guess you would run `spt cggnn upload ...` there. BTW, you should update `pyproject.toml` with the new scripts for the cggnn module.
Then you would probably run `make force-rebuild-data-loaded-images` at least once.
This latest commit adds `spt cggnn upload-importance` to the build scripts, and updates an `expected_table_counts.txt` to match the expected output. (I've skipped adding `cggnn upload` to build scripts where both datasets 1 and 2 are uploaded, because in that situation there would need to be some logic to modify the importance score CSV's `histological_structure` ID of the dataset that's second to upload.)

But it fails during the build process. I may have placed the `upload-importance` command in the wrong place, or there could be an issue with the implementation.
```
[ INFO ] db.create_data_analysis_study: Inserted data analysis study: "Melanoma intralesional IL2 - measurement : cell importance : 2023-09-28 20:29:16.166202"
/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/importance_score_transcriber.py:69: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
lookup = read_sql("""
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/upload.py", line 43, in <module>
transcribe_importance(df, connection, cohort_stratifier=args.cohort_stratifier)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/importance_score_transcriber.py", line 38, in transcribe_importance
_add_slide_column(connection, df)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/importance_score_transcriber.py", line 81, in _add_slide_column
df['specimen'] = reindexed.loc[df.index, 'specimen']
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1097, in __getitem__
return self._getitem_tuple(key)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1280, in _getitem_tuple
return self._getitem_lowerdim(tup)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1024, in _getitem_lowerdim
return getattr(section, self.name)[new_key]
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1103, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1332, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1272, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexing.py", line 1462, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py", line 5876, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py", line 5935, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 690, 691, 692, 693, 694, 695, 696, 697, 698, 699],\n dtype='int64', name='histological_structure', length=700)] are in the [index]"
```
The error is pretty clear: in `df['specimen'] = reindexed.loc[df.index, 'specimen']`, the indices provided in `df.index` (plus `'specimen'`?) are not found in the index of `reindexed`. Following up on this, I see that `df` came from `read_csv` and pandas assumed that the values were integers, converting them to `int64`, whereas the SQL query forming `reindexed` comes back with `str` values.
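A self-contained reproduction of that mismatch, with the column names taken from the traceback and the fix being a dtype cast on one side or the other:

```python
from io import StringIO
from pandas import DataFrame, read_csv

# read_csv infers int64 for the histological_structure index...
df = read_csv(
    StringIO('histological_structure,importance\n0,0.9\n1,0.4\n'),
    index_col='histological_structure',
)
# ...while the SQL lookup returns identifiers as str.
reindexed = DataFrame({'specimen': ['lesion 0_1', 'lesion 0_1']}, index=['0', '1'])

df.index = df.index.astype(str)  # align dtypes before the .loc lookup
df['specimen'] = reindexed.loc[df.index, 'specimen']
```

Alternatively, `read_csv(..., dtype={'histological_structure': str})` avoids the inference in the first place.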
Whew, I was afraid your last commit did what I already did and was going to push.
My last commits quash every error from the old tests. The new one is still having issues, though. The query I wrote to retrieve the IDs of important histological structures isn't returning anything for some reason. Using `self.cursor.mogrify` I can confirm that the query appears to be set up correctly, but it isn't finding anything. As far as I can tell, this should be hitting `test-data-loaded1`, which has importances uploaded, right? But there's nothing in the feature table.
Example query I inserted to see what the tables had in them. No rows were printed in the output:

```sql
SELECT
    *
FROM quantitative_feature_value qfv
JOIN feature_specification fs
    ON fs.identifier=qfv.feature
WHERE fs.derivation_method='For a given cohort stratification variable (the specifier), the integer rank of each cell (the subjects of the feature) with respect to the importance scores derived from a GNN trained on this variable.'
    AND fs.study='Melanoma intralesional IL2 - data analysis'
    -- AND qfv.value < 50 -- Importance is 0-indexed
;
```
`test_transcribe_importance` passes for me. Perhaps you need to do:

`make force-rebuild-data-loaded-images`
That test works; I mean the new module test in apiserver to retrieve importance cell counts for a given phenotype, `test_count_importance`.
I changed the two total expected counts values hard coded in the test, since the test seemed to be pulling non-trivial values correctly. I didn't manually verify that the changed values are correct (i.e. by looking at the original CSV files).
Are the feature values actually integer ranks? It seems the uploader uploads values straight from the CSV, and the examples committed are floats. I think this could cause the selection limitation with `cell_limit=...` to fail.

Edit: I see that `_group_and_filter` is supposed to create the importance-order ints.
I think your SQL query's restriction to the `components.analysis` data analysis study is not what is intended. The `get_study_components` convenience function gets "the" primary data analysis study, whereas the importance score upload explicitly used its own custom data analysis study.
What's the best way to retrieve the custom CGGNN data analysis study? Could this be added to `StudyAccess`?
And yes, the upload importance score functions translate float importance scores into int rankings.
I think I spotted the issue. Here is `study_component`:
```
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - measurement')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - specimen collection')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - data analysis')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 : phenotype fractions : 2023-09-29 18:04:52.839308')
('Melanoma intralesional IL2 - measurement', 'Melanoma intralesional IL2 - measurement : cell importance : 2023-09-29 18:05:29.221342')
('Melanoma intralesional IL2', 'Melanoma intralesional IL2 - ondemand computed features')
```
The function that infers the study from a dataframe doesn't infer all the way up to the main study. Working on it.
Since the API endpoint doesn't support selection of which iteration of GNN runs to retrieve importance scores for, I think it's safe for now to assume there will be just one CGGNN substudy (even though the DB disambiguates in theory, as there is one specifier available for identifying the cohort stratification scheme used before training the GNN). So it's fine to just select by `derivation_method` like we are doing now.
The update to the main counts-retrieval function to accept cell sets breaks the simple caching functionality that saved a lot of time. This caching is implemented as a decorator that checks whether the arguments have been seen before, but if the arguments include a `set`, the arguments tuple is not hashable. For this reason I'm rolling this back to a list or tuple.
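A minimal illustration of the hashability constraint, using `functools.lru_cache` as a stand-in for the caching decorator (note that a list argument would hit the same `TypeError` as a set, so a tuple or frozenset is the safe choice):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_count(study: str, cells) -> int:
    return len(cells)

cached_count('study A', (1, 2, 3))      # OK: tuples are hashable
try:
    cached_count('study A', {1, 2, 3})  # sets are not
except TypeError as error:
    print(error)                        # "unhashable type: 'set'"
```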
Doesn't this section set the importance rank integer globally, not per specimen?
It seems that the TCP transmission is getting arbitrarily truncated. In the query to the counts service I see all 50 * 7 cell IDs in the payload, but in the query received there is only a portion of them.
Well darn, there was a hardcoded byte limit for these TCP requests. I guess I was trying to limit the risk of a security issue. I raised the limit.
Now everything is pretty close, I'm just seeing slight discrepancies in the counts for each specimen. Plus or minus 2 or 3 cells. Will follow up next week.
Good catch on the `importance_order` calculation. Not only did it set the rank globally instead of per specimen, it also gave the smallest scores the highest rank instead of the largest. I fixed it based on this StackOverflow post, and did a little code golf to avoid grouping twice: once to get the higher-ranked values and again to calculate the exact ranks. It does mean we're calculating ranks for cells we're not returning, but since it's simple addition I doubt it's much of a drag. The CSVs are so simple the whole thing probably runs in trivial time anyway, even if we did group twice.
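The per-specimen, descending ranking can be had in one grouped pass; this is a sketch of that shape with assumed column names, not the committed code:

```python
from pandas import DataFrame

df = DataFrame({
    'specimen': ['s1', 's1', 's2', 's2'],
    'importance_score': [0.9, 0.1, 0.5, 0.7],
})
# Rank within each specimen, largest score first; subtract 1 so the
# top-ranked cell gets 0 (importance is 0-indexed).
df['importance_order'] = (
    df.groupby('specimen')['importance_score']
    .rank(ascending=False, method='first')
    .astype(int)
    - 1
)
```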
All tests pass, including a test for each specimen's count for those top 50.
Ready to merge?
Pursuant to #209.