Installed this branch to test its functionality. It errored on this line after testing with "Breast cancer IMC":
Traceback (most recent call last):
File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 39, in <module>
df_cell, df_label, label_to_result = extract_cggnn_data(args.spt_db_config_location, args.study)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 50, in extract_cggnn_data
{slide: data['dataframe'] for slide, data in study_data['feature matrices'].items()},
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'feature matrices'
study_data is extractor.extract(study=study), so I don't see why it would break here.
Hm, it seems that FeatureMatrixExtractor is pedantically interested in the specimen_measurement_study names only. This is related to the fact that it does not bother to look for composite phenotypes, since these are part of the data analysis study phase, not the data measurement phase of the study.
Still, it would be much more convenient if FeatureMatrixExtractor took responsibility for looking up the measurement study name from the overall study handle string. In the current implementation, you have to look up the measurement study name yourself (most likely "Breast cancer IMC - measurement"?).
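For illustration, a lookup along these lines might do it (a sketch only; I'm assuming a specimen_measurement_study table whose names are registered in study_component, which may not match the actual schema):

from pandas import read_sql
from spatialprofilingtoolbox.db.database_connection import DatabaseConnectionMaker

def lookup_measurement_study(study_handle: str) -> str:
    # Hypothetical: resolve the measurement substudy name from the overall
    # study handle, assuming substudies are registered in study_component.
    query = r'''
    SELECT sms.name
    FROM specimen_measurement_study sms
    JOIN study_component sc ON sc.component_study=sms.name
    WHERE sc.primary_study=%s
    ;
    '''
    with DatabaseConnectionMaker() as dcm:
        names = read_sql(query, dcm.get_connection(), params=(study_handle,))['name']
    return names.iloc[0]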
Yes. In the SQL query workflow of the standalone cg-gnn pip package, it queries the measurement study for the targets and shape strings, the analysis study for phenotypes, and the specimen collection study for the stratification. The FeatureMatrixExtractor has elements sourced from all three (sub)studies, so it's interesting that they aren't wholly reflected in the current implementation.
Yes, it is a legacy aspect (like the "Feature Matrix" in the name), reflecting its original purpose of providing a familiar feature matrix representation of the portion of the dataset which is representable this way. The complex phenotype features would be consistent with the feature matrix conceptualization.
I wrote a test for the extract function and it errors because SPT is written in Python 3.11 but the Docker image for spt cggnn is Python 3.7 for DGL and PyTorch compatibility. This causes the test to error on newer syntax like match (introduced in Python 3.10) in FeatureMatrixExtractor. How should we proceed?
(It also complains about unary operators in these lines, but they're identical to lines in other tests, so I assume the Bash syntax errors aren't breaking anything and are only reported when something else breaks, as above.)
[ $status -eq 0 ] || echo "cggnn extract failed."
if [ $status -eq 0 ];
Well, 3.7 is pretty old, released in 2018. According to this, DGL now supports Python 3.11, and I believe PyTorch does too.
I'm trying a newer base image than pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime, and it seems to be building OK. I had to rebuild the cg-gnn wheel since it is built with a declared requirement of python ~3.9.
As for the unary operator complaint, it is because your status=... line does not return anything, so status is empty. I changed it to:
$([ $? -gt 0 ] && [ -e "label_to_results.json" ] && [ -e "cells.h5" ] && [ -e "labels.h5" ])
status="$?"
...
and the error reports correctly (no unary operator complaint).
The new docker image works for me, and the test extraction works. I pushed a proof of concept to branch cggnn_dfs3.11. It relies on a locally-built cg-gnn wheel that removes the python ~3.9 constraint.
Extraction shouldn't run into any issues; it's the run pipeline I'm worried about. Histocartography model training and testing use esoteric features of DGL and PyTorch that break in newer versions of both and of Python (as of my testing last year, at least). I had to install specific older versions of DGL and PyTorch, not the newest versions. (In fact, I wouldn't be surprised if they were the reason we're only getting zero importance scores.)
Were you able to take another look at the complex phenotype extraction as well as the stratification method?
If the dependency conflicts are persistent, it probably makes sense for us to keep major functionality isolated to separate CLI tool instances, not requiring the same python runtime for SPT and for cg-gnn.
I guess that's what you had in mind by creating HDF5 files. It seems fine for this extraction-specific code to live in the spt cggnn submodule, where it will run best on Python 3.11.
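For instance, the handoff could be as simple as dumping the extracted DataFrames to HDF5 (a minimal sketch; the cells.h5/labels.h5 file names follow the test artifacts mentioned earlier in the thread, the key names and toy data are placeholders):

import pandas as pd

df_cell = pd.DataFrame({'CD3': [1, 0]}, index=['cell1', 'cell2'])  # per-cell features
df_label = pd.DataFrame({'label': [0]}, index=['specimen1'])       # per-specimen labels
df_cell.to_hdf('cells.h5', key='cells')
df_label.to_hdf('labels.h5', key='labels')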
I'm going to test whether the latest changes in the histocartography repo fix the dependency issues, but in the event they don't help, yes, we'll need to further isolate cggnn running, either by removing spt cggnn run entirely or by having it call a separate container instance that has Python 3.9 and the cg-gnn pip package.
What would you like me to look at for complex phenotype extraction and stratification? As for phenotype extraction, it seems that what is needed is a classifier utility that pulls out the signatures and assesses each cell in the cell dataframe for membership in the class associated with each given signature. That should be pretty easy. This snippet will get you started with pulling phenotypes:
from pandas import read_sql
from spatialprofilingtoolbox.db.database_connection import DatabaseConnectionMaker
from spatialprofilingtoolbox.db.exchange_data_formats.metrics import PhenotypeCriteria

class PhenotypeDefinitionPuller:
    @staticmethod
    def _get_phenotype(criteria_dataframe):
        positives, negatives = [
            sorted(map(str, list(criteria_dataframe[
                criteria_dataframe['coded_value'] == coded_value
            ]['channel_symbol'])))
            for coded_value in (1, 0)
        ]
        return PhenotypeCriteria(positive_markers=positives, negative_markers=negatives)

    @classmethod
    def pull(cls, study):
        query = r'''
        SELECT
            cp.name AS phenotype_name,
            cs.symbol AS channel_symbol,
            CASE WHEN cpc.polarity='positive' THEN 1 ELSE 0 END AS coded_value
        FROM cell_phenotype_criterion cpc
        JOIN cell_phenotype cp ON cp.identifier=cpc.cell_phenotype
        JOIN chemical_species cs ON cs.identifier=cpc.marker
        JOIN study_component sc ON sc.component_study=cpc.study
        WHERE sc.primary_study='%s'
        ;
        '''
        with DatabaseConnectionMaker() as dcm:
            df = read_sql(query % study, dcm.get_connection())
        return {
            key: cls._get_phenotype(group)
            for key, group in df.groupby('phenotype_name')
        }
phenotypes = PhenotypeDefinitionPuller.pull('Breast cancer IMC')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)

phenotypes = PhenotypeDefinitionPuller.pull('Melanoma intralesional IL2')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)
If you have a local db running, you can run the above as
SINGLE_CELL_DATABASE_HOST=localhost SINGLE_CELL_DATABASE_USER=postgres SINGLE_CELL_DATABASE_PASSWORD=postgres python snippet.py
The output:
Luminal
positive_markers=['ESR1', 'PGR'] negative_markers=[]
Myoepithelial
positive_markers=['ACTA2', 'CK', 'EGFR', 'KRT14', 'MYC'] negative_markers=[]
Proliferative
positive_markers=['MKI67'] negative_markers=[]
Adipocyte or Langerhans cell
positive_markers=['S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']
B cell
positive_markers=['CD20'] negative_markers=['CD3', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']
CD163+MHCII+ macrophage
positive_markers=['CD163', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']
CD163+MHCII- macrophage
positive_markers=['CD163'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']
CD4+ T cell
positive_markers=['CD3', 'CD4'] negative_markers=['CD20', 'CD56', 'CD8', 'FOXP3', 'SOX10']
CD4+ natural killer T cell
positive_markers=['CD3', 'CD4', 'CD56'] negative_markers=['CD20', 'CD8', 'FOXP3', 'SOX10']
CD4+ regulatory T cell
positive_markers=['CD3', 'CD4', 'FOXP3'] negative_markers=['CD20', 'CD56', 'CD8', 'SOX10']
CD4+/CD8+ T cell
positive_markers=['CD3', 'CD4', 'CD8'] negative_markers=['CD20', 'CD56', 'FOXP3', 'SOX10']
CD68+MHCII+ macrophage
positive_markers=['CD68', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']
CD68+MHCII- macrophage
positive_markers=['CD68'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']
CD8+ T cell
positive_markers=['CD3', 'CD8'] negative_markers=['CD20', 'CD4', 'CD56', 'FOXP3', 'SOX10']
CD8+ natural killer T cell
positive_markers=['CD3', 'CD56', 'CD8'] negative_markers=['CD20', 'CD4', 'FOXP3', 'SOX10']
CD8+ regulatory T cell
positive_markers=['CD3', 'CD8', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'SOX10']
Double negative regulatory T cell
positive_markers=['CD3', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'SOX10']
Natural killer T cell
positive_markers=['CD3', 'CD56'] negative_markers=['CD20', 'CD4', 'CD8', 'FOXP3', 'SOX10']
Natural killer cell
positive_markers=['CD56'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3', 'S100B', 'SOX10']
Nerve
positive_markers=['CD56', 'S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD68', 'CD8', 'FOXP3', 'SOX10']
Other macrophage/monocyte CD14+
positive_markers=['CD14'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']
Other macrophage/monocyte CD4+
positive_markers=['CD4'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']
T cell/null phenotype
positive_markers=['CD3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']
Tumor
positive_markers=['SOX10'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3']
I've yet to set up a workflow that gives me a local db to query. Perhaps a tutorial in docs/ would be good for that.
Is your suggestion to add functionality to spt cggnn extract so that it generates complex phenotype features itself (thus the snippet example), or to extend FeatureMatrixExtractor so that it extracts complex phenotypes in addition to simple ones?
I think the functionality is appropriate for a new class or functions near FeatureMatrixExtractor.
If you want deep integration with FeatureMatrixExtractor, such that the cell dataframe has new columns for each phenotype, you'll have to make some decision about naming these new columns. You could just use the long names as in the snippet, but as you have pointed out above, this hides the distinction between the original-channel-associated columns and the phenotype columns. (Note that "simple" and "complex" phenotype is personal terminology of my own that will be confusing to most others. "Phenotype" refers to the latter concept normally.)
I would probably keep the phenotype columns in a separate dataframe to avoid that problem.
If the phenotypes are kept in a separate DataFrame, then FeatureMatrixExtractor would just return another DataFrame in its result dictionary, right? That's simple to understand. spt cggnn extract can handle merging the chemical species(?) and phenotypes together in its own logic.
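Something like this join is what I have in mind, if it comes to that (a toy example; the column names and values are made up):

import pandas as pd

# Merge the channel and phenotype DataFrames on the shared cell index.
channels = pd.DataFrame({'CD3': [1, 1, 0], 'CD8': [1, 0, 0]}, index=[1, 2, 3])
phenotypes = pd.DataFrame({'CD8+ T cell': [1, 0, 0]}, index=[1, 2, 3])
cells = channels.join(phenotypes)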
Would you like me to try implementing this, or would you rather do it yourself since you're more familiar with the architecture of the extractor?
Sure, please give it a try. I don't think it requires too much architectural knowledge. And yeah I was imagining another dataframe in the dict, beside the existing one.
(I'm thinking of the primary columns as "channels" as in "imaging channels". The "chemical species" terminology is the pedantic description of the target indicated by the imaging modality. The column name is the protein's gene name, so yeah this is the name of the "chemical species" indicated, but again no one else is going to call this the "chemical species".)
I'm not sure we're aligned on what a sufficient amount of architectural knowledge is, but I'll give it a crack.
For the phenotype definition puller, note that this is substantially implemented already here:
Usage example from the querying.py module:
PhenotypesAccess(cursor).get_phenotype_criteria(study, phenotype_symbol)
Tests now pass. Is this mergeable? You could make an issue out of the planned work related to outcome-variable provision, so as not to block merging of this PR (which is just about the major extractions from the db).
This should be ready to merge, pending tests passing.
The new build no longer errors, but the FME test fails because the DataFrame shape isn't as expected; I suspect the new phenotype functionality adds columns to the expected DataFrame. Can we confirm that Melanoma intralesional IL2 as represented in the tests has 2 phenotypes?
Confirmed that the four additional columns are due to the four phenotypes picked up by the accessor. (Whether or not 4 phenotypes is accurate is another question.)
Something I'm noticing in the logs: the channels/features are floats and not binary or true/false values, which I assumed they were when doing the positive and negative matching against the phenotype signatures. Consequently, every phenotype for every cell is reporting as a 1. What should be done instead?
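One option I can think of is to binarize before matching (a sketch; the 0.5 cutoff is a placeholder of mine, not anything established in this codebase):

import pandas as pd

def binarize(channels: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # Hypothetical: convert continuous channel intensities into boolean
    # positivity calls before signature matching. The right cutoff should
    # come from the study's thresholding metadata, not a constant.
    return channels > threshold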
Why is the test for db/FeatureMatrixExtractor in test/workflow?
Perhaps we should stop referring to chemical species / channels as "features", in order to disambiguate between the features being extracted (channels, phenotypes, strata, etc.) and the chemicals themselves. And if we could standardize on either species or channel, that would also be clarifying.
This includes replacing the f'F{i}' prefix with something else, like f'C{i}'.
Tests now pass for me.
What's the purpose of FeatureMatrixExtractor.redact_dataframes()? @jimmymathews
It makes it possible to create a JSON dump of everything else, for inspection. (It is not essential.)
Ok, I've changed some functionality in ways that arise from each other:
- I narrowed the extract function's concern to focus solely on returning cell information, as the stratification return result was often unused (such as in the case of a single-specimen extraction), making it unnecessary work. Stratification information is now provided by a separate function, extract_cohorts.
- Reworking extract gave me the opportunity to be more precise about the return type, so now it returns a dict mapping specimen string names to MatrixBundles, which are a dataclass with three properties: dataframe, filename, and continuous_dataframe (optional). See the sketch after this list.
- I removed redact_dataframes after getting halfway through a replacement implementation that created a wrapping dict for the sole purpose of dumping it. It shouldn't be too hard to bring this functionality back, but I feel like its sole purpose now would just be to print the constructed cell DataFrame filenames.

I think this is a really good change for the sake of clarity, but I may not have fixed the tests correctly. I think I caught all downstream changes from the return type, at least.
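For reference, the shape of the new return type is roughly the following (a sketch; the field types are my paraphrase of the description above, not copied from the code):

from dataclasses import dataclass
from typing import Optional
from pandas import DataFrame

@dataclass
class MatrixBundle:
    dataframe: DataFrame                              # binary per-cell feature matrix
    filename: str                                     # file the matrix was written to
    continuous_dataframe: Optional[DataFrame] = None  # continuous counterpart, if requested

# extract(...) now returns dict[str, MatrixBundle], keyed by specimen name.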
Next on the docket is adding the proper stratification interpretation to cggnn extract, AKA the original purpose of this PR before I scope-creeped it with all these modifications to FeatureMatrixExtractor.
I do remember something about dataclasses not playing nice with FastAPI or something. If so, it should be an easy enough fix by replacing it with a NamedTuple or something similar that can still promise fixed fields (as opposed to mutable-key dictionaries, which I think we should avoid).
I think dataclasses are fine; the only FastAPI thing is that FastAPI is extra helpful when the response types are hinted with pydantic types. And query parameters too, I guess.
Ok, I don't think FME and FastAPI touch at all either.
Resolved picking out strata to classify on by adding some fake interactivity: the cggnn module prints out the entire strata output from FME for the user to observe. From that, they can (manually) pick the strata they want to use by their identifier and then pass them to the extract command (by typing them one by one after the argument flag).
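Roughly, the interaction looks like this (a sketch; the flag name, help text, and the printed placeholder are illustrative, not the exact CLI):

import argparse

parser = argparse.ArgumentParser(description='cggnn extract (sketch)')
parser.add_argument('--strata', nargs='*', type=int, default=None,
                    help='Stratum identifiers to use; omit to print the available strata and exit.')
args = parser.parse_args()

if args.strata is None:
    # First pass: print the strata table from FME so the user can note identifiers.
    print('(strata DataFrame from FeatureMatrixExtractor would print here)')
else:
    print(f'Extracting specimens belonging to strata: {args.strata}')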
unit-test-cggnn and unit-test-workflow pass for me. Are we ready to merge?
I'll do tests and review now.
Test test_autocomputed_squidpy.py fails. I think what is going on is that the feature identifiers minted for these "autocomputed squidpy metrics" are not consistent with the ones that were minted before the changes, for whatever reason. This could happen if new metrics are computed beforehand, using up the identifier namespace, or if some are no longer computed beforehand. We could relax the test and just ensure that the set of feature vectors (identifiers omitted) is still the same. That would mean making changes here to omit identifiers from the expected row set and the actual row set before testing for equality.
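Concretely, the relaxation could be something like the following (the rows are made-up data for illustration):

def strip_identifiers(rows, identifier_index=0):
    # Compare feature vectors as a set, ignoring the minted identifier column.
    return {tuple(value for i, value in enumerate(row) if i != identifier_index) for row in rows}

expected = [('101', 'B2M', 0.25), ('102', 'CD3', 0.5)]
actual = [('205', 'CD3', 0.5), ('206', 'B2M', 0.25)]
assert strip_identifiers(expected) == strip_identifiers(actual)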
I'm also getting a test failure for test_proximity_pipeline.sh:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/scripts/aggregate_core_results.py", line 24, in <module>
integrator.calculate(**args)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/phenotype_proximity/integrator.py", line 36, in calculate
self.export_feature_values(core_computation_results_files, data_analysis_study)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/phenotype_proximity/integrator.py", line 66, in export_feature_values
self.send_features_to_uploader(feature_uploader, core_computation_results_files)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/phenotype_proximity/integrator.py", line 72, in send_features_to_uploader
stage_proximity_feature_values(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/proximity.py", line 70, in stage_proximity_feature_values
_phenotype_identifier_lookup(row['Phenotype 1'], channel_symbols_by_column_name),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/proximity.py", line 59, in _phenotype_identifier_lookup
raise ValueError(f'Did not understand meaning of specifier: {handle}')
ValueError: Did not understand meaning of specifier: C B2M
It seems that the "API" break for FeatureMatrixExtractor (specifically the column-to-name lookup) isn't accounted for.
Fixing test_proximity_pipeline should have been as simple as fixing some regex, but now it's failing in the same feature specification area as test_autocomputed_squidpy. Hopefully this'll make it easier for me to debug, since any squidpy tests will auto-fail on my machine because squidpy can't install on my MacBook.
The latest commit should fix the inconsistently minted feature specifications without relaxing the tests. (The tests that can run on my M1 Mac pass, but that leaves out a lot of tests.)
Almost all tests pass now, and merging this PR will quash several open issues. The only exceptions are the proximity and squidpy module tests in apiserver, which run but produce different values than expected. I'm not totally sure why, but this could be a case of updated computations leading to new values rather than the output being wrong.
The tests that check computed feature values (squidpy and proximity) are failing for the first time at the commit that introduces a new way to compute masks; see these lines:
Most likely these masks differ. I'll try to find out which one is correct, and if the new one is not correct we can correct it.
The masks in this version were wrong according to one example I checked, and the previously computed masks were correct. I think this was due to the .all pandas function being applied to 0 and 1 values. I converted to bool in the previous commit, and the mask in the example I checked is now correct.
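The corrected pattern is roughly this (a sketch with made-up argument names, not the actual implementation):

import pandas as pd

def phenotype_mask(cells: pd.DataFrame, positives: list, negatives: list) -> pd.Series:
    # Cast the 0/1-valued columns to bool before combining, then require all
    # positive markers present and all negative markers absent.
    as_bool = cells[positives + negatives].astype(bool)
    return as_bool[positives].all(axis=1) & (~as_bool[negatives]).all(axis=1)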
The squidpy metrics test now passes. Working on proximity metrics test.
All tests pass!
This creates a spt cggnn extract command that pulls information cg-gnn needs from an SPT db instance and saves it to two DataFrames and a JSON.