nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images
https://oncopathtk.org

create cg-gnn extraction script #185

Closed CarlinLiao closed 1 year ago

CarlinLiao commented 1 year ago

This creates a spt cggnn extract command that pulls the information cg-gnn needs from an SPT db instance and saves it to two DataFrames and a JSON file.

CarlinLiao commented 1 year ago

Installed this branch to test its functionality. It errored on this line after testing with "Breast cancer IMC":

Traceback (most recent call last):
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 39, in <module>
    df_cell, df_label, label_to_result = extract_cggnn_data(args.spt_db_config_location, args.study)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 50, in extract_cggnn_data
    {slide: data['dataframe'] for slide, data in study_data['feature matrices'].items()},
                                                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'feature matrices'

study_data is extractor.extract(study=study), so I don't see why it would break here.

jimmymathews commented 1 year ago

Hm, it seems that FeatureMatrixExtractor is pedantically interested in the specimen_measurement_study names only. This is related to the fact that it does not bother to look for composite phenotypes, since these are part of the data analysis phase of the study, not the data measurement phase.

Still, it would be much more convenient if FeatureMatrixExtractor took responsibility for looking up the measurement study name from the overall study handle string. In the current implementation, you have to look up the measurement study name yourself (most likely "Breast cancer IMC - measurement"?).
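
A minimal stopgap sketch in the meantime, assuming the guessed naming convention above holds (extractor.extract as used in the traceback; not a stable API):

measurement_study = f'{study} - measurement'  # guessed naming convention
study_data = extractor.extract(study=measurement_study)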

CarlinLiao commented 1 year ago

Yes. In the SQL query workflow of the standalone cg-gnn pip package, it queries the measurement study for the targets and shape strings, the analysis study for phenotypes, and the specimen collection study for the stratification. The FeatureMatrixExtractor has elements sourced from all three (sub)studies, so it's interesting that they aren't wholly reflected in the current implementation.

jimmymathews commented 1 year ago

Yes, it is a legacy aspect (like the "Feature Matrix" in the name), reflecting its original purpose of providing a familiar feature matrix representation of the portion of the dataset which is representable this way.

CarlinLiao commented 1 year ago

The complex phenotype features would be consistent with the feature matrix conceptualization.

CarlinLiao commented 1 year ago

I wrote a test for the extract function, and it errors because SPT is written in Python 3.11 but the Docker image for spt cggnn is Python 3.7, for DGL and PyTorch compatibility. This causes the test to error on new 3.11 syntax like match in FeatureMatrixExtractor. How should we proceed?

(It also complains about unary operators in the lines below, but these are identical to lines in other tests, so I assume such Bash syntax errors aren't actually breaking anything and are only reported when something else breaks, as above.)

[ $status -eq 0 ] || echo "cggnn extract failed."

if [ $status -eq 0 ];

jimmymathews commented 1 year ago

Well, 3.7 is pretty old, released in 2018. According to this, DGL now supports Python 3.11. PyTorch, I believe, also does. I'm trying a newer base image than pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime, and it seems to be building OK. I had to rebuild the cg-gnn wheel since it is built with a declared requirement of python ~3.9.

jimmymathews commented 1 year ago

As for the unary operator complaint, it is because your status=... line does not return anything, so status is empty. I changed it to:

$([ $? -gt 0 ] && [ -e "label_to_results.json" ] && [ -e "cells.h5" ] && [ -e "labels.h5" ])
status="$?"
...

and the error reports correctly (no unary operator complaint).
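
A minimal reproduction of the unary-operator failure mode, as a sketch assuming plain POSIX test brackets:

status=""
[ $status -eq 0 ]          # expands to "[ -eq 0 ]": "-eq: unary operator expected"
[ "${status:-1}" -eq 0 ]   # quoting, with a default operand, avoids the complaint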

jimmymathews commented 1 year ago

The new docker image works for me, the test extraction works. I pushed a proof of concept to branch cggnn_dfs3.11. It relies on a locally-built cg-gnn wheel that removes the python ~3.9 constraint.

CarlinLiao commented 1 year ago

Extraction shouldn't run into any issues; it's the run pipeline I'm worried about. Histocartography model training and testing use esoteric features of DGL and PyTorch that break in newer versions of both libraries and of Python (as of my testing last year, at least). I had to install specific older versions of DGL and PyTorch, not the newest versions. (In fact, I wouldn't be surprised if they were the reason we're only getting zero importance scores...)

CarlinLiao commented 1 year ago

Were you able to take another look at the complex phenotype extraction as well as the stratification method?

jimmymathews commented 1 year ago

If the dependency conflicts are persistent, it probably makes sense for us to keep major functionality isolated to separate CLI tool instances, not requiring the same python runtime for SPT and for cg-gnn.

jimmymathews commented 1 year ago

I guess that's what you had in mind by creating HDF5 files. It seems fine for this extraction-specific code to be in the spt cggnn submodule, where it will run best on Python 3.11.
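
For concreteness, a sketch of consumption on the cg-gnn side, assuming the artifact names seen in the test above (cells.h5, labels.h5, label_to_results.json) and that pandas' HDF5 support (the pytables package) is available:

from json import load

from pandas import read_hdf

cells = read_hdf('cells.h5')    # cell feature DataFrame
labels = read_hdf('labels.h5')  # specimen label DataFrame
with open('label_to_results.json', encoding='utf-8') as f:
    label_to_result = load(f)   # the label-to-result mapping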

CarlinLiao commented 1 year ago

I'm going to test whether the latest changes in the histocartography repo fix the dependency issues, but in the event they don't help, yes, we'll need to further isolate cggnn running, either by removing spt cggnn run entirely or by having it call a separate container instance that has Python 3.9 and the cg-gnn pip package.

jimmymathews commented 1 year ago

What would you like me to look at for complex phenotype extraction and stratification? As for phenotype extraction, it seems that what is needed is a classifier utility that pulls out the signatures and assesses each cell in the cell dataframe for membership in the class associated with each given signature. That should be pretty easy. This snippet will get you started with pulling phenotypes:

import json

from pandas import read_sql

from spatialprofilingtoolbox.db.database_connection import DatabaseConnectionMaker
from spatialprofilingtoolbox.db.exchange_data_formats.metrics import PhenotypeCriteria

class PhenotypeDefinitionPuller:
    @staticmethod
    def _get_phenotype(criteria_dataframe):
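        # Markers coded 1 form the positive criteria; markers coded 0 the negative.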
        positives, negatives = [
            sorted(map(str, list(criteria_dataframe[
                criteria_dataframe['coded_value']==coded_value
            ]['channel_symbol'])))
            for coded_value in (1, 0)
        ]
        return PhenotypeCriteria(positive_markers=positives, negative_markers=negatives)

    @classmethod
    def pull(cls, study):
        query = r'''
        SELECT
            cp.name as phenotype_name,
            cs.symbol as channel_symbol,
            CASE WHEN cpc.polarity='positive' THEN 1 ELSE 0 END AS coded_value
        FROM cell_phenotype_criterion cpc
        JOIN cell_phenotype cp ON cp.identifier=cpc.cell_phenotype
        JOIN chemical_species cs ON cs.identifier=cpc.marker
        JOIN study_component sc ON sc.component_study=cpc.study
        WHERE sc.primary_study='%s'
        ;
        '''
        with DatabaseConnectionMaker() as dcm:
            df = read_sql(query % study, dcm.get_connection())
        return {
            key: cls._get_phenotype(group)
            for key, group in df.groupby('phenotype_name')
        }

phenotypes = PhenotypeDefinitionPuller.pull('Breast cancer IMC')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)

phenotypes = PhenotypeDefinitionPuller.pull('Melanoma intralesional IL2')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)

If you have a local db running, you can run the above as

SINGLE_CELL_DATABASE_HOST=localhost SINGLE_CELL_DATABASE_USER=postgres SINGLE_CELL_DATABASE_PASSWORD=postgres python snippet.py

The output:

Luminal
positive_markers=['ESR1', 'PGR'] negative_markers=[]

Myoepithelial
positive_markers=['ACTA2', 'CK', 'EGFR', 'KRT14', 'MYC'] negative_markers=[]

Proliferative
positive_markers=['MKI67'] negative_markers=[]

Adipocyte or Langerhans cell
positive_markers=['S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

B cell
positive_markers=['CD20'] negative_markers=['CD3', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD163+MHCII+ macrophage
positive_markers=['CD163', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD163+MHCII- macrophage
positive_markers=['CD163'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']

CD4+ T cell
positive_markers=['CD3', 'CD4'] negative_markers=['CD20', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD4+ natural killer T cell
positive_markers=['CD3', 'CD4', 'CD56'] negative_markers=['CD20', 'CD8', 'FOXP3', 'SOX10']

CD4+ regulatory T cell
positive_markers=['CD3', 'CD4', 'FOXP3'] negative_markers=['CD20', 'CD56', 'CD8', 'SOX10']

CD4+/CD8+ T cell
positive_markers=['CD3', 'CD4', 'CD8'] negative_markers=['CD20', 'CD56', 'FOXP3', 'SOX10']

CD68+MHCII+ macrophage
positive_markers=['CD68', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD68+MHCII- macrophage
positive_markers=['CD68'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']

CD8+ T cell
positive_markers=['CD3', 'CD8'] negative_markers=['CD20', 'CD4', 'CD56', 'FOXP3', 'SOX10']

CD8+ natural killer T cell
positive_markers=['CD3', 'CD56', 'CD8'] negative_markers=['CD20', 'CD4', 'FOXP3', 'SOX10']

CD8+ regulatory T cell
positive_markers=['CD3', 'CD8', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'SOX10']

Double negative regulatory T cell
positive_markers=['CD3', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'SOX10']

Natural killer T cell
positive_markers=['CD3', 'CD56'] negative_markers=['CD20', 'CD4', 'CD8', 'FOXP3', 'SOX10']

Natural killer cell
positive_markers=['CD56'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3', 'S100B', 'SOX10']

Nerve
positive_markers=['CD56', 'S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD68', 'CD8', 'FOXP3', 'SOX10']

Other macrophage/monocyte CD14+
positive_markers=['CD14'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

Other macrophage/monocyte CD4+
positive_markers=['CD4'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

T cell/null phenotype
positive_markers=['CD3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']

Tumor
positive_markers=['SOX10'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3']
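
A minimal sketch of the membership-assessment step described above, assuming a cells DataFrame with one 0/1 column per channel symbol and a PhenotypeCriteria as returned by the puller (phenotype_membership is a hypothetical helper, not existing SPT API):

from pandas import DataFrame, Series

def phenotype_membership(cells: DataFrame, criteria) -> Series:
    # Coerce the 0/1 columns to bool so that .all/.any test marker status.
    binary = cells[criteria.positive_markers + criteria.negative_markers].astype(bool)
    positive = binary[criteria.positive_markers].all(axis=1)
    negative = ~binary[criteria.negative_markers].any(axis=1)
    return positive & negative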

CarlinLiao commented 1 year ago

I've yet to set up a workflow that gives me a local db to query. Perhaps a tutorial in docs/ would be good for that.

Is your suggestion to add functionality to spt cggnn extract so that it generates complex phenotype features itself (hence the snippet example), or to extend FeatureMatrixExtractor so that it extracts complex phenotypes in addition to simple ones?

jimmymathews commented 1 year ago

I think the functionality is appropriate for a new class or functions near FeatureMatrixExtractor. If you want deep integration with FeatureMatrixExtractor, such that the cell dataframe has new columns for each phenotype, you'll have to make some decisions about naming these new columns. You could just use the long names as in the snippet, but as you have pointed out above, this hides the distinction between the original-channel-associated columns and the phenotype columns. (Note that "simple" and "complex" phenotype is personal terminology of my own that will be confusing to most others; "phenotype" normally refers to the latter concept.) I would probably keep the phenotype columns in a separate dataframe to avoid that problem.

CarlinLiao commented 1 year ago

If the phenotypes are kept in a separate DataFrame, then FeatureMatrixExtractor would just return another DataFrame in its result dictionary, right? That's simple to understand. spt cggnn extract can handle merging the chemical species(?) and phenotypes together in its own logic.

Would you like me to try implementing this, or would you rather do it yourself since you're more familiar with the architecture of the extractor?

jimmymathews commented 1 year ago

Sure, please give it a try. I don't think it requires too much architectural knowledge. And yeah I was imagining another dataframe in the dict, beside the existing one.

(I'm thinking of the primary columns as "channels" as in "imaging channels". The "chemical species" terminology is the pedantic description of the target indicated by the imaging modality. The column name is the protein's gene name, so yeah this is the name of the "chemical species" indicated, but again no one else is going to call this the "chemical species".)
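
A sketch of the extended return shape under discussion, extrapolating from the 'feature matrices' structure seen in the traceback above; the 'phenotypes' key is hypothetical:

study_data = {
    'feature matrices': {
        'specimen 1': {
            'dataframe': channels_dataframe,     # existing: one 0/1 column per channel
            'phenotypes': phenotypes_dataframe,  # new: one 0/1 column per phenotype
        },
        # ... one entry per specimen
    },
    # ... other top-level entries
}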

CarlinLiao commented 1 year ago

I'm not sure we're aligned on what a sufficient amount of architectural knowledge is, but I'll give it a crack.

jimmymathews commented 1 year ago

For the phenotype definition puller, note that this is substantially implemented already here:

https://github.com/nadeemlab/SPT/blob/9a57ec8d205e5e9561dcad4c55c2284c95d16854/spatialprofilingtoolbox/db/phenotypes.py#L50

Usage example from the querying.py module:

PhenotypesAccess(cursor).get_phenotype_criteria(study, phenotype_symbol)

jimmymathews commented 1 year ago

Tests now pass. Is this mergeable? You could make an issue out of the planned work related to outcome-variable provision, so as not to block merging of this PR (which is just about the major extractions from the db).

CarlinLiao commented 1 year ago

This should be ready to merge, pending tests passing.

The new build no longer errors, but the FME test fails because the DataFrame shape isn't as expected. I suspect this is because the new phenotype functionality adds columns to the expected DataFrame. Can we confirm that Melanoma intralesional IL2, as represented in the tests, has 2 phenotypes?

CarlinLiao commented 1 year ago

Confirmed that the four additional columns are due to the four phenotypes picked up by the accessor. (Whether or not 4 phenotypes is accurate is another question.)

Something I'm noticing in the logs: the channels/features are floats and not binary or true/false values, which I had assumed they were when doing the positive and negative matching against the phenotype signatures. Consequently, every phenotype for every cell is reporting as a 1. What should be done instead?

CarlinLiao commented 1 year ago

Why is the test for db/FeatureMatrixExtractor in test/workflow?

CarlinLiao commented 1 year ago

Perhaps we should stop referring to chemical species / channels as "features", in order to disambiguate between the features being extracted (channels, phenotypes, strata, etc.) and the chemicals. And if we could standardize on either species or channel, that would also be clarifying.

This includes replacing the f'F{i}' prefix with something else, like f'C{i}'.

jimmymathews commented 1 year ago

Tests now pass for me.

CarlinLiao commented 1 year ago

What's the purpose of FeatureMatrixExtractor.redact_dataframes()? @jimmymathews

jimmymathews commented 1 year ago

> What's the purpose of FeatureMatrixExtractor.redact_dataframes()? @jimmymathews

It makes it possible to create a JSON dump of everything else, for inspection.

(It is not essential)

CarlinLiao commented 1 year ago

Ok, I've made a series of changes, each arising from the previous:

  1. I've integrated channel and phenotype names into the cell extract DataFrames.
  2. This slimmed down the extract return packet from four entries to just two, cells and stratification, so
  3. I split the extract function's concern to focus solely on returning cell information, since the stratification result was often unused (as in the case of a single-specimen extraction), making it unnecessary work. Stratification information is now a separate function, extract_cohorts.
  4. Focusing extract gave me the opportunity to be more precise about the return type, so it now returns a dict mapping specimen string names to MatrixBundles, a dataclass with three properties: dataframe, filename, and continuous_dataframe (optional); see the sketch after this list.
  5. (This also made the return results not just dicts, strings, and DataFrames, so I removed redact_dataframes after getting halfway through a replacement implementation that created a wrapping dict for the sole purpose of dumping it. It shouldn't be too hard to bring this functionality back, but I feel like its sole purpose now would just be to print the constructed cell DataFrame filenames.)
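
A sketch of what such a dataclass could look like, with field names as in item 4 above (the trailing comments are assumptions about each field's meaning):

from dataclasses import dataclass
from typing import Optional

from pandas import DataFrame

@dataclass
class MatrixBundle:
    dataframe: DataFrame                              # binary channel/phenotype matrix
    filename: str                                     # constructed filename for this specimen's matrix
    continuous_dataframe: Optional[DataFrame] = None  # continuous values, when requested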

I think this is a really good change for the sake of clarity, but I may not have fixed the tests correctly. I think I caught all downstream changes from the return type, at least.

Next on the docket is adding the proper stratification interpretation to cggnn extract, AKA the original purpose of this PR before I scope-creeped it with all these modifications to FeatureMatrixExtractor.

CarlinLiao commented 1 year ago

I do remember something about dataclasses not playing nice with FastAPI. If so, it should be an easy enough fix to replace the dataclass with a NamedTuple or something similar that can still promise fixed fields (as opposed to dictionaries with mutable keys, which I think we should avoid).

jimmymathews commented 1 year ago

I think dataclasses are fine; the only FastAPI consideration is that FastAPI is extra helpful when the response types are hinted with pydantic types. And query parameters, I guess, also.

CarlinLiao commented 1 year ago

Ok, I don't think FME and FastAPI touch at all either.

CarlinLiao commented 1 year ago

Resolved picking out the strata to classify on by adding some fake interactivity: the cggnn module prints out the entire strata output from FME for the user to observe. From that, they can (manually) pick the strata they want to use by their identifiers and then pass them to the extract command (by typing them one by one after the argument flag).

CarlinLiao commented 1 year ago

unit-test-cggnn and unit-test-workflow pass for me. Are we ready to merge?

jimmymathews commented 1 year ago

I'll do tests and review now.

jimmymathews commented 1 year ago

Test test_autocomputed_squidpy.py fails. I think what is going on is that the feature identifiers minted for these "autocomputed squidpy metrics" are not consistent with the ones that were minted before the changes, for whatever reason. This could happen if new metrics are computed beforehand, using up the identifier namespace, or if some are no longer computed beforehand.

We could relax the test and just ensure that the set of feature vectors (identifiers omitted) is still the same. That would mean making changes here to omit identifiers from the expected row set and the actual row set before testing for equality.
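
A sketch of the relaxed comparison, assuming the minted identifiers sit in a column named identifier (the real column name may differ):

def feature_rows(df):
    # Compare the feature-value rows as a set, with minted identifiers omitted.
    return set(df.drop(columns=['identifier']).itertuples(index=False, name=None))

assert feature_rows(expected) == feature_rows(actual)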

jimmymathews commented 1 year ago

I'm also getting test failure for test_proximity_pipeline.sh:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/scripts/aggregate_core_results.py", line 24, in <module>
    integrator.calculate(**args)
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/phenotype_proximity/integrator.py", line 36, in calculate
    self.export_feature_values(core_computation_results_files, data_analysis_study)
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/phenotype_proximity/integrator.py", line 66, in export_feature_values
    self.send_features_to_uploader(feature_uploader, core_computation_results_files)
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/phenotype_proximity/integrator.py", line 72, in send_features_to_uploader
    stage_proximity_feature_values(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/proximity.py", line 70, in stage_proximity_feature_values
    _phenotype_identifier_lookup(row['Phenotype 1'], channel_symbols_by_column_name),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/proximity.py", line 59, in _phenotype_identifier_lookup
    raise ValueError(f'Did not understand meaning of specifier: {handle}')
ValueError: Did not understand meaning of specifier: C B2M

It seems that the "API" break for FeatureMatrixExtractor (specifically the column-to-name lookup) isn't accounted for.

CarlinLiao commented 1 year ago

Fixing test_proximity_pipeline should have been as simple as fixing some regex, but now it's failing at the same feature-specification area as test_autocomputed_squidpy. Hopefully this'll make it easier for me to debug, since any squidpy tests will auto-fail on my machine because squidpy can't install on my MacBook.

CarlinLiao commented 1 year ago

The latest commit should fix the inconsistently minted feature specifications without relaxing the tests. (The tests that can run on my M1 Mac pass, but that leaves out a lot of tests.)

CarlinLiao commented 1 year ago

Almost all tests pass now, and merging this PR will quash several open issues. The only exceptions are the proximity and squidpy module tests in apiserver, which run but produce values different from those expected. I'm not totally sure why, but this could be a case of updated computations leading to new values rather than the output being wrong.

jimmymathews commented 1 year ago

The tests that check computed feature values (squidpy and proximity) are failing for the first time at the commit that introduces a new way to compute masks; see these lines:

https://github.com/nadeemlab/SPT/blob/f3f6178eef9fb9f321d615b630806e450bbe5a84/spatialprofilingtoolbox/workflow/common/squidpy.py#L39-L47

Most likely these masks differ. I'll try to find out which one is correct, and if the new one is not correct we can correct it.

jimmymathews commented 1 year ago

The masks in this version were wrong according to one example I checked, and the previously computed masks were correct. I think this was due to the .all pandas function being applied to 0 and 1 values. I converted to bool in the previous commit, and the mask in the example I checked is now correct.
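
A toy illustration of how this class of bug can arise when .all is applied to non-boolean columns (the 0.5 cutoff is purely illustrative):

from pandas import DataFrame

intensities = DataFrame({'CD3': [0.02, 0.95], 'CD8': [0.88, 0.01]})
print(intensities.all(axis=1))  # True, True: any nonzero float is truthy

binary = intensities > 0.5      # explicit booleanization first
print(binary.all(axis=1))       # False, False: only genuine co-positives pass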

jimmymathews commented 1 year ago

The squidpy metrics test now passes. Working on proximity metrics test.

jimmymathews commented 1 year ago

All tests pass!