Closed CarlinLiao closed 1 year ago
Will be converted to a mergeable PR once I can get home and push that commit I left on my computer. That's needed for all tests to pass since bokeh was moved from a pip cg-gnn
prerequisite to actually being used in SPT itself.
That said, this PR in its current state has a working cggnn
workflow. However, in order to properly evaluate it, you need to make a slight change to the cggnn_environment.yml
. Instead of allowing it to install spatialprofilingtoolbox[all]
from pip, comment that line out and install a local build created from this branch. Once that's done, this is what the output will look like if you use it with the CyTOF dataset on RDS:
$ spt workflow configure --local --workflow='cggnn' --study-name='Melanoma CyTOF ICI' --database-config-file='../../spt_db.config' --workflow-config-file='workflow.config'
$ ls
configure.sh main.nf nextflow.config run.sh workflow.config
$ nextflow run .
N E X T F L O W ~ version 23.10.0
Launching `./main.nf` [insane_mayer] DSL2 - revision: 856742884f
executor > local (35)
[09/cacb09] process > echo_environment_variables [100%] 1 of 1 ✔
[ca/03dc41] process > prep_graph_creation [100%] 1 of 1 ✔
[96/95dde9] process > create_specimen_graphs (9) [100%] 30 of 30 ✔
[f7/92863f] process > finalize_graphs [100%] 1 of 1 ✔
[2a/22470e] process > train [100%] 1 of 1 ✔
[f1/fd0f17] process > upload_importances [100%] 1 of 1 ✔
Completed at: 16-Nov-2023 16:36:05
Duration : 8m 26s
CPU hours : 0.2
Succeeded : 35
$ ls
configure.sh main.nf nextflow.config results run.sh work workflow.config
$ ls results/
feature_names.txt graph_info.pkl graphs.bin importances.csv model
$ ls results/model/
classification_report.txt model_best_validation_accuracy.pt model_best_validation_loss.pt model_best_validation_weighted_f1_score.pt
Something to note: I originally had the cggnn
workflow environment only install the cggnn
and workflow
extras, but this would error when running spt workflow configure
because running that would require importing the phenotype proximity workflow, which imports FeatureMatrixExtractor
from db
, which would cause an error because it needs db
extras. So, pip install spatialprofilingtoolbox[all]
.
Interesting error from cggnn extract
.
11-16 23:41:22 [ INFO ] db.feature_matrix_extractor: Retrieving expression data from database.
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 71, in <module>
df_cell, df_label, label_to_result = extract_cggnn_data(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/extract.py", line 135, in extract_cggnn_data
for specimen, data in extractor.extract(study=study, retain_structure_id=True).items()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract
extraction = self._extract(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 98, in _extract
data_arrays = self._retrieve_expressions_from_database(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 129, in _retrieve_expressions_from_database
puller.pull(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 233, in pull
self._retrieve_data_arrays(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 253, in _retrieve_data_arrays
self._fill_data_arrays_for_study(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 276, in _fill_data_arrays_for_study
sparse_entries = self._get_sparse_entries(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 355, in _get_sparse_entries
query, parameters = self._get_sparse_matrix_query_specimen_specific(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 381, in _get_sparse_matrix_query_specimen_specific
range_definition = SparseMatrixPuller._retrieve_expressions_range(cursor, specimen)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 395, in _retrieve_expressions_range
cursor.execute(query, (scope,))
psycopg2.errors.UndefinedTable: relation "range_definitions" does not exist
LINE 3: FROM range_definitions
^
The range_definitions
table is the replacement for the previous index, introduced in the last few days.
Most likely your data images are outdated.
I am getting a different error (which may indicate accidental omission of some test data artifact?). During plotting testing:
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/plot_interactives.py", line 39, in <module>
plot_interactives(graphs_data, feature_names, args.output_directory, args.merge_rois)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 47, in plot_interactives
graphs = [_convert_dgl_to_networkx(graph, feature_names) for graph in dgl_graphs]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 47, in <listcomp>
graphs = [_convert_dgl_to_networkx(graph, feature_names) for graph in dgl_graphs]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 122, in _convert_dgl_to_networkx
raise ValueError(
ValueError: Importance scores not yet found. Calculate them and place them in graph.ndata[importance] first.
I know how to fix that one. I updated the test data artifacts for the new cg-gnn/spatialprofilingtoolbox object type split, but forgot to run a trained model on them to produce importance scores. Should be a quick fix.
After rebuilding the data loaded images and replacing the graphs.bin and graph_info.pkl, all tests pass.
I modified the documentation in several places for style conformity.
These changes finalize the transfer of responsibilities from the
cg-gnn
pip package to SPT and refactors the Nextflowcggnn
workflow to create graphs in parallel from every specimen, in the process giving Nextflow more control and insight into thecggnn
run process. In addition, thecggnn
workflow now also saves multiple model points, the graphs created, and other metadata that could be reused.