Parallelize `cggnn` workflow

CarlinLiao commented 1 year ago

These changes finalize the transfer of responsibilities from the cg-gnn pip package to SPT and refactors the Nextflow cggnn workflow to create graphs in parallel from every specimen, in the process giving Nextflow more control and insight into the cggnn run process. In addition, the cggnn workflow now also saves multiple model points, the graphs created, and other metadata that could be reused.

CarlinLiao commented 1 year ago

Will be converted to a mergeable PR once I can get home and push that commit I left on my computer. That's needed for all tests to pass since bokeh was moved from a pip cg-gnn prerequisite to actually being used in SPT itself.

That said, this PR in its current state has a working cggnn workflow. However, in order to properly evaluate it, you need to make a slight change to the cggnn_environment.yml. Instead of allowing it to install spatialprofilingtoolbox[all] from pip, comment that line out and install a local build created from this branch. Once that's done, this is what the output will look like if you use it with the CyTOF dataset on RDS:

$ spt workflow configure --local --workflow='cggnn' --study-name='Melanoma CyTOF ICI' --database-config-file='../../spt_db.config' --workflow-config-file='workflow.config'
$ ls
configure.sh  main.nf  nextflow.config  run.sh  workflow.config
$ nextflow run .
N E X T F L O W  ~  version 23.10.0
Launching `./main.nf` [insane_mayer] DSL2 - revision: 856742884f
executor >  local (35)
[09/cacb09] process > echo_environment_variables [100%] 1 of 1 ✔
[ca/03dc41] process > prep_graph_creation        [100%] 1 of 1 ✔
[96/95dde9] process > create_specimen_graphs (9) [100%] 30 of 30 ✔
[f7/92863f] process > finalize_graphs            [100%] 1 of 1 ✔
[2a/22470e] process > train                      [100%] 1 of 1 ✔
[f1/fd0f17] process > upload_importances         [100%] 1 of 1 ✔
Completed at: 16-Nov-2023 16:36:05
Duration    : 8m 26s
CPU hours   : 0.2
Succeeded   : 35

$ ls
configure.sh  main.nf  nextflow.config  results  run.sh  work  workflow.config
$ ls results/
feature_names.txt  graph_info.pkl  graphs.bin  importances.csv  model
$ ls results/model/
classification_report.txt  model_best_validation_accuracy.pt  model_best_validation_loss.pt  model_best_validation_weighted_f1_score.pt

CarlinLiao commented 1 year ago

Something to note: I originally had the cggnn workflow environment only install the cggnn and workflow extras, but this would error when running spt workflow configure because running that would require importing the phenotype proximity workflow, which imports FeatureMatrixExtractor from db, which would cause an error because it needs db extras. So, pip install spatialprofilingtoolbox[all].

CarlinLiao commented 1 year ago

Interesting error from cggnn extract.

11-16 23:41:22 [  INFO   ] db.feature_matrix_extractor: Retrieving expression data from database.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 71, in <module>
    df_cell, df_label, label_to_result = extract_cggnn_data(
                                         ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/extract.py", line 135, in extract_cggnn_data
    for specimen, data in extractor.extract(study=study, retain_structure_id=True).items()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract
    extraction = self._extract(
                 ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 98, in _extract
    data_arrays = self._retrieve_expressions_from_database(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 129, in _retrieve_expressions_from_database
    puller.pull(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 233, in pull
    self._retrieve_data_arrays(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 253, in _retrieve_data_arrays
    self._fill_data_arrays_for_study(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 276, in _fill_data_arrays_for_study
    sparse_entries = self._get_sparse_entries(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 355, in _get_sparse_entries
    query, parameters = self._get_sparse_matrix_query_specimen_specific(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 381, in _get_sparse_matrix_query_specimen_specific
    range_definition = SparseMatrixPuller._retrieve_expressions_range(cursor, specimen)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 395, in _retrieve_expressions_range
    cursor.execute(query, (scope,))
psycopg2.errors.UndefinedTable: relation "range_definitions" does not exist
LINE 3:         FROM range_definitions
                     ^

jimmymathews commented 1 year ago

The range_definitions table is the replacement for the previous index, introduced in the last few days. Most likely your data images are outdated.

I am getting a different error (which may indicate accidental omission of some test data artifact?). During plotting testing:

  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/plot_interactives.py", line 39, in <module>
    plot_interactives(graphs_data, feature_names, args.output_directory, args.merge_rois)
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 47, in plot_interactives
    graphs = [_convert_dgl_to_networkx(graph, feature_names) for graph in dgl_graphs]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 47, in <listcomp>
    graphs = [_convert_dgl_to_networkx(graph, feature_names) for graph in dgl_graphs]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 122, in _convert_dgl_to_networkx
    raise ValueError(
ValueError: Importance scores not yet found. Calculate them and place them in graph.ndata[importance] first.

CarlinLiao commented 1 year ago

I know how to fix that one. I updated the test data artifacts for the new cg-gnn/spatialprofilingtoolbox object type split, but forgot to run a trained model on them to produce importance scores. Should be a quick fix.

CarlinLiao commented 1 year ago

After rebuilding the data loaded images and replacing the graphs.bin and graph_info.pkl, all tests pass.

jimmymathews commented 1 year ago

I modified the documentation in several places for style conformity.

nadeemlab / SPT

Parallelize `cggnn` workflow #248