cggnn workflow Nextflow file fails to handle target names with spaces

nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images

https://oncopathtk.org

Other

21 stars 2 forks source link

cggnn workflow Nextflow file fails to handle target names with spaces #241

Closed CarlinLiao closed 1 year ago

CarlinLiao commented 1 year ago

We're back to the issue I had earlier. The change to cggnn.nf errors when handling channel/phenotype name target inputs with spaces in them.

Caused by:
  Process `run_cggnn` terminated with an error exit status (2)

Command executed:

  #!/bin/bash

  strata_option=$( if [[ "1 3" != "all" ]]; then echo "--strata 1 3"; fi)
  disable_channels_option=$( if [[ "false" == "true" ]]; then echo "--disable_channels"; fi)
  disable_phenotypes_option=$( if [[ "false" = "true" ]]; then echo "--disable_phenotypes"; fi)
  target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name 'P Tumor'"; fi)
  in_ram_option=$( if [[ "true" == "true" ]]; then echo "--in_ram"; fi)
  merge_rois_option=$( if [[ "true" == "true" ]]; then echo "--merge_rois"; fi)
  prune_misclassified_option=$( if [[ "false" == "true" ]]; then echo "--prune_misclassified"; fi)
  upload_importances_option=$( if [[ "false" == "true" ]]; then echo "--upload_importances"; fi)

  spt cggnn run      --spt_db_config_location 'spt_db.config'      --study 'Melanoma intralesional IL2'      ${strata_option}      --validation_data_percent 15      --test_data_percent 0      ${disable_channels_option}      ${disable_phenotypes_option}      --cells_per_slide_target 5000      ${target_name_option}      ${in_ram_option}      --batch_size 1      --epochs 5      --learning_rate '1e-3'      --k_folds 0      --explainer_model 'pp'      ${merge_rois_option}      ${prune_misclassified_option}      --output_prefix 'miil2'      ${upload_importances_option}

Command exit status:
  2

Command output:
  (empty)

Command error:
  usage: spt cggnn run [-h] --spt_db_config_location SPT_DB_CONFIG_LOCATION --study STUDY [--strata STRATA [STRATA ...]]
                       [--validation_data_percent VALIDATION_DATA_PERCENT] [--test_data_percent TEST_DATA_PERCENT] [--disable_channels]
                       [--disable_phenotypes] [--roi_side_length ROI_SIDE_LENGTH] [--cells_per_slide_target CELLS_PER_SLIDE_TARGET]
                       [--target_name TARGET_NAME] [--in_ram] [-b BATCH_SIZE] [--epochs EPOCHS] [-l LEARNING_RATE] [-k K_FOLDS]
                       [--explainer_model EXPLAINER_MODEL] [--merge_rois] [--prune_misclassified] [--output_prefix OUTPUT_PREFIX]
                       [--upload_importances] [--random_seed RANDOM_SEED]
  spt cggnn run: error: unrecognized arguments: Tumor'

The easiest solution would be to return to the script and eval version of the command I wrote earlier, but then we'd have a mismatch between the cggnn Nextflow file and the other ones.

CarlinLiao commented 1 year ago

That said, a run I queued up yesterday using a pre-merge branch of SPT on the Melanoma IL2 dataset both took longer than I expected and errored further downstream. Does this have to do with the removal of indexing you mentioned yesterday @jimmymathews?

Command executed:

  run="spt cggnn run "
  run+="--spt_db_config_location 'spt_db.config' "
  run+="--study 'Melanoma intralesional IL2' "
  run+=$(if [[ "1 3" != all ]]; then echo "--strata 1 3 "; fi)
  run+="--validation_data_percent 15 "
  run+="--test_data_percent 0 "
  run+=$(if [[ "false" = true ]]; then echo "--disable_channels "; fi)
  run+=$(if [[ "false" = true ]]; then echo "--disable_phenotypes "; fi)
  run+="--cells_per_slide_target 5000 "
  run+=$(if [[ "P Tumor" != none ]]; then echo '--target_name "P Tumor" ' ; fi)
  run+=$(if [[ "true" = true ]]; then echo "--in_ram "; fi)     run+="--batch_size 1 "
  run+="--epochs 5 "
  run+="--learning_rate 1e-3 "
  run+="--k_folds 0 "
  run+="--explainer_model "pp" "
  run+=$(if [[ "true" = true ]]; then echo "--merge_rois "; fi)
  run+=$(if [[ "false" = true ]]; then echo "--prune_misclassified "; fi)
  run+="--output_prefix "miil2" "
  run+=$(if [[ "false" = true ]]; then echo "--upload_importances "; fi)
  eval $run

Command exit status:
  1

Command output:
  (empty)

Command error:
  [34m11-07 19:48:34 [0m[35m[  [0mDEBUG[0m[35m  ] workflow.common.structure_centroids_puller:[0m[34m96[35m: [0mReceived 100000 shapefiles entries from DB.
  [34m11-07 19:48:34 [0m[35m[  [0mDEBUG[0m[35m  ] workflow.common.structure_centroids_puller:[0m[34m96[35m: [0mReceived 98336 shapefiles entries from DB.
  [34m11-07 19:48:44 [0m[35m[  [0m[32;1mINFO[0m[35m   ] db.feature_matrix_extractor: [0mDone retrieving centroids.
  [34m11-07 19:48:45 [0m[35m[  [0m[32;1mINFO[0m[35m   ] db.feature_matrix_extractor: [0mRetrieving phenotypes from database.
  [34m11-07 19:48:46 [0m[35m[  [0m[32;1mINFO[0m[35m   ] db.feature_matrix_extractor: [0mDone retrieving phenotypes.
  [34m11-07 19:48:46 [0m[35m[  [0m[32;1mINFO[0m[35m   ] db.feature_matrix_extractor: [0mAggregating channel information for one study.
  [34m11-07 19:48:46 [0m[35m[  [0m[32;1mINFO[0m[35m   ] db.feature_matrix_extractor: [0mDone aggregating channel information.
  [34m11-07 19:48:46 [0m[35m[  [0m[32;1mINFO[0m[35m   ] db.feature_matrix_extractor: [0mCreating feature matrices from binary data arrays and centroids.
  [34m11-07 19:48:46 [0m[35m[  [0mDEBUG[0m[35m  ] db.feature_matrix_extractor:[0m[34m180[35m: [0mSpecimen lesion 0_1 .
  Traceback (most recent call last):
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 934, in _finalize_columns_and_data
      columns = _validate_or_indexify_columns(contents, columns)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 981, in _validate_or_indexify_columns
      raise AssertionError(
  AssertionError: 28 columns passed, passed data had 54 columns

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 184, in <module>
      df_cell, df_label, label_to_result = extract_cggnn_data(
                                           ^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 130, in extract_cggnn_data
      df_cell = _create_cell_df({
                                ^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 131, in <dictcomp>
      specimen: extractor.extract(specimen=specimen, retain_structure_id=True)[specimen].dataframe
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract
      extraction = self._extract(
                   ^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 113, in _extract
      return self._create_feature_matrices(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 193, in _create_feature_matrices
      dataframe = DataFrame(
                  ^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/frame.py", line 782, in __init__
      arrays, columns, index = nested_data_to_arrays(
                               ^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 498, in nested_data_to_arrays
      arrays, columns = to_arrays(data, columns, dtype=dtype)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 840, in to_arrays
      content, columns = _finalize_columns_and_data(arr, columns, dtype)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 937, in _finalize_columns_and_data
      raise ValueError(err) from err
  ValueError: 28 columns passed, passed data had 54 columns

CarlinLiao commented 1 year ago

We should change the cggnn workflow test to have a non-null --target_variable value. Since the test dataset is so small, it should be a channel or phenotype that is almost fully or fully represented in the test specimens, instead of P Tumor like it usually is. If I recall correctly, one of the test slides had only a single cell with P Tumor, which made it too small to create any ROIs/patches from.

jimmymathews commented 1 year ago

The traceback shows that the optional arguments are going through one last round of variable expansion by bash; the command seen just before being passed to bash interpretation of the line is:

spt cgnn run ... ${target_name_option} ...

Unfortunately single quote delimitation inside variables does not survive the entire process to end up as command line arguments to executables (even though the variable itself does correctly serve up the single quotes in the string). This issue is easily reproducible:

function f() { echo "Received first argument: $1."; echo "Received second argument: $2."; }
arg=$( echo "value='my val'" )

echo $arg

value='my val'

f $arg

Received first argument: value='my.
Received second argument: val'.

The sequence of strings that are provided as CLI arguments is determined by all the expansion steps performed to arrive at the line's contents. Bash is not able to use single quote delimiters after doing the variable expansion (it uses these delimiters first).

Double quotes are saved for last, however. So we can save this example by double quoting at the last moment to provide the hint about what we want to stay together as a single argument:

f "$arg"

Received first argument: value='my val'.
Received second argument: .

So we need to get it to say:

spt cgnn run ... "${target_name_option}" ...

CarlinLiao commented 1 year ago

I think I was able to circumvent this process by using eval, which from my experience treats the variable as a string that's printed to console, so that the '"P' 'Tumor"' ends up resolving to "P Tumor" before being interpreted as a command. Does that match your understanding, and if so would that be an amenable solution?

jimmymathews commented 1 year ago

No, in my opinion eval is a poor solution. It is prone to many bugs. If there are slight errors in the command, it wlll be "masked". This is just the general practice/pattern not to call code as raw strings whenever possible. Did you read my comment? The fix is very simple.

CarlinLiao commented 1 year ago

I'm not sure I understood it. Or, rather, I understand that we simply need to get it to say

spt cgnn run ... "${target_name_option}" ...

but I don't understand if we can get there in the Nextflow script and if so how. On Monday I was trying all manner of single, double, triple, and quadruple escaping quotation marks to get at a command line call that preserved the double quotation marks without adding new single quotation marks, to no avail.

CarlinLiao commented 1 year ago

Working through several permutations in sequence

    script:
    """
    #!/bin/bash
    set -x
    target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name '${target_name}'"; fi)
    spt cggnn run \
     "\${target_name_option}"
    """

Gives

Command executed:

  #!/bin/bash
  set -x
  target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name 'P Tumor'"; fi)
  spt cggnn run      "${target_name_option}"

Command exit status:
  2

Command output:
  (empty)

Command error:
  ++ [[ P Tumor != \n\o\n\e ]]
  ++ echo '--target_name '\''P Tumor'\'''
  + target_name_option='--target_name '\''P Tumor'\'''
  + spt cggnn run '--target_name '\''P Tumor'\'''

From here on I'll call out only the lines changed in each run sequence.

Nextflow: spt cggnn run \"\${target_name_option}\"
Executed: spt cggnn run "${target_name_option}"
set -x:   spt cggnn run '--target_name '\''P Tumor'\'''

Nextflow: spt cggnn run \\\"\${target_name_option}\\\"
Executed: spt cggnn run \"${target_name_option}\"
set-x:   spt cggnn run '"--target_name' ''\''P' 'Tumor'\''"'

It would appear that even when I am able to Nextflow to assemble the command as

spt cgnn run ... "${target_name_option}" ...

The actual target_name_option is processed to be surrounded by single quotation marks.

A few more for completeness:

Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name "${target_name}""; fi)
          spt cggnn run \"\${target_name_option}\"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name "P Tumor""; fi)
          spt cggnn run "${target_name_option}"
set -x:   target_name_option='--target_name P Tumor'
          spt cggnn run '--target_name P Tumor'

Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \"${target_name}\""; fi)
          spt cggnn run \"\${target_name_option}\"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name "P Tumor""; fi)
          spt cggnn run "${target_name_option}"
set -x:   target_name_option='--target_name P Tumor'
          spt cggnn run '--target_name P Tumor'

Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \\"${target_name}\\""; fi)
          spt cggnn run \"\${target_name_option}\"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name \"P Tumor\""; fi)
          spt cggnn run "${target_name_option}"
set -x:   target_name_option='--target_name "P Tumor"'
          spt cggnn run '--target_name "P Tumor"'

Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \\"${target_name}\\""; fi)
          spt cggnn run \${target_name_option}
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name \"P Tumor\""; fi)
          spt cggnn run ${target_name_option}
set -x:   target_name_option='--target_name "P Tumor"'
          spt cggnn run --target_name '"P' 'Tumor"'

Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \\"${target_name}\\""; fi)
          spt cggnn run "\${target_name_option}"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name \"P Tumor\""; fi)
          spt cggnn run "${target_name_option}"
set -x:   target_name_option='--target_name "P Tumor"'
          spt cggnn run '--target_name "P Tumor"'

jimmymathews commented 1 year ago

What branch should I look in? There is no issue241 branch.

CarlinLiao commented 1 year ago

There is no branch; if anything, we branch off main. These results are from me experimenting by editing a Nextflow file generated by spt workflow configure.

jimmymathews commented 1 year ago

bash is still eventually used as the shell/interpreter. We can check on bash behavior fairly reliably in local tests. So you could try to reproduce this in a simple setup (along the lines of my comment above), until it works in the simple setup. Then you can try to reproduce the -- working -- simple setup in the NF script. I recommend this over arbitrary string delimiter attempts.

Another consideration is that the script block is technically a Groovy string that is then passed somehow to bash later.

jimmymathews commented 1 year ago

Very likely there is a subtlety regarding the shell expansion orders: https://www.gnu.org/software/bash/manual/bash.html#Shell-Expansions

Reading this now.

jimmymathews commented 1 year ago

There is a seemingly pretty good answer to a question here on stackexchange which is I think about our situation of trying to get bash to retain quoting for a while to prepare a statement: https://superuser.com/questions/1529226/get-bash-to-respect-quotes-when-word-splitting-subshell-output

(The answer includes an admonishment against eval...)

jimmymathews commented 1 year ago

The version I just pushed to issue241 implements that answer's suggestion to use xargs. The quoting is a bit elaborate (so, not as easy as I thought; I thought it would just be double quotes in the last call). But it seems to work and avoids evil eval.

jimmymathews commented 1 year ago

Still testing in the case of non-trivial target name.

jimmymathews commented 1 year ago

I think the issue is resolved. I altered spt cggnn run to parse the target name option no matter what, interpreting "none", pre-empting the need for conditional behavior in crafting the command line call in this case.

I tested with a different value of target_name, and the test still fails but the command line crafting portion seems to succeed.