Closed CarlinLiao closed 1 year ago
That said, a run I queued up yesterday using a pre-merge branch of SPT on the Melanoma IL2 dataset both took longer than I expected and errored further downstream. Does this have to do with the removal of indexing you mentioned yesterday @jimmymathews?
Command executed:
run="spt cggnn run "
run+="--spt_db_config_location 'spt_db.config' "
run+="--study 'Melanoma intralesional IL2' "
run+=$(if [[ "1 3" != all ]]; then echo "--strata 1 3 "; fi)
run+="--validation_data_percent 15 "
run+="--test_data_percent 0 "
run+=$(if [[ "false" = true ]]; then echo "--disable_channels "; fi)
run+=$(if [[ "false" = true ]]; then echo "--disable_phenotypes "; fi)
run+="--cells_per_slide_target 5000 "
run+=$(if [[ "P Tumor" != none ]]; then echo '--target_name "P Tumor" ' ; fi)
run+=$(if [[ "true" = true ]]; then echo "--in_ram "; fi)
run+="--batch_size 1 "
run+="--epochs 5 "
run+="--learning_rate 1e-3 "
run+="--k_folds 0 "
run+="--explainer_model "pp" "
run+=$(if [[ "true" = true ]]; then echo "--merge_rois "; fi)
run+=$(if [[ "false" = true ]]; then echo "--prune_misclassified "; fi)
run+="--output_prefix "miil2" "
run+=$(if [[ "false" = true ]]; then echo "--upload_importances "; fi)
eval $run
Command exit status:
1
Command output:
(empty)
Command error:
11-07 19:48:34 [ DEBUG ] workflow.common.structure_centroids_puller:96: Received 100000 shapefiles entries from DB.
11-07 19:48:34 [ DEBUG ] workflow.common.structure_centroids_puller:96: Received 98336 shapefiles entries from DB.
11-07 19:48:44 [ INFO ] db.feature_matrix_extractor: Done retrieving centroids.
11-07 19:48:45 [ INFO ] db.feature_matrix_extractor: Retrieving phenotypes from database.
11-07 19:48:46 [ INFO ] db.feature_matrix_extractor: Done retrieving phenotypes.
11-07 19:48:46 [ INFO ] db.feature_matrix_extractor: Aggregating channel information for one study.
11-07 19:48:46 [ INFO ] db.feature_matrix_extractor: Done aggregating channel information.
11-07 19:48:46 [ INFO ] db.feature_matrix_extractor: Creating feature matrices from binary data arrays and centroids.
11-07 19:48:46 [ DEBUG ] db.feature_matrix_extractor:180: Specimen lesion 0_1 .
Traceback (most recent call last):
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 934, in _finalize_columns_and_data
columns = _validate_or_indexify_columns(contents, columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 981, in _validate_or_indexify_columns
raise AssertionError(
AssertionError: 28 columns passed, passed data had 54 columns
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 184, in <module>
df_cell, df_label, label_to_result = extract_cggnn_data(
^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 130, in extract_cggnn_data
df_cell = _create_cell_df({
^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 131, in <dictcomp>
specimen: extractor.extract(specimen=specimen, retain_structure_id=True)[specimen].dataframe
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract
extraction = self._extract(
^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 113, in _extract
return self._create_feature_matrices(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 193, in _create_feature_matrices
dataframe = DataFrame(
^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/frame.py", line 782, in __init__
arrays, columns, index = nested_data_to_arrays(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 498, in nested_data_to_arrays
arrays, columns = to_arrays(data, columns, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 840, in to_arrays
content, columns = _finalize_columns_and_data(arr, columns, dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 937, in _finalize_columns_and_data
raise ValueError(err) from err
ValueError: 28 columns passed, passed data had 54 columns
We should change the cggnn workflow test to have a non-null --target_variable value. Since the test dataset is so small, it should be a channel or phenotype that is almost fully or fully represented in the test specimens, instead of P Tumor like it usually is. If I recall correctly, one of the test slides had only a single cell with P Tumor, which made it too small to create any ROIs/patches from.
The traceback shows that the optional arguments go through one last round of variable expansion by bash. Just before bash interprets the line, the command is:
spt cggnn run ... ${target_name_option} ...
Unfortunately, single quote delimiters inside a variable's value do not survive this process to delimit the command line arguments passed to the executable (even though the variable itself does correctly contain the single quotes). This issue is easily reproducible:
function f() { echo "Received first argument: $1."; echo "Received second argument: $2."; }
arg=$( echo "value='my val'" )
echo $arg
value='my val'
f $arg
Received first argument: value='my.
Received second argument: val'.
The sequence of strings provided as CLI arguments is determined by all the expansion steps performed to arrive at the line's contents. Bash recognizes single quote delimiters only when it first parses the line, before variable expansion; quote characters that appear as the result of an expansion are treated as literal text.
Double quotes written around the expansion itself are honored, however, because they are part of the line as parsed. So we can save this example by double quoting at the last moment to provide the hint about what we want to stay together as a single argument:
f "$arg"
Received first argument: value='my val'.
Received second argument: .
So we need to get it to say:
spt cggnn run ... "${target_name_option}" ...
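One way to keep an optional flag and its value as separate arguments, without relying on quote characters stored inside a string, is a bash array. This is not what was tried in the thread; it is a minimal sketch assuming bash is the interpreter. show_args is a hypothetical stand-in for the real command that only prints what it receives, and the hard-coded target_name stands in for the value Nextflow would interpolate.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real CLI: prints each received argument in brackets.
show_args() { printf '[%s]' "$@"; printf '\n'; }

target_name="P Tumor"   # assumed example value; "none" means omit the option

# Build the optional arguments as an array, one element per CLI argument.
target_name_option=()
if [[ "${target_name}" != "none" ]]; then
    target_name_option=(--target_name "${target_name}")
fi

# "${array[@]}" expands to exactly one word per element, with spaces preserved.
show_args run "${target_name_option[@]}"
```

As written this prints [run][--target_name][P Tumor]; with target_name set to none it prints [run]. Inside a Nextflow script block the bash-side dollar signs would presumably still need a backslash, as in the other snippets in this thread.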
I think I was able to circumvent this process by using eval, which from my experience treats the variable as a string that's printed to console, so that the '"P' 'Tumor"' ends up resolving to "P Tumor" before being interpreted as a command. Does that match your understanding, and if so would that be an acceptable solution?
No, in my opinion eval is a poor solution. It is prone to many bugs. If there are slight errors in the command, they will be "masked". The general practice is to avoid running code as raw strings whenever possible. Did you read my comment? The fix is very simple.
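To illustrate the concern about eval, here is a small sketch of how a value that happens to contain shell syntax is re-parsed as code. The target name below is a made-up worst case, not a real phenotype, and show_args is a hypothetical stand-in for the real command.

```bash
# Hypothetical stand-in for the real CLI: prints each received argument in brackets.
show_args() { printf '[%s]' "$@"; printf '\n'; }

# Made-up target name that contains shell syntax.
target_name='P Tumor"; echo INJECTED; echo "'

# eval re-parses the assembled string as code, so the embedded command runs.
cmd="show_args --target_name \"${target_name}\""
eval_output=$(eval "$cmd")

# Passing the value as a quoted argument keeps it as inert data.
direct_output=$(show_args --target_name "${target_name}")

printf 'eval:   %s\n' "$eval_output"
printf 'direct: %s\n' "$direct_output"
```

With eval, the output includes an extra INJECTED line because the text after the first closing quote is executed; the direct call passes the whole string through as a single argument.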
I'm not sure I understood it. Or, rather, I understand that we simply need to get it to say
spt cggnn run ... "${target_name_option}" ...
but I don't understand if we can get there in the Nextflow script and if so how. On Monday I was trying all manner of single, double, triple, and quadruple escaping quotation marks to get at a command line call that preserved the double quotation marks without adding new single quotation marks, to no avail.
Working through several permutations in sequence
script:
"""
#!/bin/bash
set -x
target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name '${target_name}'"; fi)
spt cggnn run \
"\${target_name_option}"
"""
Gives
Command executed:
#!/bin/bash
set -x
target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name 'P Tumor'"; fi)
spt cggnn run "${target_name_option}"
Command exit status:
2
Command output:
(empty)
Command error:
++ [[ P Tumor != \n\o\n\e ]]
++ echo '--target_name '\''P Tumor'\'''
+ target_name_option='--target_name '\''P Tumor'\'''
+ spt cggnn run '--target_name '\''P Tumor'\'''
From here on I'll call out only the lines changed in each run sequence.
Nextflow: spt cggnn run \"\${target_name_option}\"
Executed: spt cggnn run "${target_name_option}"
set -x: spt cggnn run '--target_name '\''P Tumor'\'''
Nextflow: spt cggnn run \\\"\${target_name_option}\\\"
Executed: spt cggnn run \"${target_name_option}\"
set -x: spt cggnn run '"--target_name' ''\''P' 'Tumor'\''"'
It would appear that even when I am able to get Nextflow to assemble the command as
spt cggnn run ... "${target_name_option}" ...
the actual target_name_option is processed to be surrounded by single quotation marks.
A few more for completeness:
Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name "${target_name}""; fi)
spt cggnn run \"\${target_name_option}\"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name "P Tumor""; fi)
spt cggnn run "${target_name_option}"
set -x: target_name_option='--target_name P Tumor'
spt cggnn run '--target_name P Tumor'
Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \"${target_name}\""; fi)
spt cggnn run \"\${target_name_option}\"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name "P Tumor""; fi)
spt cggnn run "${target_name_option}"
set -x: target_name_option='--target_name P Tumor'
spt cggnn run '--target_name P Tumor'
Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \\"${target_name}\\""; fi)
spt cggnn run \"\${target_name_option}\"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name \"P Tumor\""; fi)
spt cggnn run "${target_name_option}"
set -x: target_name_option='--target_name "P Tumor"'
spt cggnn run '--target_name "P Tumor"'
Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \\"${target_name}\\""; fi)
spt cggnn run \${target_name_option}
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name \"P Tumor\""; fi)
spt cggnn run ${target_name_option}
set -x: target_name_option='--target_name "P Tumor"'
spt cggnn run --target_name '"P' 'Tumor"'
Nextflow: target_name_option=\$( if [[ "${target_name}" != "none" ]]; then echo "--target_name \\"${target_name}\\""; fi)
spt cggnn run "\${target_name_option}"
Executed: target_name_option=$( if [[ "P Tumor" != "none" ]]; then echo "--target_name \"P Tumor\""; fi)
spt cggnn run "${target_name_option}"
set -x: target_name_option='--target_name "P Tumor"'
spt cggnn run '--target_name "P Tumor"'
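The last two outcomes can be reproduced in plain bash, outside Nextflow, which may help when reading the set -x traces. A minimal sketch, with show_args as a hypothetical stand-in for spt cggnn run that prints each argument it receives in brackets:

```bash
# Hypothetical stand-in for the real CLI: prints each received argument in brackets.
show_args() { printf '[%s]' "$@"; printf '\n'; }

# Same value as in the traces above, with the double quotes stored in the string.
target_name_option='--target_name "P Tumor"'

# Quoted expansion: the whole string arrives as a single argument.
quoted=$(show_args "${target_name_option}")

# Unquoted expansion: the value is split on whitespace, and the embedded
# double quotes are ordinary characters at this point, not delimiters.
unquoted=$(show_args ${target_name_option})

printf 'quoted:   %s\n' "$quoted"
printf 'unquoted: %s\n' "$unquoted"
```

The quoted form yields one argument, [--target_name "P Tumor"], and the unquoted form yields three, [--target_name]["P][Tumor"]. Neither is the two-argument form the command needs, which is consistent with the traces. The single quotes that set -x prints are only its way of displaying each argument, not characters added to it.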
What branch should I look in? There is no issue241 branch.
There is no branch; if anything, we branch off main. These results are from me experimenting by editing a Nextflow file generated by spt workflow configure.
bash is still eventually used as the shell/interpreter. We can check on bash behavior fairly reliably in local tests. So you could try to reproduce this in a simple setup (along the lines of my comment above), until it works in the simple setup. Then you can try to reproduce the -- working -- simple setup in the NF script. I recommend this over arbitrary string delimiter attempts.
Another consideration is that the script block is technically a Groovy string that is then passed somehow to bash later.
Very likely there is a subtlety regarding the shell expansion orders: https://www.gnu.org/software/bash/manual/bash.html#Shell-Expansions
Reading this now.
There is a seemingly pretty good answer on Stack Exchange to a question that I think matches our situation of trying to get bash to retain quoting while preparing a statement: https://superuser.com/questions/1529226/get-bash-to-respect-quotes-when-word-splitting-subshell-output
(The answer includes an admonishment against eval...)
The version I just pushed to issue241 implements that answer's suggestion to use xargs. The quoting is a bit elaborate (so, not as easy as I thought; I thought it would just be double quotes in the last call). But it seems to work and avoids evil eval.
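For readers without access to that branch, the general idea behind the xargs approach can be sketched as follows. This is not the code that was pushed, only an illustration: xargs parses quotes in its input when splitting it into arguments, so the quoting stored in the string is honored. printf stands in for the real command here.

```bash
# Option string with the quoting stored inside the value, as in the thread.
target_name_option="--target_name 'P Tumor'"

# xargs reads the string, honors the single quotes while splitting, and
# passes the resulting words as separate arguments to the command
# (printf here, which prints one bracketed argument per line).
printf '%s\n' "$target_name_option" | xargs printf '[%s]\n'
```

This prints [--target_name] and [P Tumor] on separate lines, that is, two arguments with the space preserved. In the workflow the command after xargs would be the real CLI call rather than printf.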
Still testing in the case of non-trivial target name.
I think the issue is resolved. I altered spt cggnn run to parse the target name option no matter what, interpreting "none", pre-empting the need for conditional behavior in crafting the command line call in this case.
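The effect of that change on the calling side can be sketched in shell. fake_cggnn_run below is a hypothetical stand-in that only mimics the described behavior (always accept the option, treat "none" as no target); the real handling lives inside spt cggnn run and is not shown in this thread.

```bash
# Hypothetical stand-in for the CLI-side handling described above.
fake_cggnn_run() {
    target=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --target_name)
                target="${2-}"
                shift
                if [ "$#" -gt 0 ]; then shift; fi
                ;;
            *)
                shift
                ;;
        esac
    done
    if [ "$target" = "none" ]; then
        echo "no target"
    else
        echo "target: $target"
    fi
}

# The caller always passes the option, so no conditional is needed.
fake_cggnn_run --target_name "none"
fake_cggnn_run --target_name "P Tumor"
```

The caller's only remaining job is to double quote the value so that a name with a space stays a single argument.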
I tested with a different value of target_name, and the test still fails but the command line crafting portion seems to succeed.
We're back to the issue I had earlier. The change to cggnn.nf errors when handling channel/phenotype name target inputs with spaces in them. The easiest solution would be to return to the script and eval version of the command I wrote earlier, but then we'd have a mismatch between the cggnn Nextflow file and the other ones.