Closed CarlinLiao closed 1 year ago
Added a test, so this will be good to merge once we can get it to pass.
The strange thing is, during the nextflow run, it errors and complains that torch is missing, even though the workflow Docker image is now built off of a torch image. Do the nextflow instances not run on the workflow Docker image? The other nextflow configurations don't need any special installations!
┃ cggnn ┈┈┈
(Started recording Docker compose logs for this test.)
process.executor = 'local'
env.workflow_ = 'cggnn'
env.db_config_file_ = '../db/.spt_db.config.container'
env.study_name_ = 'Melanoma intralesional IL2'
env.strata_ = '[5 7]'
env.validation_data_percent_ = '15'
env.test_data_percent_ = '0'
env.disable_channels_ = 'false'
env.disable_phenotypes_ = 'false'
env.cells_per_slide_target_ = '5000'
env.target_name_ = 'P Tumor'
env.in_ram_ = 'true'
env.batch_size_ = '1'
env.epochs_ = '5'
env.learning_rate_ = '1e-3'
env.k_folds_ = '0'
env.explainer_model_ = 'pp'
env.merge_rois_ = 'true'
env.prune_misclassified_ = 'false'
env.output_prefix_ = 'mi2'
env.upload_importances_ = 'false'
N E X T F L O W ~ version 23.10.0
Launching `./main.nf` [fervent_galileo] DSL2 - revision: fd0ce3c646
executor > local (2)
[07/45f395] process > echo_environment_variables [100%] 1 of 1 ✔
[b6/ea7bff] process > run_cggnn [ 0%] 0 of 1
executor > local (2)
[07/45f395] process > echo_environment_variables [100%] 1 of 1 ✔
[b6/ea7bff] process > run_cggnn [100%] 1 of 1, failed: 1 ✘
ERROR ~ Error executing process > 'run_cggnn'
Caused by:
Process `run_cggnn` terminated with an error exit status (1)
Command executed:
spt cggnn run --spt_db_config_location=".spt_db.config.container" --study="Melanoma intralesional IL2" --strata=[5 7] --validation-data-percent=15 --test-data-percent=0 $(if [[ "false" = true ]]; then echo "--disable-channels"; fi) $(if [[ "false" = true ]]; then echo "--disable_phenotypes"; fi) --cells-per-slide-target=5000 --target-name="P Tumor" $(if [[ "true" = true ]]; then echo "--in_ram"; fi) --batch-size=1 --epochs=5 --learning-rate=1e-3 --k-folds=0 --explainer-model="pp" $(if [[ "true" = true ]]; then echo "--merge_rois"; fi) $(if [[ "false" = true ]]; then echo "--prune_misclassified"; fi) --output-prefix="mi2" $(if [[ "false" = true ]]; then echo "--upload_importances"; fi)
Command exit status:
1
Command output:
Got a module not found error.
Did you install the required extras with:
pip install "spatialprofilingtoolbox[cggnn]"
?
Command error:
Got a module not found error.
Did you install the required extras with:
pip install "spatialprofilingtoolbox[cggnn]"
?
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 15, in <module>
SuggestExtrasException(e, 'cggnn')
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/standalone_utilities/module_load_error.py", line 7, in __init__
raise module_not_found_error
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 12, in <module>
from cggnn.run import run
File "/usr/local/lib/python3.11/dist-packages/cggnn/__init__.py", line 3, in <module>
from cggnn.generate_graphs import generate_graphs
File "/usr/local/lib/python3.11/dist-packages/cggnn/generate_graphs.py", line 9, in <module>
from torch import Tensor, FloatTensor, IntTensor # pylint: disable=no-name-in-module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'torch'
Work dir:
/mount_sources/workflow/work/b6/ea7bffd8f25b843fd0685b32209536
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
-- Check '.nextflow.log' file for details
head: cannot open 'results/mi2_importances.csv' for reading: No such file or directory
Error during nextflow run.
Docker compose logs:
no such service: testing-api-server
no such service: testing-fast-counts-server
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈ Failed. (48s)
I've made a few fixes that I'll explain below, but I need help with one specific issue.
The cggnn workflow test also doesn't pass because, when running the cg-gnn pipeline using Nextflow, it throws this error: ValueError: Specimen "lesion 0_2" not found in study "Melanoma intralesional IL2 - measurement".
After inspection, this is because the workflow uses test dataset 3, AKA spt-db-preloaded-1small, which has a samples.tsv that contains 7 specimens even though its file_manifest.tsv only contains 2. I've removed all but the 2 specimens in the file_manifest.tsv from test dataset 3, including lesion 0_2, but it's still throwing the same error. I think I've fully cleaned out my cache, so I'm at a loss as to why it's still looking for lesion 0_2.
As we discussed in person, not all of the Docker images we specified were still being used, and there was some confusion on my end about which images were being used for what. In particular, I had to learn that the workflow image wasn't being used for workflow tests; the general development image was. Given that, I was able to solve the issue: the general development Docker container didn't have torch and DGL installed, which I rectified.
Pursuing this thread, I removed two Dockerfiles we concluded weren't being used: workflow and cggnn. Removing the former didn't break any tests, but removing the latter did; upon further inspection, the cggnn image was being used by the cggnn tests. I've chosen to change those tests so that they run using the general development container instead of the cggnn-specific one, since the two are basically the same now that the development image needs the (very large and heavy) pytorch and DGL libraries installed, although I can also see why we'd prefer to keep the cggnn image separate.
When the log level is VERBOSE and we're uploading to the feature specification table, the debug log records two lines for every single line uploaded. I don't think reporting on each addition to the table is necessary.
Some test-related files are in build and not test, e.g., the expected_counts tables. I'd expect those to live in the test folder, with build only containing the subset of files needed to compile SPT or meant to be published to, say, a public Docker repo.
The dataset used for workflow testing was gradually slimmed down in order to make the tests run in less time. At one point this was accomplished by reducing the set of samples that end up in the "specimen data measurement study", albeit without removing them from the list that includes them in the "specimen collection study". So the discrepancy between the 2 vs 7 samples is somewhat deliberate, and it furnishes a test of the data import functionality's capability of distinguishing between samples that were collected and have some metadata versus ones that went all the way through the measurement process.
The error you observed there should be pretty easy to track down by looking at the source code starting from the traceback reference or log message reference. A quick grep starts this search:
$ grep -r -n 'not found in study' spatialprofilingtoolbox/*
spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py:305: raise ValueError(f'Specimen "{specimen}" not found in study "{measurement_study}".')
So, here.
This is fine.
The number of these features is O(kn^2), where k is the number of radius levels (2) and n is the number of channels or phenotypes (26 + 4). Though k and n are among the small cardinalities, the order is cubic in these, which can be too large; in this case around 1800, which is indeed too many. The large number of these outputs is actually the reason that this workflow is not part of our default pipeline to support the main application; it was the impetus for the ondemand capabilities. However, it is the feature exporter (not the proximity workflow directly) which is generating these messages, and I do think it is a desirable default for each feature upload to register a log message for the purpose of auditing and debugging. So what is needed is a workflow-specific silencer, i.e. a quiet flag in ADIFeaturesUploader that will either silence or summarize the messages generated by logger.debug('Inserted feature specification, "%s".', specifiers) and logger.debug('Inserted %s feature values.', len(feature_values)).
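To illustrate, here is a minimal sketch of what I mean by such a flag; the class is a hypothetical stand-in that only mirrors the two debug calls quoted above, not the actual ADIFeaturesUploader internals.
import logging

logger = logging.getLogger(__name__)

class FeaturesUploaderSketch:
    """Hypothetical stand-in for ADIFeaturesUploader, showing the proposed quiet flag."""

    def __init__(self, quiet: bool = False):
        self.quiet = quiet
        self.total_values = 0

    def upload(self, specifiers, feature_values) -> None:
        # ... the real uploader would insert into the feature specification tables here ...
        self.total_values += len(feature_values)
        if not self.quiet:
            logger.debug('Inserted feature specification, "%s".', specifiers)
            logger.debug('Inserted %s feature values.', len(feature_values))

    def summarize(self) -> None:
        # With quiet=True, one summary line replaces the per-feature messages.
        if self.quiet:
            logger.debug('Inserted %s feature values in total.', self.total_values)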
expected_counts_1and2.txt is used to verify that the preloaded data image builds correctly, in import_test_dataset1and2.sh. This is why it is in build/, along with this import script. The other expected_counts files, referenced by tests, are in test/.
The test/ and build/ conventions were initiated by @CarlinLiao in #103. If I understand your comment correctly, you are suggesting further subdividing build/ into things related to published docker images and things related to testing-only docker images. I don't think this is a high priority, but the logic makes sense. You could make a separate issue for this if you'd like, as this doesn't have much to do with adding the cggnn workflow.
ADIFeaturesUploader.
Upon further inspection, this error is due to how spt cggnn extracts single cell data. Here's what's happening: spt cggnn uses FeatureMatrixExtractor.extract_cohorts(study) to pull data on all cohorts, including what specimens are associated with each cohort/stratum, and it then extracts cell data for every one of those specimens, which fails for any specimen that was collected but never measured, like lesion 0_2.
All of this has worked when I run spt cggnn manually outside of the test environment so far because every specimen in every study currently in the live database exists in both the "specimen data measurement study" and "specimen collection study".
I was confused as to why deleting the specimens missing cell manifests from samples.tsv didn't fix the error, since then dataset3 should no longer have had a concept of a "lesion 0_2" at all. Upon further inspection of the build/import_test_dataset scripts, dataset3 is only used to build the small, no-intensity version of dataset1, while a script-edited copy of dataset1 is used to build the nadeemlab/spt-db-preloaded-1small image used by workflow tests per its compose.yaml.
If we aren't to assume that every specimen in the "data measurement study" is in the "collection study", I'll need to refactor either or both of cggnn.extract and FeatureMatrixExtractor.extract_cohorts(study) (which itself uses StratificationPuller) to select only specimens that exist in both tables, i.e., that are both assigned to a stratum and have measurement data. Can I assume that the latter is a subset of the former?
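For concreteness, here is a rough sketch of the kind of filtering I'm considering. The helper name is made up; it leans on the fact, visible in the traceback below, that the per-specimen extraction raises ValueError for a specimen missing from the measurement study.
from spatialprofilingtoolbox.db.feature_matrix_extractor import FeatureMatrixExtractor

def extract_measured_only(extractor: FeatureMatrixExtractor, specimens: list[str]) -> dict:
    """Hypothetical helper: extract per-specimen data, skipping specimens that were never measured."""
    extractions = {}
    for specimen in specimens:
        try:
            extractions[specimen] = extractor.extract(specimen=specimen, retain_structure_id=True)[specimen]
        except ValueError:
            # Collected but never measured (e.g. "lesion 0_2"); skip rather than fail.
            continue
    return extractions
Whether skipping silently like this is the right behavior is exactly the question above, as opposed to asserting up front that the measured specimens are a subset of the stratified ones.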
I'd also like to suggest changing or thoroughly documenting the test dataset build processes, for the sake of future debugging. The difference between how test/test_data/adi_preprocessed_tables presents the raw data versus how the build/import_test_dataset scripts actually build the test databases didn't occur to me, causing me to spend a while operating on an incorrect assumption about how the databases were laid out after upload. Granted, this was a clear unforced error on my part, but I think this process could be made more intuitive for future developers, and could be as simple as renaming import_test_dataset1smallnointensity to import_test_dataset3 so that it's implied that all scripts that do retain the import_test_dataset1 prefix are derived from dataset1.
Here's the full traceback for the issue:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 168, in <module>
df_cell, df_label, label_to_result = extract_cggnn_data(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/extract.py", line 122, in extract_cggnn_data
df_cell = _create_cell_df({
^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/extract.py", line 123, in <dictcomp>
specimen: extractor.extract(specimen=specimen, retain_structure_id=True)[specimen].dataframe
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 126, in extract
extraction = self._extract(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 146, in _extract
data_arrays = self._retrieve_expressions_from_database(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 177, in _retrieve_expressions_from_database
puller.pull(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 232, in pull
self._retrieve_data_arrays(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 252, in _retrieve_data_arrays
self._fill_data_arrays_for_study(
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 267, in _fill_data_arrays_for_study
specimens = self._get_pertinent_specimens(study_name, specimen=specimen)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 305, in _get_pertinent_specimens
raise ValueError(f'Specimen "{specimen}" not found in study "{study_name}".')
ValueError: Specimen "lesion 0_2" not found in study "Melanoma intralesional IL2 - measurement".
With the latest changes, make module-test-workflow passes, including the new test for the cg-gnn workflow. I think it's ready to merge, but for some reason make test hangs on ondemand fast cache assessment. Maybe I broke something there, although no changes in this PR should've affected it?
Tracking down the fast cache assessment hang, it stems from this:
10-27 13:50:15 [ DEBUG ] ondemand.fast_cache_assessor:93: Command is:
10-27 13:50:15 [ DEBUG ] ondemand.fast_cache_assessor:94: cd ../test_data/fast_cache_testing/missing; spt ondemand cache-expressions-data-array --database-config-file=none
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/scripts/cache_expressions_data_array.py", line 72, in <module>
main()
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/scripts/cache_expressions_data_array.py", line 39, in main
puller.pull_and_write_to_files(getcwd())
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/structure_centroids_puller.py", line 33, in pull_and_write_to_files
study_names = self._get_study_names()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/structure_centroids_puller.py", line 140, in _get_study_names
self.cursor.execute('SELECT name FROM specimen_measurement_study ;')
psycopg2.errors.UndefinedTable: relation "specimen_measurement_study" does not exist
LINE 1: SELECT name FROM specimen_measurement_study ;
^
10-27 13:50:20 [ ERROR ] ondemand.fast_cache_assessor:100: Received CalledProcessError during cache creation, probably insufficient available RAM.
10-27 13:50:20 [ INFO ] ondemand.fast_cache_assessor: Command 'cd ../test_data/fast_cache_testing/missing; spt ondemand cache-expressions-data-array --database-config-file=none' returned non-zero exit status 1.
10-27 13:50:20 [ INFO ] ondemand.fast_cache_assessor: Entering 2 hour indefinite sleep loop.
Some change I made must have affected the way things are cached? (Also, why does it enter a 2-hour sleep loop that is simultaneously described as indefinite?)
After aggressively cleaning out and remaking Docker images, all tests pass on my machine.
The sleep loop is a fallback error state to prevent excessive resource usage in case cache creation would otherwise be constantly re-attempted after unpredicted failures.
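For anyone reading later, the behavior in that log amounts to roughly the following pattern; this is an illustrative sketch, not the actual fast_cache_assessor code.
import logging
import subprocess
from time import sleep

logger = logging.getLogger('ondemand.fast_cache_assessor')

def attempt_cache_creation(command: str) -> None:
    """Run the cache-creation command once; on failure, back off indefinitely instead of retrying."""
    try:
        subprocess.run(command, shell=True, check=True)
    except subprocess.CalledProcessError as error:
        logger.error('Received CalledProcessError during cache creation, probably insufficient available RAM.')
        logger.info(str(error))
        logger.info('Entering 2 hour indefinite sleep loop.')
        while True:
            sleep(2 * 60 * 60)  # sleep two hours at a time, forever, rather than retrying and burning resources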
Here is some info about the files generated by the new test (number of bytes and number of lines).
$ wc -c *
2533 mi2_importances.csv
1362921 mi2_model.pt
1365454 total
$ wc -l *
180 mi2_importances.csv
4810 mi2_model.pt
4990 total
Note that there is a log flushing issue during this new test. There are "record docker logs" and "print docker logs" scripts that wrap each test, and normally this system works OK, but for some reason the log messages are spilling over into the next test.
Here is a command you could use to make a superficial signature of the importance scores output:
$ cut -f2 -d, mi2_importances.csv | sort | head | grep -o '[0-9]\.[0-9]\{0,4\}'
0.0162
0.0203
0.0297
0.0340
0.0342
0.0544
0.0632
0.0652
0.0711
0.0775
(Takes the first ten of these values after sorting and truncates them to 4 decimal digits.)
Regarding logs, looking closer I see that it is only the docker compose containers that are logged carefully in this special way. The regular outputs of the commands (test scripts) that are run in the development container are not treated specially. It seems that for whatever reason we need to flush them manually before proceeding with the next test.
I think I managed to track down the logs issue to a problem with nextflow hanging on to output. By using the option process.debug = true in the nextflow config I am getting some of the output printed immediately (stdout vs stderr perhaps; some output still spills over). It's still sort of a mystery how nextflow is able to keep this output even after the docker container it is run inside of is supposedly killed and a new one started for the next test. The only explanation I can think of is that nextflow is interfering with the docker container's teardown.
I updated the test with the file sizes and line counts you reported to me, as well as adding a line to surface the biggest importance scores to see how stable they are across runs. However, my run wasn't able to make it to the importance scores because it errors on the database connection:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 175, in <module>
df_cell, df_label, label_to_result = extract_cggnn_data(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/extract.py", line 116, in extract_cggnn_data
cohorts = extractor.extract_cohorts(study)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 253, in extract_cohorts
return self._extract_cohorts(study)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 256, in _extract_cohorts
stratification = self._retrieve_derivative_stratification_from_database()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 267, in _retrieve_derivative_stratification_from_database
puller.pull(measured_only=True)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/stratification_puller.py", line 35, in pull
self.stratification = self._retrieve_stratification(measured_only=measured_only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/stratification_puller.py", line 42, in _retrieve_stratification
study_names = retrieve_study_names(self.database_config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/database_connection.py", line 163, in retrieve_study_names
with DBCursor(database_config_file=database_config_file) as cursor:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/database_connection.py", line 113, in __init__
super().__init__(**kwargs)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/database_connection.py", line 69, in __init__
raise exception
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/database_connection.py", line 65, in __init__
super().__init__(self.make_connection(credentials))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/database_connection.py", line 88, in make_connection
return connect(
^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
psycopg2.OperationalError: connection to server at "spt-db-testing" (172.18.0.2), port 5432 failed: FATAL: database "default_study_lookup" does not exist
I suspect this is an issue related to the database format update?
You need to rebuild the data-providing images, since the database format has completely changed.
make force-rebuild-data-loaded-images
I am still troubled by the fact that console outputs of the workflow cggnn test are somehow buffered and printed during the subsequent test (proximity). I don't understand where they could be stored during this time. The container in which the tests are run is created and destroyed for each test separately.
Thanks for the tip. On my machine, the importance values came out to
$ cut -f2 -d, "results/mi2_importances.csv" | sort | head | grep -o '[0-9]\.[0-9]\{0,4\}'
0.0175
0.0223
0.0270
0.0410
0.0471
0.0475
0.0501
0.0502
0.0507
0.0639
This doesn't seem stable enough to test with.
Yes, these are quite different. In the future we should identify which steps might be amenable to random seed selection, and we can revisit enforcement of reproducibility along the axis of importance score values.
For the time being, I added a check on at least the average value of those importance scores, to the nearest 2 decimal places. If even this is not reproduced on a second development machine, then we can deprecate it.
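The check is in the spirit of the following sketch (not necessarily the exact command added to the test); the column position matches the cut -f2 signature command above, and a header row, if present, is skipped.
import csv

def mean_importance(path: str = 'results/mi2_importances.csv') -> float:
    """Average of the importance scores (second CSV column), rounded to 2 decimal places."""
    scores = []
    with open(path, newline='') as handle:
        for row in csv.reader(handle):
            if len(row) > 1:
                try:
                    scores.append(float(row[1]))
                except ValueError:
                    continue  # skip a header row or malformed line
    return round(sum(scores) / len(scores), 2)

print(mean_importance())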
I believe I also tracked down and fixed the misplaced log buffer. The container running the test is mounted at run time to a certain path, where Nextflow stores all of its state (work/, .nextflow.log, etc.). Apparently under some exit conditions, Nextflow will fail to display some debug logs in time (i.e. before the container closes) and will later check these files for previously remaining logs.
I removed the debug option (to try to coerce Nextflow into not printing logs on its own at all) and explicitly cat the .command.log files after each test run. This seems to have done the trick.
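Concretely, the effect is equivalent to something like the following sketch (the actual change uses cat in the test scripts), walking the Nextflow work directory layout seen in the error above.
from pathlib import Path

# Print every Nextflow process log (work/<xx>/<hash>/.command.log) after a test run,
# so buffered process output is surfaced before the test container is torn down.
for log_file in sorted(Path('work').glob('*/*/.command.log')):
    print(f'==== {log_file} ====')
    print(log_file.read_text())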
Running the new average test on my hardware passes. Between this and the logs being fixed, this should be ready to merge?
The documentation edits look good and tests pass, ready to merge.
Pursuant to #85, these changes add a cg-gnn Nextflow configuration to the list of workflows.