nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images
https://oncopathtk.org

FeatureMatrixExtractor channel mismatch in live #243

Closed · CarlinLiao closed this issue 1 year ago

CarlinLiao commented 1 year ago

Thank you for resolving the Nextflow-script-to-Bash issue. Resolving it has re-revealed a problem I noticed earlier.

[Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'run_cggnn'

Caused by:
  Process `run_cggnn` terminated with an error exit status (123)

Command executed:

  #!/bin/bash

  strata_option=$( if [[ "1 3" != "all" ]]; then echo "--strata 1 3"; fi)
  disable_channels_option=$( if [[ "false" == "true" ]]; then echo "--disable_channels"; fi)
  disable_phenotypes_option=$( if [[ "false" = "true" ]]; then echo "--disable_phenotypes"; fi)
  in_ram_option=$( if [[ "true" == "true" ]]; then echo "--in_ram"; fi)
  merge_rois_option=$( if [[ "true" == "true" ]]; then echo "--merge_rois"; fi)
  prune_misclassified_option=$( if [[ "false" == "true" ]]; then echo "--prune_misclassified"; fi)
  upload_importances_option=$( if [[ "false" == "true" ]]; then echo "--upload_importances"; fi)

  echo \
      --spt_db_config_location \'spt_db.config\' \
      --study \'Melanoma intralesional IL2\' \
      ${strata_option} \
      --validation_data_percent 15 \
      --test_data_percent 0 \
      ${disable_channels_option} \
      ${disable_phenotypes_option} \
      --cells_per_slide_target 5000 \
      --target_name \'P Tumor\' \
      ${in_ram_option} \
      --batch_size 1 \
      --epochs 5 \
      --learning_rate \'1e-3\' \
      --k_folds 0 \
      --explainer_model \'pp\' \
      ${merge_rois_option} \
      ${prune_misclassified_option} \
      --output_prefix \'miil2\' \
      ${upload_importances_option} \
      | xargs spt cggnn run

Command exit status:
  123

Command output:
  (empty)

Command error:
  11-09 13:06:58 [  DEBUG  ] workflow.common.structure_centroids_puller:96: Received 100000 shapefiles entries from DB.
  11-09 13:06:58 [  DEBUG  ] workflow.common.structure_centroids_puller:96: Received 98336 shapefiles entries from DB.
  11-09 13:07:07 [  INFO   ] db.feature_matrix_extractor: Done retrieving centroids.
  11-09 13:07:10 [  INFO   ] db.feature_matrix_extractor: Retrieving phenotypes from database.
  11-09 13:07:11 [  INFO   ] db.feature_matrix_extractor: Done retrieving phenotypes.
  11-09 13:07:11 [  INFO   ] db.feature_matrix_extractor: Aggregating channel information for one study.
  11-09 13:07:11 [  INFO   ] db.feature_matrix_extractor: Done aggregating channel information.
  11-09 13:07:11 [  INFO   ] db.feature_matrix_extractor: Creating feature matrices from binary data arrays and centroids.
  11-09 13:07:11 [  DEBUG  ] db.feature_matrix_extractor:180: Specimen lesion 0_1 .
  Traceback (most recent call last):
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 934, in _finalize_columns_and_data
      columns = _validate_or_indexify_columns(contents, columns)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 981, in _validate_or_indexify_columns
      raise AssertionError(
  AssertionError: 28 columns passed, passed data had 54 columns

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 184, in <module>
      df_cell, df_label, label_to_result = extract_cggnn_data(
                                           ^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 130, in extract_cggnn_data
      df_cell = _create_cell_df({
                                ^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 131, in <dictcomp>
      specimen: extractor.extract(specimen=specimen, retain_structure_id=True)[specimen].dataframe
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract
      extraction = self._extract(
                   ^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 113, in _extract
      return self._create_feature_matrices(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 193, in _create_feature_matrices
      dataframe = DataFrame(
                  ^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/frame.py", line 782, in __init__
      arrays, columns, index = nested_data_to_arrays(
                               ^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 498, in nested_data_to_arrays
      arrays, columns = to_arrays(data, columns, dtype=dtype)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 840, in to_arrays
      content, columns = _finalize_columns_and_data(arr, columns, dtype)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 937, in _finalize_columns_and_data
      raise ValueError(err) from err
  ValueError: 28 columns passed, passed data had 54 columns
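
For reference, this pandas failure is easy to reproduce in isolation: the DataFrame constructor raises whenever the width of the supplied rows disagrees with the number of column labels. A minimal sketch:

import pandas as pd

# Three column labels but only two values per row; pandas raises
# "ValueError: 3 columns passed, passed data had 2 columns".
pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b', 'c'])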

It's interesting because this error doesn't occur when running the cggnn workflow test, so my first thought would be a difference between the test database and scstudies. Does this have to do with the removal of indexing you mentioned earlier this week, @jimmymathews?

Originally posted by @CarlinLiao in https://github.com/nadeemlab/SPT/issues/241#issuecomment-1802251462

CarlinLiao commented 1 year ago

Not directly related, but have you been getting errors when running force-rebuild-data-loaded-images? It could be that a stale version of the test databases runs properly with my local tests without accurately reflecting the newest changes to the db.

11.14 Fetched 101 MB in 10s (10.4 MB/s)
11.14 E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/m/mysql-8.0/libmysqlclient21_8.0.34-0ubuntu0.22.04.1_amd64.deb  404  Not Found [IP: 185.125.190.39 80]
11.14 E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/m/mysql-8.0/libmysqlclient-dev_8.0.34-0ubuntu0.22.04.1_amd64.deb  404  Not Found [IP: 185.125.190.39 80]
11.14 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
------
development_prereqs.Dockerfile:28
--------------------
  26 |     COPY pyproject.toml.unversioned .
  27 |     RUN python -m pip install toml
  28 | >>> RUN apt install libgdal-dev -y
  29 |     RUN python -c 'import toml; c = toml.load("pyproject.toml.unversioned"); print("\n".join(c["project"]["dependencies"]))' | python -m pip install -r /dev/stdin
  30 |     RUN python -c 'import toml; c = toml.load("pyproject.toml.unversioned"); print("\n".join(c["project"]["optional-dependencies"]["all"]))' | python -m pip install -r /dev/stdin
--------------------
ERROR: failed to solve: process "/bin/sh -c apt install libgdal-dev -y" did not complete successfully: exit code: 100
0.917 After this operation, 13.2 MB of additional disk space will be used.
0.917 Get:1 http://deb.debian.org/debian-security bookworm-security/main amd64 libssl-dev amd64 3.0.11-1~deb12u2 [2,430 kB]
1.189 Err:2 http://apt.postgresql.org/pub/repos/apt bookworm-pgdg/main amd64 libpq-dev amd64 16.0-1.pgdg120+1
1.189   404  Not Found [IP: 217.196.149.55 80]
1.192 E: Failed to fetch http://apt.postgresql.org/pub/repos/apt/pool/main/p/postgresql-16/libpq-dev_16.0-1.pgdg120%2b1_amd64.deb  404  Not Found [IP: 217.196.149.55 80]
1.192 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
1.192 Fetched 2,430 kB in 0s (7,762 kB/s)
------
Dockerfile:9
--------------------
   7 |     RUN apt install python3-venv -y
   8 |     RUN apt install python3-pip -y
   9 | >>> RUN apt install -y libpq-dev
  10 |     RUN apt install -y libgdal-dev
  11 |     RUN python3 -m pip install --break-system-packages psycopg2==2.9.6
--------------------
ERROR: failed to solve: process "/bin/sh -c apt install -y libpq-dev" did not complete successfully: exit code: 100
jimmymathews commented 1 year ago

Regarding force-rebuild-data-loaded-images:

This does not error for me, but I think I know what is going on. When apt install fails with HTTP 404, it typically means that apt update has not been run in a while, so the local package index is outdated (this is what the suggestion in the error message says). In the internal universe of this docker image, apt update might have been run a long time ago, since the RUN apt update layer is probably cached.

Yesterday Grigoriy and I added libgdal-dev as an explicit dependency in a few places, because some upstream python package dependencies of squidpy failed to install automatically under certain environmental conditions, which we chalked up to ARM architecture. This is why a new layer is now detected in your docker build context, and that served as the trigger for this error appearing in your development environment.

It is relatively easy to overcome this issue on a one-time basis by temporarily adding --no-cache to the docker build commands.

However, using --no-cache for every build would waste a lot of time, and since the package servers are not updated all that frequently, it is not normally necessary. Every couple of weeks I use make clean-docker-images to start fresh, and this may be the best practice for now. Note that issue #231 will hopefully relieve us permanently from the use of docker in the development context (docker for tests will be done remotely), making many issues like this one obsolete.

jimmymathews commented 1 year ago

For the feature matrix extraction error, I think a little more information is needed.

  1. What are you running to cause this error?
  2. Where are you running it? (Local, HPC...)
  3. What database are you providing to the workflow? (Local postgres, remote RDS, remote prototype)
  4. What datasets are in the database? (Did you load them in yourself locally?)

The error says that, at the time of creating the pandas DataFrame for the expression matrix, the format of the provided rows does not match the stipulated column format. Thus the row format, the column format, or both are incorrect. We should try to determine which of these three possibilities is occurring, by independently determining the correct expected format for the rows and the correct expected format for the columns.
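
For instance, the expected column format can be written down independently of the rows; a sketch with placeholder channel names (the real construction in feature_matrix_extractor.py may differ):

# One column per channel plus two pixel-coordinate columns; with 26
# channels this yields the 28 referenced in "28 columns passed".
channels = [f'channel {i}' for i in range(26)]  # placeholder names
columns = ['pixel x', 'pixel y'] + channels
print(len(columns))  # 28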

It is most likely a bug introduced by PR #230, which we should try to fix.

CarlinLiao commented 1 year ago

Will try reloading the docker images.

  1. I'm running the full cg-gnn workflow on the full Melanoma intralesional IL2 dataset, to verify that it works in a real context and to collect output for documentation.
  2. I'm running on the HPC.
  3. I'm providing it scstudies.
  4. Thus, no local databases.
jimmymathews commented 1 year ago

scstudies is a deprecated name; it was the former name of the monolithic database, inside the database cluster, that contained all of our studies. My question was really about which database "cluster" is searched (i.e. where the running instance of the postgresql server is).

jimmymathews commented 1 year ago

(I now think that the answer is the RDS database, but you can confirm.)

CarlinLiao commented 1 year ago

It is the RDS database (unless there are multiple RDS databases, of course).

jimmymathews commented 1 year ago

The debugging procedure I'm using is to install a local build of the spatialprofilingtoolbox wheel from main (i.e. what is produced in dist/ after make development-image), then run an attempted-reproduction snippet:

from spatialprofilingtoolbox.db.feature_matrix_extractor import FeatureMatrixExtractor
extractor = FeatureMatrixExtractor('.spt_db.config.aws')
x = extractor.extract(specimen='lesion 0_1')

I am able to reproduce this issue, so the strategy is working. I am now checking the rows and columns, etc.
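
One quick check on the row side is to tabulate the lengths of the rows being fed to the DataFrame constructor; a consistent extraction should show a single length. A sketch, where rows stands for the nested data built in _create_feature_matrices:

from collections import Counter

# Count how many rows occur at each width; any key other than the
# single expected width (28 here) indicates malformed rows.
print(Counter(len(row) for row in rows))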

jimmymathews commented 1 year ago

The rows being provided to the DataFrame constructor have many different lengths, including what seems to be the correct length of 28 (26 channels plus 2 for pixel coordinates), which matches the column format.

Here are some error messages from debug logging added specifically for this issue:

11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 36:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [5290.0, 6.0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 44:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [5295.0, 62.0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 52:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [5360.0, 285.0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 36:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [5283.0, 3091.0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 36:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [4280.0, 2949.0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 40:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [3892.0, 2947.0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 36:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [4438.0, 2955.0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 36:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [4579.0, 2965.0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 34:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [4100.0, 2953.0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:198: Unexpected length 34:
11-09 15:49:52 [  DEBUG  ] db.feature_matrix_extractor:199: [4136.0, 2952.0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
jimmymathews commented 1 year ago

I added an "error guard" against this inconsistency in branch issue243.
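
A minimal sketch of the kind of guard this refers to, modeled on the _check_targets message that appears in a later traceback (the standalone signature here is hypothetical):

def check_targets(targets, target_index_lookup):
    # Refuse to build a row for a cell that carries more expression
    # values than there are known channel targets for the study.
    if len(targets) > len(target_index_lookup):
        raise ValueError(
            f'Got {len(targets)} expression values for some cell, '
            f'expected {len(target_index_lookup)} or fewer.'
        )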

jimmymathews commented 1 year ago

The binary-encoded expression vectors stored in these integers are sometimes erroneous, overflowing the expected number of channels (26 in this case).

These originate in cache files on a persistent volume, which I am having trouble deleting and refreshing. After manual deletion they seem to "rise from the dead" and come right back. So, most likely there are outdated cache files here from who knows when. I'm still working on this.

jimmymathews commented 1 year ago

Wait, what I said is not right: FeatureMatrixExtractor retrieves directly from the database (I was confusing this with the ondemand service).

jimmymathews commented 1 year ago

(To answer your question: it does not pertain to the removal of indexing, because that has not taken place yet.)

jimmymathews commented 1 year ago

It seems that at some point I accidentally partially uploaded a dataset twice. This isn't supposed to be possible, so that's a bug. In many cases I'm seeing 52 (= 26 * 2) expression values where there are supposed to be 26.
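
Duplication of this kind can be confirmed directly by counting repeated (cell, target) pairs among the pulled sparse expression entries; a diagnostic sketch, with sparse_entries standing hypothetically for those pairs:

from collections import Counter

# Each (cell, target) measurement should occur exactly once; a pair
# seen more than once means it was ingested multiple times.
pair_counts = Counter((cell, target) for cell, target in sparse_entries)
duplicated = [pair for pair, count in pair_counts.items() if count > 1]
print(f'{len(duplicated)} duplicated (cell, target) pairs')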

jimmymathews commented 1 year ago

GitHub automatically closed this issue with #245, even though I explicitly stated that that PR does not close this issue. Weird.

CarlinLiao commented 1 year ago

After #245, running the cggnn workflow on the RDS "Melanoma intralesional IL2" dataset worked without error, but "Urothelial ICI" throws a similar error.

[  DEBUG  ] workflow.common.sparse_matrix_puller:377: Received 34123866 sparse entries total from DB.
  Traceback (most recent call last):
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/run.py", line 184, in <module>
      df_cell, df_label, label_to_result = extract_cggnn_data(
                                           ^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 130, in extract_cggnn_data
      df_cell = _create_cell_df({
                                ^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 131, in <dictcomp>
      specimen: extractor.extract(specimen=specimen, retain_structure_id=True)[specimen].dataframe
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract  
      extraction = self._extract(
                   ^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 98, in _extract 
      data_arrays = self._retrieve_expressions_from_database(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 129, in _retrieve_expressions_from_database
      puller.pull(
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 235, in pull
      self._retrieve_data_arrays(
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 255, in _retrieve_data_arrays
      self._fill_data_arrays_for_study(
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 286, in _fill_data_arrays_for_study
      parsed = parse(sparse_entries, _specimen, continuous_also=continuous_also)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 462, in _parse_data_arrays_by_specimen
      self._check_targets(list(df_group['target']), target_index_lookup)
    File "/home/liaoc2/miniconda3/envs/spt_cggnn/lib/python3.11/site-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 476, in _check_targets
      raise ValueError(f'Got {len(targets)} expression values for some cell, expected {len(target_index_lookup)} or fewer.')
  ValueError: Got 42 expression values for some cell, expected 14 or fewer.
jimmymathews commented 1 year ago

The relation to the SPT codebase is marginal; dataset integrity is the issue. Yesterday I re-uploaded the melanoma dataset, being careful not to accidentally interrupt or restart any import operations. But I haven't done the others yet.

jimmymathews commented 1 year ago

The long time required for these dataset management tasks is why I am prioritizing issues like #222 and #226 so highly. I want to unblock the other work.

jimmymathews commented 1 year ago

This issue was reproduced and then fixed by cleaning the affected datasets in the db. Tested provisionally in the live db.