nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images
https://oncopathtk.org

Integrate brotli compression feature #372

Closed jimmymathews closed 1 month ago

jimmymathews commented 1 month ago

I created this branch off compress-cell-data to integrate with main and do testing. I ended up slightly refactoring the cache assessment for two main reasons:

  1. The logic was quite complicated, as @franciscouzo observed when implementing this feature, and much of it was ripe for deprecation.
  2. I wanted the different types of "cache" files to be assessed/recreated independently and automatically.

A motivation for (2) is to make it easier to apply the new changes to the datasets already in the database (i.e. to compute the new compressed blobs) using the normal sync command.
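A minimal sketch of what "assessed/recreated independently" could look like, assuming each cache type exposes its own up-to-date check, delete step, and recreate step. The `CacheKind` class, the function names, and the in-memory `present` dictionary are all hypothetical stand-ins for illustration, not the names used in SPT:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CacheKind:
    """One independently managed type of cache artifact (hypothetical structure)."""
    name: str
    is_up_to_date: Callable[[], bool]
    delete: Callable[[], None]
    recreate: Callable[[], None]


def assess_and_recreate(kinds: list[CacheKind]) -> list[str]:
    """Check each cache kind on its own and rebuild only the stale ones."""
    messages = []
    for kind in kinds:
        if kind.is_up_to_date():
            messages.append(f'{kind.name} is up to date, not recreating.')
            continue
        messages.append(f'Deleting the {kind.name}.')
        kind.delete()
        messages.append(f'Recreating the {kind.name}.')
        kind.recreate()
    return messages


# Toy in-memory stand-in for the database: which cache kinds currently exist.
present = {'feature matrices': True, 'compressed payloads': False}


def make_kind(name: str) -> CacheKind:
    return CacheKind(
        name=name,
        is_up_to_date=lambda: present[name],
        delete=lambda: present.__setitem__(name, False),
        recreate=lambda: present.__setitem__(name, True),
    )


kinds = [make_kind(name) for name in present]
first_run = assess_and_recreate(kinds)   # rebuilds only the missing kind
second_run = assess_and_recreate(kinds)  # nothing left to do
```

Because each kind is checked separately, a newly introduced cache type can be backfilled for existing datasets without touching the ones that are already current, and a second run is a no-op.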

Along the way I renamed a lot of things, since many of the old names were misleading. The main update/sync command is now:

spt ondemand assess-recreate-cache

I locally tested this against rebuilt test database images.

As a proof of concept I have successfully run it on one of the datasets:

$ spt ondemand assess-recreate-cache --database-config-file=.spt_db.config --study-file=study.json
10-15 00:01:36 [ INFO  ] ondemand.cache_assessment                          ┃ Found expressions index file(s).
10-15 00:01:36 [ INFO  ] ondemand.cache_assessment                          ┃ Databased centroids files are present.
10-15 00:01:36 [ INFO  ] ondemand.cache_assessment                          ┃ Binary feature matrices and position data is up to date, not recreating.
10-15 00:01:37 [ INFO  ] ondemand.cache_assessment                          ┃ UMAP binary format data is up to date, not recreating.
10-15 00:01:39 [ INFO  ] ondemand.cache_assessment                          ┃ Study Melanoma CyTOF ICI lacks some compressed payloads.
10-15 00:01:39 [ INFO  ] ondemand.cache_assessment                          ┃ Deleting the Compressed binary per-sample payloads. (Melanoma CyTOF ICI)
10-15 00:01:41 [ INFO  ] ondemand.cache_assessment                          ┃ Recreating the Compressed binary per-sample payloads. (Melanoma CyTOF ICI)
10-15 00:01:44 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_01RD
10-15 00:01:44 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_02RD
10-15 00:01:44 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_04RD
10-15 00:01:45 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_05RD
10-15 00:01:46 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_06RD
10-15 00:01:46 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_08BL
10-15 00:01:47 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_09RD
10-15 00:01:48 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_10RD
10-15 00:01:49 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_12RD
10-15 00:01:49 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_13RD
10-15 00:01:49 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_14RD
10-15 00:01:50 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_16BL
10-15 00:01:50 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_19BL
10-15 00:01:51 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_21RD
10-15 00:01:52 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_22RD
10-15 00:01:52 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_23RD
10-15 00:01:53 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_24RD
10-15 00:01:54 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_25RD
10-15 00:01:54 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_26BL
10-15 00:01:55 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_29RD
10-15 00:01:55 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_31RD
10-15 00:01:56 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_32RD
10-15 00:01:57 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_33RD
10-15 00:01:57 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_34RD
10-15 00:01:58 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_35RD
10-15 00:01:58 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_37RD
10-15 00:01:59 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_39RD
10-15 00:01:59 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_40RD
10-15 00:02:00 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_41BL
10-15 00:02:01 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for Mold_sample_42RD
10-15 00:02:03 [WARNING] 208 db.accessors.cells                             ┃ Identifiers 0..112493 not consecutive: 5 should be 2.
10-15 00:02:12 [ INFO  ] workflow.common.cache_pulling                      ┃ Created brotli compressed cell data for UMAP virtual sample
$ spt ondemand assess-recreate-cache --database-config-file=.spt_db.config --study-file=study.json
10-15 00:02:24 [ INFO  ] ondemand.cache_assessment                          ┃ Found expressions index file(s).
10-15 00:02:24 [ INFO  ] ondemand.cache_assessment                          ┃ Databased centroids files are present.
10-15 00:02:24 [ INFO  ] ondemand.cache_assessment                          ┃ Binary feature matrices and position data is up to date, not recreating.
10-15 00:02:25 [ INFO  ] ondemand.cache_assessment                          ┃ UMAP binary format data is up to date, not recreating.
10-15 00:02:26 [ INFO  ] ondemand.cache_assessment                          ┃ Compressed binary per-sample payloads is up to date, not recreating.

After updating the application, the browser logs indicate that this compression is being selected and is working correctly. In one example from the browser console, the "transferred" size was 608 kB and the "size" was 1.32 MB, so the payload went over the wire at roughly 46% of its uncompressed size.
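For context, browser developer tools typically report the on-the-wire byte count as "transferred" and the decoded byte count as "size". The sketch below shows one way a server might pick a precomputed brotli payload, assuming the choice is driven by the request's `Accept-Encoding` header. That assumption, and the names `choose_encoding` and `select_payload`, are illustrative only and are not taken from the SPT code:

```python
def choose_encoding(accept_encoding: str) -> str:
    """Return 'br' if the Accept-Encoding header allows brotli, else 'identity'.

    Simplified on purpose: wildcard ('*') entries are not interpreted.
    """
    for item in accept_encoding.split(','):
        parts = item.strip().split(';')
        coding = parts[0].strip().lower()
        q = 1.0  # default quality value when none is given
        for param in parts[1:]:
            key, _, value = param.strip().partition('=')
            if key.strip().lower() == 'q':
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if coding == 'br' and q > 0:
            return 'br'
    return 'identity'


def select_payload(accept_encoding: str, raw: bytes, compressed: bytes) -> tuple[dict, bytes]:
    """Serve the precomputed brotli blob when the client accepts it."""
    if choose_encoding(accept_encoding) == 'br':
        return {'Content-Encoding': 'br'}, compressed
    return {}, raw


# Ratio for the figures quoted above: 608 kB transferred, 1.32 MB decoded.
ratio = 608 / 1320
```

Storing the compressed blob ahead of time means the server does no compression work per request; it only has to decide which stored payload to send.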

jimmymathews commented 1 month ago

I am also fixing a couple of silly bugs that are not related to this feature:

  1. Removing a completely incorrect hard-coded schema name in one of the queries. It was meant for testing only.
  2. When worker processes fail to delete/pop an item off the job queue, they give up and wait for the signal that a new metric is to be computed. It turns out, however, that the delete/pop query can fail fairly often even when the queue is not empty, possibly due to a race condition. I added a temporary "heartbeat" notification that signals all the worker processes to keep searching the job queue, up to a timeout.
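The heartbeat idea in item 2 can be sketched with threads and an in-process queue. This is only an analogy under stated assumptions: the real workers are separate processes popping from a database-backed queue, and every name here (`worker`, `heartbeat_loop`, `run_demo`, the timings) is hypothetical. A failed pop is modeled by `queue.Empty`, which stands in for both an empty queue and a lost race:

```python
import queue
import threading
import time


def worker(jobs: queue.Queue, cond: threading.Condition, done: list, timeout: float) -> None:
    """Keep retrying the pop on every heartbeat until idle for `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            # Pop failed: instead of giving up, wait for the next heartbeat
            # (never past the deadline) and then search the queue again.
            remaining = max(deadline - time.monotonic(), 0)
            with cond:
                cond.wait(timeout=remaining)
            continue
        done.append(job)
        deadline = time.monotonic() + timeout  # reset the idle timeout


def heartbeat_loop(cond: threading.Condition, stop: threading.Event, interval: float) -> None:
    """Periodically wake all waiting workers so they re-check the queue."""
    while not stop.is_set():
        with cond:
            cond.notify_all()
        stop.wait(interval)


def run_demo() -> list:
    jobs, done = queue.Queue(), []
    cond, stop = threading.Condition(), threading.Event()
    workers = [threading.Thread(target=worker, args=(jobs, cond, done, 0.5)) for _ in range(3)]
    beat = threading.Thread(target=heartbeat_loop, args=(cond, stop, 0.02))
    for thread in workers:
        thread.start()
    beat.start()
    for n in range(10):  # jobs arrive after the workers have started polling
        jobs.put(n)
    for thread in workers:
        thread.join()
    stop.set()
    beat.join()
    return sorted(done)


result = run_demo()
```

The periodic wake-up makes a single failed pop harmless: a worker that lost a race or missed a notification simply tries again on the next beat, and the idle timeout bounds how long it keeps searching.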