On my WSL machine, make test hangs on multiple tests, for reasons I suspect are related to my environment and not the changes I made. Here's some abridged output:
module-test-ondemand, class counts cohoused datasets
Attempt 24 to check readiness of containers.
Containers are ready
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈ Setup env. (27s)
┃ class counts cohoused datasets ┈┈┈
(Started recording Docker compose logs for this test.)
07-11 15:28:44 [ DEBUG ] 129 ondemand.providers.counts_provider ┃ Creating feature with specifiers: positive_markers=('CD3',) negative_markers=('CD8', 'CD20')
07-11 15:28:44 [ DEBUG ] 308 workflow.common.export_features ┃ Inserted specification 1 (population fractions).
07-11 15:28:44 [ DEBUG ] 315 workflow.common.export_features ┃ Inserting specifier: (1, '(("CD3",), ("CD8", "CD20"))', '1')
07-11 15:28:44 [ DEBUG ] 33 ondemand.scheduler ┃ Notifying queue activity channel that there are new items.
07-11 15:28:45 [ DEBUG ] 141 ondemand.request_scheduling ┃ Waiting for signal that feature 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:46 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:46 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:46 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
^CTraceback (most recent call last):
File "/mount_sources/ondemand/module_tests/get_class_counts.py", line 30, in <module>
main()
File "/mount_sources/ondemand/module_tests/get_class_counts.py", line 14, in main
counts = get_counts(
^^^^^^^^^^^
File "/mount_sources/ondemand/module_tests/get_class_counts.py", line 9, in get_counts
return OnDemandRequester.get_counts_by_specimen(positives, negatives, study_name, 0, ())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/request_scheduling.py", line 59, in get_counts_by_specimen
feature1, counts, counts_all = OnDemandRequester._counts(study_name, phenotype, selected)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/request_scheduling.py", line 104, in _counts
cls._wait_for_wrappedup(connection, get_results)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/request_scheduling.py", line 143, in _wait_for_wrappedup
for notification in notifications:
File "/usr/local/lib/python3.11/dist-packages/psycopg/connection.py", line 935, in notifies
raise ex.with_traceback(None)
KeyboardInterrupt
┃ retrieval of gnn plot ┈┈┈
(Started recording Docker compose logs for this test.)
and then nothing until I KeyboardInterrupt it again.
I'll try completely cleaning everything and running again, but this may mean I can't test my changes locally.
Before printing the logs at the end of a test, they are accumulated here:
build/apiserver/dlogs.*
There I found a repeated error message:
spt-ondemand--testing | 07-11 16:10:54 [ ERROR ] 77 ondemand.worker ┃ The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
spt-ondemand--testing | Traceback (most recent call last):
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/worker.py", line 75, in _compute
spt-ondemand--testing | provider.compute()
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 33, in compute
spt-ondemand--testing | count = self._perform_count(args, arrays)
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 62, in _perform_count
spt-ondemand--testing | return CountsProvider._count_structures_of_partial_signed_signature(*args, arrays)
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 74, in _count_structures_of_partial_signed_signature
spt-ondemand--testing | count = CountsProvider._get_count(arrays.phenotype, positives_signature, negatives_signature)
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 84, in _get_count
spt-ondemand--testing | return sum((array_phenotype | positives_mask == array_phenotype) and
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
That should fix the ValueError. numpy recognizes & (not the keyword and) as an instruction to apply bitwise operations element-wise.
Unfortunately, running make module-test-ondemand and checking build/ondemand/dlogs.* provides no insight into why the first test is hanging.
I didn't observe any hanging test, only failing tests: squidpy and squidpy custom phenotypes (apiserver module tests). Perhaps you need make force-rebuild-data-loaded-images?
I force rebuilt data loaded images and it didn't stop tests from hanging :/
The same tests fail for me on main using the laptop; it looks like my last merge was not tested thoroughly enough across different environments. I'll correct these and see if this affects this PR.
In the meantime, consider my remarks on the issue page (#337). I am suggesting writing a C submodule to do the for loop, rather than delegating to numpy.vectorize, to gain some stability.
Regarding the C submodule, I think that would be more trouble than it's worth. Although vectorizing provides a 12x speedup, the status quo implementation already resolves in minutes or less on our largest dataset (I think? I'm extrapolating from the runtimes I've seen), so the dev time would far eclipse the cumulative computation time saved (especially since I don't know C). From a practical standpoint, I would recommend either accepting a variation of this PR or keeping the status quo.
What I'm worried about is the future dev time spent debugging numpy.vectorize. I will most likely be responsible for maintaining this dependency. Since the required C is very little, perhaps the best thing is for me to just implement the C portion. We are talking about a for loop with a bitwise operator assignment inside; I think it is overkill to invoke numpy's ufunc apparatus for this. I'm also still somewhat concerned about the speedup being due to multicore: my test machine turns out to have 12 cores (not 24, like I thought), which is almost exactly the same as the speedup ratio.
I ran your snippet on my work MacBook (16 cores; 12 "performance" and 4 "efficiency"):
Pulling data arrays for lesion 3_2.
Sample has 353809 cells.
For loop runtime (pandas DataFrame): 9.960739612579346s
For loop runtime (direct): 0.021821260452270508s
Loop runtime (direct map): 0.0277099609375s
Implicitly vectorized runtime: 0.0014913082122802734s
Values are identical? True; True; True
and on the 64-core Red Hat server:
Pulling data arrays for lesion 3_2.
Sample has 353809 cells.
For loop runtime (pandas DataFrame): 29.325510263442993s
For loop runtime (direct): 0.037328243255615234s
Loop runtime (direct map): 0.05290555953979492s
Implicitly vectorized runtime: 0.0055408477783203125s
Values are identical? True; True; True
Speedup doesn't seem to be related to core counts.
Great, so it should just be due to inherent performance difference between loops in Python vs. C.
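For readers following along, here is a rough stand-in for the two approaches being compared. The actual benchmarked snippet differs, and the bit-mask encoding, marker assignments, and sizes below are assumptions chosen only to illustrate why an interpreted per-cell loop loses to numpy's compiled, single-core element-wise loops:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a sample's bit-encoded phenotype vector (~350k cells).
array_phenotype = rng.integers(0, 2**20, size=350_000, dtype=np.uint64)
positives_mask = np.uint64(0b1)    # assumed encoding: CD3 bit
negatives_mask = np.uint64(0b110)  # assumed encoding: CD8 and CD20 bits

# Per-cell Python loop: every iteration goes through the interpreter.
start = time.time()
count_loop = 0
for value in array_phenotype:
    if (value | positives_mask) == value and (value & negatives_mask) == 0:
        count_loop += 1
print(f'For loop runtime (direct): {time.time() - start}s')

# Element-wise numpy: the same logic runs in numpy's compiled C loops.
start = time.time()
count_vectorized = int(np.count_nonzero(
    ((array_phenotype | positives_mask) == array_phenotype)
    & ((array_phenotype & negatives_mask) == 0)
))
print(f'Implicitly vectorized runtime: {time.time() - start}s')
print(f'Values are identical? {count_loop == count_vectorized}')
```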
The failing tests seem to be due to my not updating ADIFeaturesUploader when I migrated the overall metrics computation to the per-sample isolated jobs paradigm. Since the on-demand calculations now involve creating the feature specification in advance (indeed, in a separate container from the containers that do the actual computation), I can't reuse ADIFeaturesUploader. Somehow I had passing tests before the merge, so probably I was working with older data images.
I'll fix this up as a separate bug fix issue, then I'll check that the work on this PR passes and merge.
At the same time I will spin off a separate issue to implement this inner array iteration with a C submodule, to replace the call to numpy vectorize. I want to do this to establish a pattern of delegation to small C functions that we can hopefully also use elsewhere to speed up the computationally expensive tasks (those that cannot be handled by numpy).
I was also getting intermittent hangs, maybe 1/4 of the time I did a test run.
The hangs were at a sort-of-understandable place: the computation workers were idle, waiting for the NOTIFY signal that there are new jobs. (This is instead of polling, which would have much worse performance.)
Looking more closely, it seems that due to a single misplaced indentation, the NOTIFY command is possibly issued right before the new queue items are available, because the database connection that performed the new item inserts isn't closed (triggering commit?) until immediately after the NOTIFY. With this behavior, it is up to Postgres to decide whether to commit the inserts first or notify the listeners, and apparently sometimes it does one and sometimes the other, depending perhaps on the exact timing.
I think this is now fixed.
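A minimal sketch of the corrected ordering with psycopg; the DSN, table, and channel names below are placeholders rather than the actual schema:

```python
import psycopg

DSN = "dbname=spt_example user=postgres"  # placeholder connection string
CHANNEL = "queue_activity"                # placeholder notification channel

def enqueue_and_notify(items: list[str]) -> None:
    # Insert the new queue items and commit first, so the rows are already
    # visible to any worker that wakes up. (Exiting the 'with' block commits
    # the transaction in psycopg 3.)
    with psycopg.connect(DSN) as connection:
        with connection.cursor() as cursor:
            for item in items:
                cursor.execute(
                    "INSERT INTO job_queue (payload) VALUES (%s)",  # placeholder table
                    (item,),
                )
    # Only now signal the workers. Sending NOTIFY from another connection
    # before the inserting transaction commits is exactly the race that
    # caused the intermittent hangs.
    with psycopg.connect(DSN) as connection:
        connection.execute("SELECT pg_notify(%s, '')", (CHANNEL,))
```

An equivalent fix is to issue the NOTIFY on the same connection and transaction as the inserts, since Postgres only delivers notifications to listeners after that transaction commits.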
To close #337. This should be ready but I've left it as a draft because I won't have access to a machine that can run the Docker test suite until late tonight.