On my WSL machine, make test hangs on multiple tests, for reasons I suspect are related to my environment and not the changes I made. Here's some abridged output:
module-test-ondemand, class counts cohoused datasets
Attempt 24 to check readiness of containers.
Containers are ready
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈ Setup env. (27s)
┃ class counts cohoused datasets ┈┈┈
(Started recording Docker compose logs for this test.)
07-11 15:28:44 [ DEBUG ] 129 ondemand.providers.counts_provider ┃ Creating feature with specifiers: positive_markers=('CD3',) negative_markers=('CD8', 'CD20')
07-11 15:28:44 [ DEBUG ] 308 workflow.common.export_features ┃ Inserted specification 1 (population fractions).
07-11 15:28:44 [ DEBUG ] 315 workflow.common.export_features ┃ Inserting specifier: (1, '(("CD3",), ("CD8", "CD20"))', '1')
07-11 15:28:44 [ DEBUG ] 33 ondemand.scheduler ┃ Notifying queue activity channel that there are new items.
07-11 15:28:45 [ DEBUG ] 141 ondemand.request_scheduling ┃ Waiting for signal that feature 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:45 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:46 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:46 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
07-11 15:28:46 [ DEBUG ] 150 ondemand.request_scheduling ┃ A job is complete, so 1 may be ready.
^CTraceback (most recent call last):
File "/mount_sources/ondemand/module_tests/get_class_counts.py", line 30, in <module>
main()
File "/mount_sources/ondemand/module_tests/get_class_counts.py", line 14, in main
counts = get_counts(
^^^^^^^^^^^
File "/mount_sources/ondemand/module_tests/get_class_counts.py", line 9, in get_counts
return OnDemandRequester.get_counts_by_specimen(positives, negatives, study_name, 0, ())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/request_scheduling.py", line 59, in get_counts_by_specimen
feature1, counts, counts_all = OnDemandRequester._counts(study_name, phenotype, selected)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/request_scheduling.py", line 104, in _counts
cls._wait_for_wrappedup(connection, get_results)
File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/ondemand/request_scheduling.py", line 143, in _wait_for_wrappedup
for notification in notifications:
File "/usr/local/lib/python3.11/dist-packages/psycopg/connection.py", line 935, in notifies
raise ex.with_traceback(None)
KeyboardInterrupt
┃ retrieval of gnn plot ┈┈┈
(Started recording Docker compose logs for this test.)
and then nothing until I KeyboardInterrupt it again.
I'll try completely cleaning everything and running again, but this may mean I can't test my changes locally.
Before printing the logs at the end of a test, they are accumulated here:
build/apiserver/dlogs.*
There I found a repeated error message:
spt-ondemand--testing | 07-11 16:10:54 [ ERROR ] 77 ondemand.worker ┃ The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
spt-ondemand--testing | Traceback (most recent call last):
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/worker.py", line 75, in _compute
spt-ondemand--testing | provider.compute()
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 33, in compute
spt-ondemand--testing | count = self._perform_count(args, arrays)
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 62, in _perform_count
spt-ondemand--testing | return CountsProvider._count_structures_of_partial_signed_signature(*args, arrays)
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 74, in _count_structures_of_partial_signed_signature
spt-ondemand--testing | count = CountsProvider._get_count(arrays.phenotype, positives_signature, negatives_signature)
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | File "/usr/local/lib/python3.11/site-packages/spatialprofilingtoolbox/ondemand/providers/counts_provider.py", line 84, in _get_count
spt-ondemand--testing | return sum((array_phenotype | positives_mask == array_phenotype) and
spt-ondemand--testing | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
spt-ondemand--testing | ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
That should fix the ValueError. numpy recognizes & (not the keyword and) as an instruction to apply bitwise operations element-wise.
Unfortunately, running make module-test-ondemand and checking build/ondemand/dlogs.* provides no insight into why the first test is hanging.
I didn't observe any hanging test, only failing tests: squidpy and squidpy custom phenotypes (apiserver module tests). Perhaps you need make force-rebuild-data-loaded-images?
I force rebuilt data loaded images and it didn't stop tests from hanging :/
The same tests fail for me on main using the laptop; it looks like my last merge was not tested thoroughly enough across different environments. I'll correct these and see if this affects this PR.
In the meantime, consider my remarks on the issue page (#337). I am suggesting writing a C submodule to do the for loop, rather than delegating to numpy.vectorize, to gain some stability.
Regarding the C submodule, I think that would be more trouble than it's worth. Although vectorizing provides a 12x speedup, the status quo implementation already resolves in minutes or less on our largest dataset (I think? I'm extrapolating from the runtimes I've seen), so the dev time would far eclipse the cumulative computation time saved (especially since I don't know C). From a practical standpoint, I would recommend either accepting a variation of this PR or keeping the status quo.
What I'm worried about is the future dev time spent debugging numpy.vectorize. I will most likely be responsible for maintaining this dependency. Since the required C is very little, perhaps the best thing is for me to just implement the C portion. We are talking about a for loop with a bitwise operator assignment inside; I think it is overkill to invoke numpy's ufunc apparatus for this. I'm also still somewhat concerned about the speedup being due to multicore: my test machine turns out to have 12 cores (not 24, like I thought), which is almost exactly the same as the speedup ratio.
I ran your snippet on my work MacBook (16 cores; 12 "performance" and 4 "efficiency"):
Pulling data arrays for lesion 3_2.
Sample has 353809 cells.
For loop runtime (pandas DataFrame): 9.960739612579346s
For loop runtime (direct): 0.021821260452270508s
Loop runtime (direct map): 0.0277099609375s
Implicitly vectorized runtime: 0.0014913082122802734s
Values are identical? True; True; True
and on the 64-core Red Hat server:
Pulling data arrays for lesion 3_2.
Sample has 353809 cells.
For loop runtime (pandas DataFrame): 29.325510263442993s
For loop runtime (direct): 0.037328243255615234s
Loop runtime (direct map): 0.05290555953979492s
Implicitly vectorized runtime: 0.0055408477783203125s
Values are identical? True; True; True
Speedup doesn't seem to be related to core counts.
Great, so it should just be due to inherent performance difference between loops in Python vs. C.
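For readers following along, here is a rough stand-in for the two approaches being compared. The actual benchmarked snippet differs, and the bit-mask encoding, marker assignments, and sizes below are assumptions chosen only to illustrate why an interpreted per-cell loop loses to numpy's compiled, single-core element-wise loops:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a sample's bit-encoded phenotype vector (~350k cells).
array_phenotype = rng.integers(0, 2**20, size=350_000, dtype=np.uint64)
positives_mask = np.uint64(0b1)    # assumed encoding: CD3 bit
negatives_mask = np.uint64(0b110)  # assumed encoding: CD8 and CD20 bits

# Per-cell Python loop: every iteration goes through the interpreter.
start = time.time()
count_loop = 0
for value in array_phenotype:
    if (value | positives_mask) == value and (value & negatives_mask) == 0:
        count_loop += 1
print(f'For loop runtime (direct): {time.time() - start}s')

# Element-wise numpy: the same logic runs in numpy's compiled C loops.
start = time.time()
count_vectorized = int(np.count_nonzero(
    ((array_phenotype | positives_mask) == array_phenotype)
    & ((array_phenotype & negatives_mask) == 0)
))
print(f'Implicitly vectorized runtime: {time.time() - start}s')
print(f'Values are identical? {count_loop == count_vectorized}')
```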
The failing tests seem to be due to my not updating ADIFeaturesUploader when I migrated the overall metrics computation to the per-sample isolated jobs paradigm. Since the on-demand calculations now involve creating the feature specification in advance (indeed, in a separate container from the containers that do the actual computation), I can't reuse ADIFeaturesUploader. Somehow I had passing tests before the merge, so probably I was working with older data images.
I'll fix this up as a separate bug fix issue, then I'll check that the work on this PR passes and merge.
At the same time I will spin off a separate issue to implement this inner array iteration with a C submodule, to replace the call to numpy vectorize. I want to do this to establish a pattern of delegation to small C functions that we can hopefully also use elsewhere to speed up the computationally expensive tasks (those that cannot be handled by numpy).
I was also getting intermittent hangs, maybe 1/4 of the time I did a test run.
The hangs were at a sort-of-understandable place: the computation workers were idle, waiting for the NOTIFY signal that there are new jobs. (This is instead of polling, which would have much worse performance.)
Looking more closely, it seems that due to a single misplaced indentation, the NOTIFY command is possibly issued right before the new queue items are available, because the database connection that performed the new item inserts isn't closed (triggering commit?) until immediately after the NOTIFY. With this behavior, it is up to Postgres to decide whether to commit the inserts first or notify the listeners, and apparently sometimes it does one and sometimes the other, depending perhaps on the exact timing.
I think this is now fixed.
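A minimal sketch of the corrected ordering with psycopg; the DSN, table, and channel names below are placeholders rather than the actual schema:

```python
import psycopg

DSN = "dbname=spt_example user=postgres"  # placeholder connection string
CHANNEL = "queue_activity"                # placeholder notification channel

def enqueue_and_notify(items: list[str]) -> None:
    # Insert the new queue items and commit first, so the rows are already
    # visible to any worker that wakes up. (Exiting the 'with' block commits
    # the transaction in psycopg 3.)
    with psycopg.connect(DSN) as connection:
        with connection.cursor() as cursor:
            for item in items:
                cursor.execute(
                    "INSERT INTO job_queue (payload) VALUES (%s)",  # placeholder table
                    (item,),
                )
    # Only now signal the workers. Sending NOTIFY from another connection
    # before the inserting transaction commits is exactly the race that
    # caused the intermittent hangs.
    with psycopg.connect(DSN) as connection:
        connection.execute("SELECT pg_notify(%s, '')", (CHANNEL,))
```

An equivalent fix is to issue the NOTIFY on the same connection and transaction as the inserts, since Postgres only delivers notifications to listeners after that transaction commits.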
To close #337. This should be ready but I've left it as a draft because I won't have access to a machine that can run the Docker test suite until late tonight.