hubmapconsortium / sprm

GNU General Public License v3.0

Failure in BLAS memory allocation #10

Closed mruffalo closed 3 years ago

mruffalo commented 3 years ago

Processing of a CODEX data set failed at the SPRM step, due to an apparent memory allocation failure in the relevant BLAS library:

$ docker \
    run \
    -i \
    --mount=type=bind,source=/hive/hubmap-test/scratch/trig__2021-06-11T04:17:50.304061+00:00/cwl-out-tmp/scma78f0,target=/noaaQJ \
    --mount=type=bind,source=/hive/hubmap-test/scratch/trig__2021-06-11T04:17:50.304061+00:00/cwl-tmp/5bt5117_,target=/tmp \
    --mount=type=bind,source=/hive/hubmap-test/scratch/trig__2021-06-11T04:17:50.304061+00:00/cwl_out/stitched/expressions,target=/var/lib/cwl/stg84be57e7-81d3-46d7-af05-8b22ae9b3a68/expressions,readonly \
    --mount=type=bind,source=/hive/hubmap-test/scratch/trig__2021-06-11T04:17:50.304061+00:00/cwl_out/stitched/mask,target=/var/lib/cwl/stg02308545-b005-4ca8-9415-61df59d9cc13/mask,readonly \
    --workdir=/noaaQJ \
    --read-only=true \
    --user=68728:23629 \
    --rm \
    --env=TMPDIR=/tmp \
    --env=HOME=/noaaQJ \
    --cidfile=/hive/hubmap-test/scratch/trig__2021-06-11T04:17:50.304061+00:00/cwl-tmp/y7up35a7/20210611013843-402242.cid \
    hubmap/sprm:1.0.3.2 \
    sprm \
    /var/lib/cwl/stg84be57e7-81d3-46d7-af05-8b22ae9b3a68/expressions \
    /var/lib/cwl/stg02308545-b005-4ca8-9415-61df59d9cc13/mask \
    --enable-manhole
Manhole[1:1623389928.4650]: Not patching os.fork and os.forkpty. Activation is done by signal Signals.SIGUSR1
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

The docker command line was built by cwltool and is included only for completeness. The message "BLAS : Program is Terminated. Because you tried to allocate too many memory regions." was repeated more than 30 times before execution stopped.

This was run on the l002 compute node, which has almost 3 TB of memory.
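For context, this message is one that OpenBLAS prints when more threads ask for internal buffers than the library was built to serve, which can happen on a many-core node even when RAM is plentiful. Whether that is the cause here is an assumption, and the sketch below is a commonly used workaround rather than the fix eventually applied in SPRM. The helper name `limit_blas_threads` is hypothetical; the environment variable names are the ones OpenBLAS, OpenMP runtimes, and MKL read at load time.

```python
import os

# Environment variables consulted by common BLAS/OpenMP runtimes when they load.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(max_threads=1, env=None):
    """Cap BLAS thread counts by setting the usual environment variables.

    Hypothetical helper: must be called before numpy/scipy are imported,
    because the libraries read these variables only once, at load time.
    `env` defaults to os.environ; pass a dict to build settings for a
    subprocess or container instead.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    target = os.environ if env is None else env
    for name in BLAS_THREAD_VARS:
        target[name] = str(max_threads)
    # Return only the values that were applied, for logging or inspection.
    return {name: target[name] for name in BLAS_THREAD_VARS}


# Build the settings into a scratch dict rather than the live environment.
settings = limit_blas_threads(4, env={})
print(settings)
```

The same variables could instead be passed to the container with additional `--env=` options, in the style of the `--env=TMPDIR=/tmp` option already present in the command above.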

mruffalo commented 3 years ago

Stitched pixel dimensions: 12665 ⨉ 7481
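For a sense of scale, the arithmetic below estimates the in-memory size of one stitched image plane at these dimensions. The 8 bytes per value (float64) and the `channels` parameter are assumptions, since the thread states neither the dtype nor the channel count; the function name `image_bytes` is hypothetical.

```python
def image_bytes(width, height, channels=1, bytes_per_value=8):
    """Bytes needed to hold a dense width x height x channels array.

    bytes_per_value=8 assumes float64; channels=1 is a placeholder because
    the actual channel count of the data set is not given in this thread.
    """
    return width * height * channels * bytes_per_value


pixels = 12665 * 7481
one_plane = image_bytes(12665, 7481)
print(pixels, one_plane, one_plane / 2**30)  # pixel count, bytes, GiB
```

Under those assumptions a single plane is well under 1 GiB, so a node with almost 3 TB of memory is unlikely to be short of physical RAM for arrays of this size.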

pecan88 commented 3 years ago

@mruffalo - do we have a target date for resolving this issue affecting CODEX dataset processing?

mruffalo commented 3 years ago

@pecan88 We have a fix that seems to resolve the issue and are now testing it on a full-size data set. This SPRM invocation has been running for about a day and a half, and the failure usually manifested much earlier, so this seems very promising, but we'd still like to see the run succeed before tagging a release. (Testing on a small data set is much faster but isn't informative: this issue never manifested in small test data sets.)

mruffalo commented 3 years ago

Fixed by https://github.com/hubmapconsortium/sprm/commit/77aa8d7cc0500e57ef37d0c1d2017b328ab44621#diff-f34da55ca08f1a30591d8b0b3e885bcc678537b2a9a4aadea4f190806b374ddcR1.