ome / ome2024-ngff-challenge

Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data
https://pypi.org/project/ome2024-ngff-challenge/
BSD 3-Clause "New" or "Revised" License

OOM failure for conversion of large images... #28

Closed will-moore closed 3 weeks ago

will-moore commented 1 month ago

Attempting to convert the image at https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846151.zarr/0/ with Shape [1,3,1402,5192,2947] and uncompressed 0-resolution size of 128.71 GB...

Using branch https://github.com/ome/ome2024-ngff-challenge/pull/23 at https://github.com/ome/ome2024-ngff-challenge/pull/23/commits/f17a6de9638acb1d9493a34576aa7916e0737393

This fails with OOM; e.g. I initially tried with 3D chunks/shards:

$ ome2024-ngff-challenge --input-bucket=idr --input-endpoint=https://uk1s3.embassy.ebi.ac.uk --input-anon zarr/v0.4/idr0048A/9846151.zarr/0 9846151.zarr --output-shards=1,1,512,512,512 --output-chunks=1,1,64,64,64

Same with writing 2D shards/chunks...

$ ome2024-ngff-challenge --input-bucket=idr --input-endpoint=https://uk1s3.embassy.ebi.ac.uk --input-anon zarr/v0.4/idr0048A/9846151.zarr/0 /data/idr0048/will/9846151_2D_shards_maintest.zarr --output-shards=1,1,1,512,512 --output-chunks=1,1,1,64,64 --log debug

We start to see some chunk dirs created, e.g.

ls 9846151_2D_shards_maintest.zarr/0/c/0/0/
0/  10/ 12/ 14/ 16/ 18/ 2/  21/ 23/ 25/ 27/ 29/ 30/ 32/ 34/ 36/ 38/ 4/  41/ 43/ 6/  8/  
1/  11/ 13/ 15/ 17/ 19/ 20/ 22/ 24/ 26/ 28/ 3/  31/ 33/ 35/ 37/ 39/ 40/ 42/ 5/  7/  9/  

But then the machine hangs and becomes unresponsive (can't ssh into it anymore etc).

Also tried with local v2 zarr data as input.

The TensorStore code used is essentially:

import tensorstore as ts
...
    read = ts.open(ts_config).result()
    write = ts.open(write_config).result()
    future = write.write(read)
    future.result()

This follows the same pattern as https://medium.com/@TheHaseebHassan/google-ai-tensorstore-for-array-storage-173326bf5a95
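
For illustration, the read/write specs might look roughly like this (the paths, local-file kvstore, and shard/chunk sizes below are placeholders, not the script's actual configuration):

    import tensorstore as ts

    # Placeholder read spec: an existing Zarr v2 array on local disk
    read = ts.open({
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": "/data/input/9846151.zarr/0"},
    }).result()

    # Placeholder write spec: a new sharded Zarr v3 array with matching shape/dtype
    write = ts.open(
        {
            "driver": "zarr3",
            "kvstore": {"driver": "file", "path": "/data/output/9846151.zarr/0"},
        },
        create=True,
        dtype=read.dtype,
        shape=read.shape,
        chunk_layout=ts.ChunkLayout(
            write_chunk_shape=[1, 1, 512, 512, 512],  # shard shape
            read_chunk_shape=[1, 1, 64, 64, 64],      # chunk shape within a shard
        ),
    ).result()

    # Whole-array copy, as in the snippet above
    write.write(read).result()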

Machine above has 30 GB RAM.

imagesc-bot commented 1 month ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome2024-ngff-challenge/97363/32

JoOkuma commented 3 weeks ago

Hi @will-moore,

My understanding is that when you do read = ts.open(...).result(), you're loading the whole thing into memory.

I would check if it works without calling result. Otherwise, it might need to be copied in chunks.

will-moore commented 3 weeks ago

Thanks, I think you're right. But I realise I made an error in my summary: we don't actually call read.result(). It's actually:

read = input_config.ts_read()
write = ts.open(write_config).result()
future = write.write(read)
future.result()

I've managed to work around the OOM error in https://github.com/ome/ome2024-ngff-challenge/pull/23/files by reading & writing a block at a time:

    # read & write a chunk (or shard) at a time:
    blocks = shards if shards is not None else chunks
    for slice_tuple in chunk_iter(read.shape, blocks):
        LOGGER.debug(f"array_location: {slice_tuple}")
        future = write[slice_tuple].write(read[slice_tuple])
        future.result()
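
The chunk_iter helper isn't shown here; a minimal version might look like this (a sketch, not necessarily the repo's implementation):

    import itertools

    def chunk_iter(shape, block_shape):
        # Yield one tuple of slices per block, covering the full array extent
        ranges = [range(0, dim, step) for dim, step in zip(shape, block_shape)]
        for starts in itertools.product(*ranges):
            yield tuple(
                slice(start, min(start + step, dim))
                for start, step, dim in zip(starts, block_shape, shape)
            )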

However, this is kinda slow. So we need to let TensorStore manage this itself (hopefully in parallel), but also limit the memory usage so we don't fail with OOMs.

JoOkuma commented 3 weeks ago

Do you think it could be faster to use a cache_pool for the read array and transactions when writing?

I got this from their notes, [1] and [2].
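
For reference (not specific to this repo), a cache pool is configured through the spec's context; a rough sketch with placeholder path and size:

    read = ts.open({
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": "/data/input.zarr/0"},
        "context": {"cache_pool": {"total_bytes_limit": 4_000_000_000}},  # ~4 GB cap
    }).result()

Transactions (ts.Transaction) are shown in the snippets further down the thread.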

joshmoore commented 3 weeks ago

My assumption was that tensorstore would be able to handle the batching of the reads & writes. @JoOkuma, if you think activating the cache pool will make this possible, happy to give it a try. @will-moore, I've added parallelism to your PR in https://github.com/ome/ome2024-ngff-challenge/pull/23/commits/c23c196dc0ca49be666c4d8fd78cf9cf53a79008

cc: @jbms in case an RFE issue would be of interest

see also: https://forum.image.sc/t/ome2024-ngff-challenge-memory-issues-with-tensorstore/100636

JoOkuma commented 3 weeks ago

> My assumption was that tensorstore would be able to handle the batching of the reads & writes. @JoOkuma,

~I'm not a tensorstore expert, but from my point of view, once you call future.results(), you synchronize and disable any batch optimization.~

EDIT: Nevermind, I just saw https://github.com/ome/ome2024-ngff-challenge/commit/c23c196dc0ca49be666c4d8fd78cf9cf53a79008 where you delay the results call.

JoOkuma commented 3 weeks ago

I couldn't find the PR to comment on https://github.com/ome/ome2024-ngff-challenge/commit/c23c196dc0ca49be666c4d8fd78cf9cf53a79008

I would test whether the current implementation:

    for idx, batch in enumerate(batched(chunk_iter(read.shape, blocks), threads)):
        futures = []
        for slice_tuple in batch:
            future = write[slice_tuple].write(read[slice_tuple])
            LOGGER.info(f"batch {idx}: {slice_tuple} scheduled -- {future}")
            futures.append((slice_tuple, future))
        for slice_tuple, future in futures:
            future.result()
            LOGGER.info(f"batch {idx}: {slice_tuple} completed -- {future}")

is faster than:

    for idx, batch in enumerate(batched(chunk_iter(read.shape, blocks), threads)):
        with ts.Transaction() as txn:
            for slice_tuple in batch:
                write.with_transaction(txn)[slice_tuple] = read[slice_tuple]
                LOGGER.info(f"batch {idx}: {slice_tuple} scheduled in transaction")
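
In both snippets, batched could be itertools.batched (Python 3.12+) or an equivalent helper along these lines:

    from itertools import islice

    def batched(iterable, n):
        # Yield successive tuples of up to n items from iterable
        it = iter(iterable)
        while batch := tuple(islice(it, n)):
            yield batch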

joshmoore commented 3 weeks ago

> EDIT: Nevermind, I just saw c23c196 where you delay the results call.

:+1: but note that the original implementation did not call .result() on the read but passed it whole to the write. My understanding from @d-v-b, though, was that tensorstore didn't automatically do the batched reads.

> write.with_transaction(txn)[slice_tuple] = read[slice_tuple]

I'll give it a go.

joshmoore commented 3 weeks ago

Looking good so far! Thanks. I'll do a bit more testing before merging.

It does leave me to wonder though if:

    with ts.Transaction() as txn:
        for slice_tuple in chunk_iter(read.shape, blocks):
            write.with_transaction(txn)[slice_tuple] = read[slice_tuple]
            LOGGER.info(f"{slice_tuple} scheduled in transaction")
        LOGGER.info("waiting on transaction...")
    LOGGER.info("transaction complete")

isn't as good or better.

JoOkuma commented 3 weeks ago

I like it. I wonder if TensorStore manages the memory usage. If that's the case, both for loops could be wrapped with the transaction.

jbms commented 3 weeks ago

Tensorstore does not batch writes outside of transactions --- the previous writeback cache support was removed because it was not commonly used and added complexity.

Tensorstore also does not currently limit parallelism to limit memory usage, but we are working on that.

@laramiel

joshmoore commented 3 weeks ago

Thanks for the info, @jbms. I've added plots of usage to https://github.com/ome/ome2024-ngff-challenge/pull/23#issuecomment-2298207045 which seem to indicate that the strategy outlined above should work well. I'm not exactly sure of the optimal number of threads.

@JoOkuma, here's a plot for the version without transactions:

[image: plot-notxn]

Takes about twice as long.

imagesc-bot commented 3 weeks ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome2024-ngff-challenge-memory-issues-with-tensorstore/100636/7