juglab / cryoCARE_pip

PIP package of cryoCARE
BSD 3-Clause "New" or "Revised" License

n_tiles not automatically increased #20

Open rdrighetto opened 2 years ago

rdrighetto commented 2 years ago

Hi,

My understanding is that in prediction n_tiles should be automatically increased until OOM errors are avoided. However, this does not seem to be the case. I always run into an OOM error (see below) until I increase n_tiles to at least [2, 4, 2], which then works fine.

This is the error I get (full stderr attached):

2022-08-16 15:44:19.209565: W tensorflow/core/common_runtime/bfc_allocator.cc:441] **_____________________********************************************_________________________________
2022-08-16 15:44:19.209599: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at concat_op.cc:158 : Resource exhausted: OOM when allocating tensor with shape[1,280,512,928,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2022-08-16 15:44:46.276663: F tensorflow/stream_executor/cuda/cuda_dnn.cc:88] Check failed: narrow == wide (-1946157056 vs. 2348810240)checked narrowing failed; values not equal post-conversion
18.24user 59.67system 1:41.95elapsed 76%CPU (0avgtext+0avgdata 13349336maxresident)k
0inputs+18616outputs (0major+1911054minor)pagefaults 0swaps
srun: error: sgi65: task 0: Exited with exit code 6

cryocare_single_a100.err53499526.txt

I believe that the automatic incrementing of n_tiles is failing because of this other error: Check failed: narrow == wide (-1946157056 vs. 2348810240). Has anyone seen that?

Thanks!
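For context on what "automatically increased" means here: the behaviour being described is a retry loop that catches the out-of-memory error and doubles the tile count along the axis whose tiles are currently longest. This is not cryoCARE's actual implementation (the real loop lives in CSBDeep's prediction code); it is a minimal sketch of the pattern, with `fake_predict` and the 40-megavoxel budget as hypothetical stand-ins:

```python
import math

def largest_tile_axis(shape, n_tiles):
    # the axis whose tiles are currently the longest is the one to split next
    tile_sizes = [s / t for s, t in zip(shape, n_tiles)]
    return tile_sizes.index(max(tile_sizes))

def predict_with_retries(predict, shape, n_tiles, max_total_tiles=256):
    """Call predict(n_tiles); on OOM, double tiles along the longest axis and retry."""
    n_tiles = list(n_tiles)
    while True:
        try:
            return predict(tuple(n_tiles))
        except MemoryError:
            n_tiles[largest_tile_axis(shape, n_tiles)] *= 2
            if math.prod(n_tiles) > max_total_tiles:
                raise
            print(f"Out of memory, retrying with n_tiles = {tuple(n_tiles)}")

# Hypothetical stand-in for the network: "runs out of memory" whenever a
# single tile exceeds an arbitrary 40-megavoxel budget.
def fake_predict(n_tiles, shape=(300, 928, 928), budget=40_000_000):
    if math.prod(shape) // math.prod(n_tiles) > budget:
        raise MemoryError
    return n_tiles

print(predict_with_retries(fake_predict, (300, 928, 928), (1, 1, 1)))
# ends at n_tiles = (1, 4, 2) for this budget
```

Note that a fatal crash inside native TensorFlow code (like the Check failed above) aborts the whole process, so a Python-level loop like this never gets the chance to retry.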

thorstenwagner commented 2 years ago

I've no idea why that happens, to be honest... do you, @tibuch?

EuanPyle commented 1 year ago

Hey, I'm having a similar issue with n_tiles: if I start at 2,2,2 it crashes because this number is too small. cryoCARE does increase the n_tiles number, but despite the increases it always crashes:

Out of memory, retrying with n_tiles = (2, 4, 2, 1)
Out of memory, retrying with n_tiles = (2, 4, 4, 1)
Out of memory, retrying with n_tiles = (2, 8, 4, 1)
Out of memory, retrying with n_tiles = (2, 8, 8, 1)

with error messages like:

2022-09-02 15:01:18.424097: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_3d.cc:327 : Resource exhausted: OOM when allocating tensor with shape[1,16,264,248,296] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

If I then re-run the job starting at n_tiles=6,6,6 it works on the first go.

Maybe it is worth increasing all of the XYZ n_tiles values as it looks like one has stayed at 2? EDIT: Just tried this and increasing the XYZ values equally still doesn't work. Perhaps something to allow the n_tiles to increase a little bit more before bailing?

tibuch commented 1 year ago

> My understanding is that in prediction n_tiles should be automatically increased until OOM errors are avoided. However, this does not seem to be the case. I always run into an OOM error (see below) until I increase n_tiles to at least [2, 4, 2], which then works fine. [...]
>
> I believe that the automatic incrementing of n_tiles is failing because of this other error Check failed: narrow == wide (-1946157056 vs. 2348810240), has anyone seen that?

Indeed, this fails because of the Check failed: narrow == wide... error. Note that 2348810240 does not fit in a signed 32-bit integer and wraps to exactly -1946157056, so this looks like an int32 overflow on a tensor size; and because it is a fatal (F-level) check, the process aborts before any Python-level retry logic can catch it. I don't know if we can just add this exception to the try-catch or if something else is actually broken in the install. For now I would need to collect some more information on this behaviour.

tibuch commented 1 year ago

> CryoCARE does increase the n_tile number but despite the increases it always crashes: Out of memory, retrying with n_tiles = (2, 4, 2, 1) ... (2, 8, 8, 1). [...] If I then re-run the job starting at n_tiles=6,6,6 it actually works on the first go.
>
> Maybe it is worth increasing all of the XYZ n_tiles values as it looks like one has stayed at 2? EDIT: Just tried this and increasing the XYZ values equally still doesn't work. Perhaps something to allow the n_tiles to increase a little bit more before bailing?

What are the tile sizes when the tomogram is tiled with (2, 8, 8) compared to (6, 6, 6)? It could be that (2, 8, 8) yields slightly larger tiles than (6, 6, 6) in pixels per tile.
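This is easy to check: pixels per tile is just the per-axis extent divided by the tile count, multiplied across axes. With a hypothetical ~300×928×928 (Z, Y, X) tomogram for illustration, splitting Z only in 2 leaves such long tiles along Z that (2, 8, 8) ends up with larger tiles than (6, 6, 6), even though X and Y are split more finely:

```python
import math

def pixels_per_tile(shape, n_tiles):
    # per-axis tile extent (rounded up), multiplied together
    return math.prod(math.ceil(s / t) for s, t in zip(shape, n_tiles))

shape = (300, 928, 928)  # hypothetical (Z, Y, X) tomogram
print(pixels_per_tile(shape, (2, 8, 8)))  # 150 * 116 * 116 = 2,018,400
print(pixels_per_tile(shape, (6, 6, 6)))  # 50 * 155 * 155 = 1,201,250
```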

The tiling computes the tile size for each axis and doubles the number of tiles along the longest axis. It would only start increasing the number of tiles in Z if the tile size in Z were larger than in X and Y.
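The doubling rule described above can be sketched as follows; the tomogram shape is a hypothetical example chosen so that Z is much shorter than Y and X, which reproduces the kind of progression seen in the log above, with the Z count stuck at 2:

```python
def next_n_tiles(shape, n_tiles):
    # double the tile count along the axis whose tiles are currently longest
    tile_sizes = [s / t for s, t in zip(shape, n_tiles)]
    axis = tile_sizes.index(max(tile_sizes))
    out = list(n_tiles)
    out[axis] *= 2
    return tuple(out)

# Hypothetical (Z, Y, X) tomogram: Z is much shorter than Y and X, so the
# Z tile count never gets doubled and stays at 2 throughout.
shape = (266, 960, 682)
n_tiles = (2, 2, 2)
for _ in range(4):
    n_tiles = next_n_tiles(shape, n_tiles)
    print(n_tiles)
# (2, 4, 2), (2, 4, 4), (2, 8, 4), (2, 8, 8)
```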

I don't understand the EDIT. What do you mean by increasing the XYZ values equally?

Cheers!

EuanPyle commented 1 year ago

Yes, 2,8,8 gives a larger tile size than 6,6,6. Is it possible for the program to keep trying smaller and smaller tiles for a bit longer? Sorry about the edit; I was confused about how the tiling was calculated when I wrote it, so it can be ignored. Thanks!

asarnow commented 1 year ago

I have this error as well, using CUDA 11.0 (as per the instructions) with A6000 cards. My tomograms are not particularly large, 682x960x266, and the GPU has 48GB of memory.

I am able to run prediction using n_tiles: [2,4,2] manually as recommended above.
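For anyone landing here, the workaround is to set n_tiles explicitly in the prediction config rather than relying on the automatic increase. A hypothetical predict config fragment is sketched below; n_tiles is the key discussed in this thread, while the other field names are illustrative and may differ between cryoCARE versions:

```json
{
  "path": "path/to/trained_model",
  "even": "path/to/even.mrc",
  "odd": "path/to/odd.mrc",
  "n_tiles": [2, 4, 2],
  "output": "path/to/output"
}
```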