google-research / xmcgan_image_generation


Changing the batch size and using multiple GPUs causes an "Incompatible shapes" error #9

Open hyeonjinXZ opened 2 years ago

hyeonjinXZ commented 2 years ago

I have a memory issue, so I thought it might be better to reduce the batch size.

I changed only two values in the configuration:

batch_size: 56 -> 4
eval_batch_size: 7 -> 4
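
In xmcgan/configs/coco_xmc.py, the edit amounts to this (a sketch of just the two changed lines; everything else is left at its default):

```python
config.batch_size = 4        # was 56
config.eval_batch_size = 4   # was 7
```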

But it produces a shape error, as shown below.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes at component 0: expected [1,8,17,768] but got [1,32,17,768]. [Op:IteratorGetNext]

  1. What causes this error, and what should I do in this case?
  2. Or could you give me some tips for solving the out-of-memory issue?

My setup is two 1080 Ti GPUs (11 GB each).

woctezuma commented 2 years ago

Your error occurs because you feed [1,32,17,768] (notice the value 32) when [1,8,17,768] is expected (see the value 8). It would be useful to know the full error message, with the exact line where the error occurs.

What you edited must have been: https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/configs/coco_xmc.py#L49-L50

hyeonjinXZ commented 2 years ago

Thank you for your reply. Yes, I edited only the two lines you mentioned.

https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/configs/coco_xmc.py#L49-L50

The full error message is in the attached screenshot.

woctezuma commented 2 years ago

Ok, so the error occurs at: https://github.com/google-research/xmcgan_image_generation/blob/22a7ef2914787904949fe1fc3f5e560f1e75db29/xmcgan/train_utils.py#L421

after a call to: https://github.com/google-research/xmcgan_image_generation/blob/22a7ef2914787904949fe1fc3f5e560f1e75db29/xmcgan/main.py#L62

hyeonjinXZ commented 2 years ago

https://github.com/google-research/xmcgan_image_generation/blob/22a7ef2914787904949fe1fc3f5e560f1e75db29/xmcgan/train_utils.py#L421

Yes, right. And when I print `next(train_iter)` before line 421, it raises the error shown in the attached screenshot.

hyeonjinXZ commented 2 years ago

Or are there other values I should change in the configuration file? https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/configs/coco_xmc.py

woctezuma commented 2 years ago

Actually, if I were you, I would try to change fewer parameters. Try to get the code running without changing the eval batch size.

For instance:

 config.batch_size = 28
 config.eval_batch_size = 7 

or

 config.batch_size = 14
 config.eval_batch_size = 7 

Indeed, I see in the README that:

By default, the configs/coco_xmc.py config is used, which runs an experiment for 128px images. This is able to accommodate a batch size of 8 on each GPU, and achieves an FID of around 10.5 - 11.0 with the EMA weights.

I think you have 2 GPUs. Maybe try with a batch size of at least 16? Maybe 4 was too small.
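
Spelled out, the arithmetic behind that suggestion (taking the README quote at face value) is simply:

```python
per_gpu_batch = 8                        # what configs/coco_xmc.py is tuned for, per the README
num_gpus = 2                             # your setup
global_batch = per_gpu_batch * num_gpus  # = 16
```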

Finally, I see code like the following, which could be where the error arises (I am not sure about that). I would like to know where the value 32 in [1,32,17,768] comes from in your error message.

https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/libml/input_pipeline.py#L43-L47

where:

https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/configs/coco_xmc.py#L57
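
My guess (the function and variable names below are assumed, not copied from the repository) is that the per-device train batch is derived from batch_size, the device count, and d_step_per_g_step, roughly like this:

```python
import jax

# Hypothetical sketch of how per_device_batch_size_train might be computed --
# not the repository's actual code.
def guess_per_device_batch_size_train(batch_size: int, d_step_per_g_step: int) -> int:
    num_devices = jax.local_device_count()
    assert batch_size % num_devices == 0, "batch_size must split evenly across devices"
    # If each iterator element packs d_step_per_g_step discriminator steps,
    # the second dimension of the element grows accordingly.
    return batch_size // num_devices * d_step_per_g_step
```

If that guess is right, the 32 in your shape would be the per-device batch multiplied by d_step_per_g_step.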

As a side note, there is no enforcement that eval_batch_size be divisible by the number of GPUs, so you should be able to leave it at 7.

https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/libml/input_pipeline.py#L88-L89

hyeonjinXZ commented 2 years ago

@woctezuma When I use one GPU with the configuration below, it works :) Thank you for your kind reply.

 config.batch_size = 8
 config.d_step_per_g_step = 4

But when I use two GPUs and a batch size of 16, it produces the error shown in the attached screenshot. The expected leading batch dimension changed from 1 to 2.

  1. Why does this error occur, and how can I fix it?
  2. Can changing the batch size and d_step_per_g_step affect performance, for better or worse?

woctezuma commented 2 years ago

It is hard to say, but I believe the error that you see comes from a line like this one:

https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/libml/input_pipeline.py#L83

where the first batch dimension is the number of GPUs.

In your error message, it seems the code is expecting to see data chunked for 2 GPUs but receives data for 1 GPU.
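
Roughly speaking, jax.pmap expects the host batch reshaped so that the leading axis equals the local device count (a generic JAX sketch, not the repository's actual pipeline code):

```python
import jax
import numpy as np

num_devices = jax.local_device_count()  # 2 on your machine
per_device_batch = 32

# A host-level batch shaped [total, 17, 768], matching the trailing dims in your error.
batch = np.zeros((num_devices * per_device_batch, 17, 768), dtype=np.float32)

# Reshape to [num_devices, per_device_batch, 17, 768] == [2, 32, 17, 768],
# i.e. one leading chunk per GPU.
sharded = batch.reshape((num_devices, per_device_batch) + batch.shape[1:])
```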

I wonder whether there is an option to toggle on support for multiple GPUs. That being said, it could be something else; I don't have the expertise here.

kohjingyu commented 2 years ago

@Hyeonjin1989 I believe multiple GPUs should be supported natively by JAX? I did not have to do anything special when running it on more than one GPU. Can you print jax.local_device_count() to see what value is returned?
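
For example:

```python
import jax

print(jax.local_device_count())  # should print 2 if both GPUs are visible to JAX
```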

As for your second question: We did find that performance is quite sensitive to batch size. I've never run the model on 2 GPUs, so I suggest that you do some quick hyperparameter sweeps if possible to find the best performance.

hyeonjinXZ commented 2 years ago

@woctezuma @kohjingyu Thank you for your kind help :)

I printed jax.local_device_count() and per_device_batch_size_train right before the line below; they are 2 and 32, respectively.

https://github.com/google-research/xmcgan_image_generation/blob/22a7ef2914787904949fe1fc3f5e560f1e75db29/xmcgan/libml/input_pipeline.py#L71

kohjingyu commented 2 years ago

Can you paste your full error log?

hyeonjinXZ commented 2 years ago

@kohjingyu This is my full error log (screenshot attached). The error message is:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes at component 0: expected [2,32,17,768] but got [1,32,17,768]. [Op:IteratorGetNext]

I also printed jax.local_device_count() and per_device_batch_size_train; please find them in the log (screenshot attached).

hyeonjinXZ commented 2 years ago

Using multiple GPUs causes the issue. I found a similar issue saying that the input data size must be a multiple of the number of GPUs. How can I trim the data size to match?

Error: " tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes at component 0: expected [2,32,17,768] but got [1,32,17,768]." image

woctezuma commented 2 years ago

To be clear, the StackOverflow answer comes from https://github.com/kuza55/keras-extras/issues/7#issuecomment-304923136. I am not sure it is relevant, as you can see that it has received as many thumbs up as thumbs down.

Are you forced to edit d_step_per_g_step? What was the error which prompted this change?

Also, is it normal that you have these lines in your log (reminiscent of #8)?

(screenshot of the log)

hyeonjinXZ commented 2 years ago

  1. Changing only batch_size to 14, while leaving config.d_step_per_g_step = 2 unchanged, produces an 'Incompatible shapes' error, and the code below may be the cause. If I set batch_size = 14 and d_step_per_g_step = 8, it works without the 'Incompatible shapes' error on one GPU.

https://github.com/google-research/xmcgan_image_generation/blob/edbff388c0c8e35af24d7bf8b7dbc4375729dcab/xmcgan/libml/input_pipeline.py#L47

  2. I don't have a TPU, so the lines below appear as expected:

    I1107 21:05:37.584076 47723213264704 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
    I1107 21:05:37.818347 47723213264704 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
    /localscratch/xianzhen.18653261.0/env_xmc_gan_v1/lib/python3.8/site-packages/jax/lib/xla_bridge.py:368: UserWarning: jax.host_count has been renamed to jax.process_count. This alias will eventually be removed; please update your code.