google-research / pix2seq

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Apache License 2.0

Hi, I get an error message like this: #19

Open ross-Hr opened 1 year ago

ross-Hr commented 1 year ago

```
2022-10-12 15:43:57.254005: W tensorflow/core/grappler/optimizers/data/slack.cc:103] Could not find a final `prefetch` in the input pipeline to which to introduce slack.
I1012 15:43:57.996680 140468541171456 api.py:459] train_step begins...
I1012 15:44:07.279798 140468532778752 api.py:459] train_step begins...
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:10.852259 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:17.169317 140468541171456 api.py:446] Trainable variables:
I1012 15:44:17.426999 140468541171456 api.py:446] vit/stem_conv/kernel:0 (16, 16, 3, 768)
I1012 15:44:17.432081 140468541171456 api.py:446] vit/stem_conv/bias:0 (768,)
I1012 15:44:17.436969 140468541171456 api.py:446] vit/stem_ln/gamma:0 (768,)
....
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:31.484436 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:37.695064 140468532778752 api.py:459] train_step ends...
I1012 15:44:38.920633 140468541171456 api.py:459] train_step ends...
2022-10-12 15:45:08.671253: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 351529
Traceback (most recent call last):
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in get_area
    retval_1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in <listcomp>
    retval_1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)
KeyError: 351529
2022-10-12 15:45:08.671413: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 415619
Traceback (most recent call last):
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in get_area
    retval_1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in <listcomp>
    retval_1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)
KeyError: 415619
```

My GPUs are 2 × RTX 3070 with 8 GB each.

ross-Hr commented 1 year ago

Is the GPU memory too small?

chentingpc commented 1 year ago

This looks like a data issue, since the complaint is about a KeyError, probably related to an image id.
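
One rough way to check that is to look up the ids from the traceback directly in the annotation JSON. The sketch below is hypothetical: it assumes a standard COCO-style instances file, and the path is illustrative only.

```python
# Hypothetical sanity check of a COCO-style annotation JSON: verify that the
# ids from the KeyError exist, and that every annotation points at a listed image.
import json

ann_file = "/data/c/annotations/instances_train2017.json"  # hypothetical path
with open(ann_file) as f:
    coco = json.load(f)

image_ids = {img["id"] for img in coco["images"]}
ann_ids = {ann["id"] for ann in coco["annotations"]}

for bad_id in (351529, 415619):  # ids reported in the traceback above
    print(bad_id, "in annotation ids:", bad_id in ann_ids,
          "| in image ids:", bad_id in image_ids)

orphans = [a["id"] for a in coco["annotations"] if a["image_id"] not in image_ids]
print(len(orphans), "annotations reference images missing from the JSON")
```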


ross-Hr commented 1 year ago

It was an annotations error; I reloaded the annotations to fix it. But now I get a new error like this:

```
W1018 09:27:13.350448 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense1.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
W1018 09:27:13.350490 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias
W1018 09:27:13.350531 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias
```

My tf==2.10.0


chentingpc commented 1 year ago

This looks like the checkpoint you specified (either the pretrained checkpoint, or a checkpoint restored from a previous run in the same model directory) does not match the configured architecture/encoder. Please check that the architecture/encoder variant, depth, dim, etc. match.
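
One quick way to compare is to list what the checkpoint actually contains and eyeball the variable names and shapes against the configured model. A minimal sketch, assuming only the checkpoint directory mentioned in this thread:

```python
# List the variables stored in the checkpoint and compare their names/shapes
# against the configured architecture (e.g. number of decoder layers, dims).
import tensorflow as tf

ckpt_dir = "/data/c/Objects365-vitb-640/"  # directory from this thread
ckpt_path = tf.train.latest_checkpoint(ckpt_dir)
for name, shape in tf.train.list_variables(ckpt_path):
    # Filter to a few interesting prefixes to keep the output readable.
    if "dec_layers" in name or "stem_conv" in name:
        print(name, shape)
```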


ross-Hr commented 1 year ago


I git cloned the repo and did not change anything. Which version of TF are you using? I put the Objects365 checkpoint into model_dir, and the command looks like this:

```
python3 run.py --mode=train \
  --model_dir=/data/c/Objects365-vitb-640/ \
  --config=configs/config_det_finetune.py \
  --config.dataset.data_dir=/data/c/pix2seq \
  --config.dataset.coco_annotations_dir=/data/c/annotations \
  --config.train.batch_size=8 \
  --config.train.epochs=20 \
  --config.optimization.learning_rate=3e-5
```

but I get the above error. The config.dataset.data_dir is my offline COCO TFDS.
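
If the data side is in doubt, a quick sanity check of the offline TFDS directory looks roughly like this. The TFDS name "coco/2017" is an assumption here; use whatever name the repo's config actually expects.

```python
# Quick check that the offline TFDS data directory is readable and complete.
# "coco/2017" is an assumed TFDS name; adjust to match the config.
import tensorflow_datasets as tfds

builder = tfds.builder("coco/2017", data_dir="/data/c/pix2seq")
print(builder.info.splits)                 # expect train/validation with example counts
example = next(iter(builder.as_dataset(split="train")))
print(example.keys())                      # expect 'image', 'objects', and id fields
```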

By the way, I wonder if this is wrong (screenshot attached).

ross-Hr commented 1 year ago

Well, I changed the code in model.py, `latest_ckpt, ckpt, self._verify_restored = utils.restore_from_checkpoint(model_dir, False, model=model, global_step=optimizer.iterations, optimizer=optimizer)`, switching the False to True, i.e. using `checkpoint.restore(latest_ckpt).expect_partial()`. That avoids the error, but I am still confused about why.
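
For reference, a minimal self-contained sketch (toy model, hypothetical directory, not the repo's own API) of what that switch changes: the warnings above come from checkpoint values, such as the optimizer's 'v' slots, that cannot be matched to objects in the current program, and expect_partial() simply tells TF to ignore them.

```python
import tensorflow as tf

# Toy stand-ins; in the repo these would be the pix2seq model and its optimizer.
model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
optimizer = tf.keras.optimizers.Adam()
model_dir = "/tmp/ckpt_demo"  # hypothetical directory

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
latest_ckpt = tf.train.latest_checkpoint(model_dir)

if latest_ckpt:
    status = ckpt.restore(latest_ckpt)
    # Strict: raises if any checkpoint value (e.g. optimizer slots 'm'/'v')
    # could not be matched to an object in this program.
    # status.assert_consumed()

    # Lenient: silently ignore unmatched values and restore only what matches.
    status.expect_partial()
```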

ross-Hr commented 1 year ago

@chentingpc Hi, do you know how to debug code inside strategy.run(...) in the train_multiple_steps function? I cannot step into the train_step function.

chentingpc commented 1 year ago

You should be able to use pdb in the code when running in eager mode.
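
A minimal sketch of the usual TF switches for that (not specific to this repo): disabling tf.function tracing makes the body passed to strategy.run execute eagerly, so breakpoints inside train_step are actually hit.

```python
import tensorflow as tf

# Run tf.function-decorated code (including the fn passed to strategy.run)
# eagerly so a Python debugger can step into it.
tf.config.run_functions_eagerly(True)
# Optionally force tf.data transformations to run eagerly as well.
tf.data.experimental.enable_debug_mode()

# Then drop a breakpoint inside train_step, e.g.:
#   import pdb; pdb.set_trace()
```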