Issue with Training - Generator error

junyongyou / triq

TRIQ implementation

MIT License


Issue with Training - Generator error #19

Closed ngun7 closed 2 years ago

ngun7 commented 2 years ago

Hello! I followed all the training instructions and prepared the data and labels accordingly. When I run the training script, it runs for a few steps (say, 170/2135) and then stops, throwing exceptions.

I then changed return np.array(images_aug), np.array(y_scores) to return np.array(images_aug, dtype='object'), np.array(y_scores, dtype='object'), but now the script just hangs and, after a while, uses very little GPU memory (700MB/16GB). I even tried training from scratch (without loading the ImageNet pretrained weights), but still no luck.

My conda env details: tensorflow-gpu==2.1.0, tensorflow_addons==0.8.3, h5py==2.10.0

junyongyou commented 2 years ago

I think you probably need to check your data (images). You don't need to convert the format to object.
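One possible reading of the conversion error, assumed here rather than confirmed in the thread, is that a batch contained images of different sizes. NumPy cannot stack such arrays into one numeric array, and passing dtype='object' only hides the mismatch instead of fixing it. A minimal sketch with small placeholder arrays (distinct_shapes is a hypothetical helper, not part of triq):

```python
import numpy as np


def distinct_shapes(images):
    """Return the sorted list of distinct array shapes found in a batch."""
    return sorted({np.shape(img) for img in images})


# Placeholder batches; real batches would hold decoded images.
consistent = [np.zeros((4, 6, 3)), np.zeros((4, 6, 3))]
mismatched = [np.zeros((4, 6, 3)), np.zeros((5, 6, 3))]

# Equal shapes stack into one dense numeric array.
stacked = np.array(consistent)
print(stacked.shape, stacked.dtype)

# dtype=object "succeeds", but yields an array of two separate arrays
# that cannot be treated as a single image tensor.
hidden = np.array(mismatched, dtype=object)
print(hidden.shape, hidden.dtype)

# Listing the shapes shows which sizes disagree.
print(distinct_shapes(mismatched))
```

If a check like this reports more than one shape for a batch, the resizing step is the place to look, not the dtype.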

ngun7 commented 2 years ago

During inference, the image size is reduced to (512,384) with a maximum_positional_encoding size of 193. I want to train a model on bigger images so that I don't have to downsample during prediction. During splitting, I resized the koniq_normal and koniq_small images to (1024,768) and set the maximum positional encoding to 769. Do you think this might be the issue? https://github.com/junyongyou/triq/blob/5a0a79714dd9e1aeb17cfe8430e33d38e16f3187/src/databases/random_split_imageset.py#L16

junyongyou commented 2 years ago

Hi, I don't think that matters. I meant that you need to check your data: the model had already trained for 170 steps without any problems before the generator threw a data conversion error, so I would guess something is wrong with the image data at the 171st step. You can try running only the generator on the images and see whether you get correct data at every step.
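Running the generator on its own could look roughly like the sketch below. It assumes the generator supports len() and integer indexing (as a Keras Sequence does) and returns an (images, scores) pair per step. find_bad_batches is a hypothetical helper, not a function from the triq repository.

```python
import numpy as np


def find_bad_batches(generator, n_steps=None):
    """Index every batch once and collect (step, error) for each failure."""
    n_steps = len(generator) if n_steps is None else n_steps
    failures = []
    for step in range(n_steps):
        try:
            images, scores = generator[step]
            # Conversion fails or falls back to dtype=object when the
            # items in a batch do not share one shape.
            images = np.asarray(images)
            scores = np.asarray(scores)
            if images.dtype == object or scores.dtype == object:
                raise ValueError("items in the batch have inconsistent shapes")
            if len(images) != len(scores):
                raise ValueError("image and score counts differ")
            if not np.isfinite(scores.astype(float)).all():
                raise ValueError("scores contain NaN or inf")
        except Exception as exc:  # keep going so every bad step is listed
            failures.append((step, repr(exc)))
    return failures
```

Called as find_bad_batches(train_generator), where train_generator stands for whatever generator object the training script builds, it returns an empty list when every step yields usable data. Otherwise the step indices point at the batches whose images need inspecting.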

ngun7 commented 2 years ago

Hello @junyongyou, thanks for the reply. I checked the data (I even tried with a single dataset, koniq_normal) and everything seems fine. Beyond the generator error, the script stops with a "val_loss" key error. This happens every time after running for a few steps in an epoch. Attaching screenshot:

junyongyou commented 2 years ago

Hi, I have just downloaded the code and trained for a couple of epochs, and everything was fine. I really don't know where your problem comes from. My current TF version is 2.5.1. Maybe check your TF version?
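To compare environments, the installed package versions can be listed with the standard library alone. This is a minimal sketch, assuming Python 3.8 or newer. installed_versions is a hypothetical helper, and the package names are taken from the conda details earlier in the thread, plus tensorflow and numpy.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    names = ["tensorflow", "tensorflow-gpu", "tensorflow-addons", "h5py", "numpy"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```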