marco-c / autowebcompat

Automatically detect web compatibility issues
Mozilla Public License 2.0

Error while trying to run vgg16 with input shape (224, 224) #191

Closed sagarvijaygupta closed 5 years ago

sagarvijaygupta commented 6 years ago

    tcmalloc: large alloc 3666862080 bytes == 0x52d8000 @ 0x7fb937352f21 0x7fb934aa2ae5 0x7fb934b05a13 0x7fb934b07956 0x7fb934b9f728 0x4c4b0b 0x54f3c4 0x551ee0 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54ff73 0x42b3c9 0x42b5b5 0x44182b 0x421f64 0x7fb93619a1c1 0x42201a (nil)
    tcmalloc: large alloc 3666862080 bytes == 0x1bc4ee000 @ 0x7fb937351107 0x7fb934aa29a1 0x7fb934b08690 0x7fb934afdc15 0x7fb934ba03b3 0x4c4b0b 0x54f3c4 0x551ee0 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54ff73 0x42b3c9 0x42b5b5 0x44182b 0x421f64 0x7fb93619a1c1 0x42201a (nil)

Shashi456 commented 6 years ago

Isn't this similar to #179?

sagarvijaygupta commented 6 years ago

@Shashi456 We load all the images in get_ImageDataGenerator, which at present means 2 × 3046 images at a size of 224x224 (we can't keep the size at 32x24, since @marco-c mentioned that value was arbitrary). Since all images are loaded into memory, I think this results in the memory crash. #179 showed an inconsistency with the shape; this one shows a memory crash when I change the shape.

Shashi456 commented 6 years ago

@sagarvijaygupta I'd like to replicate this bug, you just ran train.py right?

sagarvijaygupta commented 6 years ago

Just change the load_image target size in autowebcompat/utils.py to 224x224 and run train.py for vgg16.
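
Concretely, the change looks roughly like this (a sketch based on the load_image shown later in this thread; the function in the repo may differ, e.g. it also caches images):

    # autowebcompat/utils.py (sketch, not the repo's exact code)
    import os

    import keras
    from keras.preprocessing.image import img_to_array, load_img


    def load_image(fname, parent_dir='data_resized'):
        # Load the screenshot at 224x224 (instead of the previous 32x24) to match VGG16's input.
        img = load_img(os.path.join(parent_dir, fname), target_size=(224, 224))
        return img_to_array(img, data_format=keras.backend.image_data_format())

Then run train.py selecting the vgg16 network.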

sagarvijaygupta commented 6 years ago

Update: I am able to run the inception model with size (32x24), but it fails with size (100x100) without even giving a tcmalloc warning.

sagarvijaygupta commented 6 years ago

Update: I was able to run the vgg16 model, but only with size (100x100) and batch size = 1. Google Colab has nearly 13 GB of RAM.

Shashi456 commented 6 years ago

@sagarvijaygupta I've been running into allocation issues, but I don't think it's because of the size. I was trying it locally on my PC. I'll try again after rectifying the issue and update here.

And can you try 100x100 for all of them? We could make that the default if it works for all of them.

sagarvijaygupta commented 6 years ago

@Shashi456 Try with the configuration I specified.

> And can you try 100x100 for all of them? We could make that the default if it works for all of them.

It is running for all models. We could do that, but then we wouldn't be able to run the pretrained models, since they take images larger than 100x100.
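
For context, this is the constraint on the pretrained networks (a minimal illustration using the Keras applications API, not code from the repo):

    from keras.applications.vgg16 import VGG16

    # With the ImageNet classification head attached, VGG16's input is fixed
    # at 224x224x3; inputs such as 100x100 are only possible with include_top=False.
    model = VGG16(weights='imagenet', include_top=True)
    print(model.input_shape)  # (None, 224, 224, 3)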

marco-c commented 6 years ago

The size should be 2×3046×224×224×4 bytes, so 1222688768 bytes, so 1194032 KB, so ~1166 MB, so ~1.14 GB. Maybe there's something else causing these OOMs?

Shashi456 commented 6 years ago

@marco-c I had an OOM error with 3.69 GB of GPU memory available; now that you've worked out the expected size, I wonder what the cause could be.

If we are doing it on a GPU, we could check out TensorFlow's allow_growth configuration, as sketched below.
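
For the TensorFlow backend (the TF 1.x / Keras 2.1-era API in use here), that configuration would look roughly like this; a sketch, not code from the repo:

    import tensorflow as tf
    from keras import backend as K

    # Let the TensorFlow session grow GPU memory usage on demand instead of
    # reserving nearly all of the GPU memory up front.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=config))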

sagarvijaygupta commented 6 years ago

@marco-c I think in

> 2×3046×224×224×4 bytes

we missed a factor of 3 for the channels, which makes it 3668066304 bytes, ~3.4 GB, essentially the size in the tcmalloc warning. As I mentioned, the tcmalloc part was just a warning, but the training was still getting killed. So the issue should not be the memory itself but something else.
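
For reference, the corrected back-of-the-envelope numbers (plain arithmetic, not code from the repo):

    # 2 screenshots per pair x 3046 pairs, 224x224 RGB (3 channels), float32 (4 bytes).
    print(2 * 3046 * 224 * 224 * 3 * 4)       # 3668066304 bytes, ~3.42 GiB
    # The allocation actually logged by tcmalloc corresponds to slightly fewer images:
    print(3666862080 // (224 * 224 * 3 * 4))  # 6090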

marco-c commented 6 years ago

Right, I forgot that :smile:. But even then, the images should be loaded progressively, so there shouldn't be a single allocation that large.

sagarvijaygupta commented 6 years ago

It is because we load all images at once.

get_ImageDataGenerator, called from train.py with

    data_gen = utils.get_ImageDataGenerator(all_images, input_shape)

takes all_images, and we load all of them into x:

    x = np.zeros((len(images),) + image_shape, dtype=keras.backend.floatx())

When I printed x.nbytes it showed 3666862080, which is the same value as in the tcmalloc warning.

marco-c commented 6 years ago

> It is because we load all images at once.

We don't actually load them all at once; I think the issue is this allocation: https://github.com/marco-c/autowebcompat/blob/051bfefd26e77f4aea902320e8ae9f1a35213d7e/autowebcompat/utils.py#L99.

sagarvijaygupta commented 6 years ago

In the next line

https://github.com/marco-c/autowebcompat/blob/051bfefd26e77f4aea902320e8ae9f1a35213d7e/autowebcompat/utils.py#L101

I think we do load them into x.

marco-c commented 6 years ago

> I think we do load them into x.

Yes, but we load them progressively and not all at once.

marco-c commented 6 years ago

So there are two big allocations here:

1. https://github.com/marco-c/autowebcompat/blob/051bfefd26e77f4aea902320e8ae9f1a35213d7e/autowebcompat/utils.py#L99
2. https://github.com/keras-team/keras/blob/2.1.1/keras/preprocessing/image.py#L688

marco-c commented 6 years ago

We could temporarily comment out https://github.com/marco-c/autowebcompat/blob/051bfefd26e77f4aea902320e8ae9f1a35213d7e/autowebcompat/utils.py#L99-L104; after all, we are not currently doing any of the things for which fit would be needed.
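
To make that concrete, here is a rough sketch of what the relevant part of get_ImageDataGenerator could look like with that block disabled. It is reconstructed from the snippets quoted in this thread; the ImageDataGenerator arguments, the loading loop, and the load_image call are assumptions, not a copy of the repo's code:

    from keras.preprocessing.image import ImageDataGenerator


    def get_ImageDataGenerator(images, image_shape):
        datagen = ImageDataGenerator(rescale=1. / 255)

        # Temporarily disabled: this preallocation (together with the
        # allocation inside Keras's ImageDataGenerator.fit referenced above)
        # is what triggers the ~3.4 GB allocation, and fit() is only needed
        # for featurewise statistics / ZCA whitening, which we do not
        # currently use.
        # x = np.zeros((len(images),) + image_shape, dtype=keras.backend.floatx())
        # for i, fname in enumerate(images):
        #     x[i] = load_image(fname)
        # datagen.fit(x)

        return datagen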

sagarvijaygupta commented 6 years ago

As discussed on IRC, to run the model with target shape (224x224) on Google Colab you will need the following configuration:

  1. Do not cache images in load_image, i.e.,

       def load_image(fname, parent_dir='data_resized'):
           img = load_img(os.path.join(parent_dir, fname), target_size=(224, 224))
           x = img_to_array(img, data_format=keras.backend.image_data_format())
           return x

marco-c commented 6 years ago

Maybe 3 is not actually needed, can you retest with only 1 and 2?

sagarvijaygupta commented 6 years ago

Yes, 3 is not needed. When I run the script I do still get a tcmalloc warning, but training continues after that without being killed.

sagarvijaygupta commented 6 years ago

Update: With the new notebook, 1 is also not needed. :smile: