Isn't this similar to #179?
@Shashi456 We load all the images (presently 2 * 3046 of them) in `get_ImageDataGenerator` at a given size of 224x224 (we can't keep the size at 32x24, as @marco-c mentioned it was arbitrary). Since all images are loaded into memory, I think this results in a memory crash. #179 showed an inconsistency with shape; this shows a memory crash if I change the shape.
@sagarvijaygupta I'd like to replicate this bug, you just ran train.py right?
Just change the `target_size` in `load_image` in `autowebcompat/utils.py` to 224x224 and run train.py for vgg16.
Update: I am able to run the inception model with size (32x24), but it fails with size (100x100) without even giving a tcmalloc warning.
Update: I was able to run the vgg16 model, but with size (100x100) and batch size = 1. Google Colab has nearly 13 GB of RAM.
@sagarvijaygupta I've been running into allocation issues, but I don't think it's because of the size. I was trying it locally on my PC. I'll try it again after rectifying the issue and update here.
And can you try 100x100 for all of them? We could make that the default if it works for all of them.
@Shashi456 Try with the configuration I specified.
> And can you try 100x100 for all of them? We could make that the default if it works for all of them.
It is running for all models. We can do that, but we won't be able to run pretrained models, as they take images of size larger than 100x100.
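For context, this is the constraint imposed by the pretrained Keras applications (a sketch using `keras.applications`; the exact model setup in train.py may differ):

```python
from keras.applications import VGG16

# With the pretrained classifier head, VGG16 accepts only 224x224x3 input:
full = VGG16(weights='imagenet', include_top=True, input_shape=(224, 224, 3))

# Smaller inputs such as 100x100 work only without the top layers:
base = VGG16(weights='imagenet', include_top=False, input_shape=(100, 100, 3))
```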
The size should be 2×3046×224×224×4 bytes, so 1222688768 bytes, so 1194032 KB, so ~1166 MB, so ~1.14 GB. Maybe there's something else causing these OOMs?
@marco-c I had an OOM error with 3.69 GB of GPU memory available; now that you've explained the size, I wonder what the cause could be?
If we are doing it on a GPU, we could check out the TensorFlow `allow_growth` configuration.
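For reference, enabling it with the TF 1.x API in use at the time would look roughly like this (a sketch; the session setup is an assumption, not code from the repo):

```python
import tensorflow as tf
from keras import backend as K

# Let TensorFlow grow GPU memory on demand instead of
# reserving nearly all of it upfront.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```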
@marco-c I think in 2×3046×224×224×4 bytes we missed a factor of 3 for the channels, which makes it 3668066304 bytes, ~3.4 GB, the same as the tcmalloc warning. As I mentioned, the tcmalloc part was just a warning, but the training was getting killed. So the issue should not be with the memory but something else.
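Spelling the corrected arithmetic out (plain Python, numbers taken from the thread):

```python
images = 2 * 3046   # image pairs loaded
h, w, c = 224, 224, 3
bytes_per_float32 = 4

total = images * h * w * c * bytes_per_float32
print(total)       # 3668066304 bytes, ~3.4 GB
print(total // 3)  # 1222688768, the earlier estimate that missed the channels
```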
Right, I forgot that :smile:. But even then, the images should be loaded progressively, so there shouldn't be such a single large allocation.
It is because we load all images at once in `get_ImageDataGenerator` (called from train.py with `data_gen = utils.get_ImageDataGenerator(all_images, input_shape)`), which takes `all_images` and loads all of them into `x`:

```python
x = np.zeros((len(images),) + image_shape, dtype=keras.backend.floatx())
```

When I printed `x.nbytes` it showed 3666862080, which is the same as the tcmalloc warning.
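For reference, the allocation pattern being described is roughly the following (a sketch with assumed names and shapes, not the exact repo code):

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

n_images = 2 * 3046
image_shape = (224, 224, 3)

# One array large enough for every image is allocated upfront...
x = np.zeros((n_images,) + image_shape, dtype='float32')
# ...and each image is then loaded into its own row, one at a time.
print(x.nbytes)  # 3668066304, ~3.4 GB

data_gen = ImageDataGenerator()
data_gen.fit(x)  # the big array exists only so that fit() can run
```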
> It is because we load all images at once.
We don't actually load them all at once, I think the issue is this allocation: https://github.com/marco-c/autowebcompat/blob/051bfefd26e77f4aea902320e8ae9f1a35213d7e/autowebcompat/utils.py#L99.
In the next line I think we do load them into `x`.
> I think we do load them into x.
Yes, but we load them progressively and not all at once.
We could temporarily comment out https://github.com/marco-c/autowebcompat/blob/051bfefd26e77f4aea902320e8ae9f1a35213d7e/autowebcompat/utils.py#L99-L104; after all, we are not currently doing any of the things for which `fit` would be needed.
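If `fit` is dropped, one way to avoid the single large allocation is a generator that materializes only one batch at a time; a minimal sketch (the `load_image` parameter and batch layout are assumptions, not the project's actual API):

```python
import numpy as np

def batch_generator(fnames, batch_size, load_image):
    # Load images lazily: peak memory is one batch, not the whole dataset.
    while True:
        for start in range(0, len(fnames), batch_size):
            chunk = fnames[start:start + batch_size]
            yield np.stack([load_image(f) for f in chunk])
```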
As discussed on IRC, to run the model with target shape (224x224) on Google Colab you will require the following config: use resized images in `load_image`, i.e.,

```python
# (load_img, img_to_array, os and keras are already imported in utils.py)
def load_image(fname, parent_dir='data_resized'):
    img = load_img(os.path.join(parent_dir, fname), target_size=(224, 224))
    x = img_to_array(img, data_format=keras.backend.image_data_format())
    return x
```
Maybe 3 is not actually needed, can you retest with only 1 and 2?
Yes, 3 is not needed. Though when I run the script, I do get a tcmalloc warning, but training continues after that without being killed.
Update: With the new notebook, 1 is also not needed. :smile: