alloc Error when trying to train the model

fuzzballb commented 7 years ago

I have generated a lot of images by driving around. But when I try to train the model, I keep getting the error below, even when training on a cloud image with 56 GB of RAM.

computations.
Training Samples: 100%|██████████████████| 24576/24576 [00:55<00:00, 461.85it/s]
Manual Fit. Epoch 00/05: loss:   1069.2 - val_loss    606.2
Training Samples: 100%|██████████████████| 24576/24576 [00:52<00:00, 465.50it/s]
Manual Fit. Epoch 01/05: loss:    510.4 - val_loss    406.8
Training Samples: 100%|██████████████████| 24576/24576 [00:53<00:00, 465.65it/s]
Manual Fit. Epoch 02/05: loss:    342.5 - val_loss    250.9
Training Samples: 100%|██████████████████| 24576/24576 [00:53<00:00, 464.12it/s]
Manual Fit. Epoch 03/05: loss:    244.7 - val_loss    196.1
Training Samples: 100%|██████████████████| 24576/24576 [00:53<00:00, 461.83it/s]
Manual Fit. Epoch 04/05: loss:    188.0 - val_loss    143.4
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

To get it to work, I changed the Keras import, because the module has been renamed: from keras.utils.vis_utils import plot_model

manavkataria commented 7 years ago

Thanks for letting me know, @fuzzballb. It sounds like you're saying it's a Keras versioning issue. What version of Keras are you using? If you send me a PR, I can review your contribution and merge it.

Thanks, Manav

fuzzballb commented 7 years ago

Hi Manav,

My last comment was just to make clear that the keras.utils.visualize_util module has been renamed: http://stackoverflow.com/questions/43511819/importerror-no-module-named-keras-utils-visualize-util

But I don't think that has anything to do with the out-of-memory error I am getting. For some reason, saving the model raises this memory error, and I have no idea why. I tried changing the sample size from 1024 to 512, but that didn't help either.

I have also tried reducing the image size in settings.py, but then training won't start because I get this error:

ValueError: Error when checking input: expected Conv1_input to have shape (None, 66, 200, 1) but got array with shape (512, 33, 100, 1)

I have no idea why it still expects the shape (None, 66, 200, 1). As far as I can tell, it is not defined in the code or in the settings.
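In the ValueError above, None stands for the batch dimension and matches any size, so the mismatch is in the height and width: the model was built for 66x200 inputs while the arrays are 33x100. Where the 66x200 value is still being set is not established in this thread. A minimal sketch of the same comparison in plain Python, using a hypothetical helper `shape_matches`, that could be run on the data before training starts:

```python
def shape_matches(expected, actual):
    """Compare a model input shape with an array shape.

    A None entry in `expected` (the batch dimension) matches any size.
    """
    if len(expected) != len(actual):
        return False
    return all(e is None or e == a for e, a in zip(expected, actual))


# Shapes taken from the error message above.
expected_shape = (None, 66, 200, 1)
batch_shape = (512, 33, 100, 1)

if not shape_matches(expected_shape, batch_shape):
    print("Input shape %s does not match model input %s"
          % (batch_shape, expected_shape))
```

With Keras, `model.input_shape` reports the expected tuple, so printing it right after the model is built shows which value the first layer actually received.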

As mentioned before, it is unlikely that my cloud VM doesn't have enough RAM.

Hope you can help.

manavkataria commented 7 years ago

Can you please share the stacktrace for the out of memory error? Your first comment does not mention it.

fuzzballb commented 7 years ago

Settings for testing the model save: epochs: 1, batch size: 64

Running Fit Generator (Manual=True)
Training Samples: 100%|████████████████████████████████████████████| 24576/24576 [00:53<00:00, 459.47it/s]
Manual Fit. Epoch 00/01: loss:    175.8 - val_loss     56.3
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

That is all I get. It seems to be the C++ error thrown when you run out of memory. My guess is that the library used for saving the model raises this error, but I don't know where to get more information.

UPDATE: I have also trained the model on a Windows VM, and even though I got the following error, it did train the model.

Traceback (most recent call last):
  File "model.py", line 411, in <module>
    main()
  File "model.py", line 403, in main
    pickle.dump([history.history, X_balanced, y_balanced, y_train], open('save/hist_xy.p', 'wb'))
OverflowError: cannot serialize a bytes object larger than 4 GiB

What is "hist_xy.p" used for?
If you have any idea why saving the model crashes on Linux, that would help a lot. My hunch is that it has something to do with "pickle", so I might swap it out for "streaming-pickle".
Being able to change the image size would also help.

Thanks in advance

manavkataria commented 7 years ago

Looks like you've modified the original model.py. I don't have line #403 in my code. I have not encountered that error in my training/test runs on Ubuntu or MacOSX. The objective of the equivalent line at [1] is to dump the list [history.history, X_balanced, y_balanced, y_train] to a pickle file hist_xy.p to be retrieved later by the plotter for visualization [2].

[1] https://github.com/manavkataria/behavioral_cloning_car/blob/master/model.py#L391
[2] https://github.com/manavkataria/behavioral_cloning_car/blob/master/plotter.py#L75
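Regarding the OverflowError in the Windows traceback above: pickle protocols below 4 cannot serialize a single bytes object larger than 4 GiB, while protocol 4 (available since Python 3.4) supports very large objects. A minimal sketch, assuming the dump was running with an older default protocol; `dump_history` and `load_history` are hypothetical helper names, not functions from the repository, and this does not address the std::bad_alloc seen on Linux:

```python
import pickle


def dump_history(obj, path):
    """Write obj to path with pickle protocol 4, which allows large objects."""
    # The context manager closes the file even if pickling fails.
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=4)


def load_history(path):
    """Read back an object written by dump_history."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In model.py, the equivalent call would pass the same list as before, for example `dump_history([history.history, X_balanced, y_balanced, y_train], 'save/hist_xy.p')`. Loading needs no protocol argument because pickle detects it from the file.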
