Fitting model only performs on one epoch and then exits

vlawhern / arl-eegmodels

This is the Army Research Laboratory (ARL) EEGModels Project: A Collection of Convolutional Neural Network (CNN) models for EEG signal classification, using Keras and Tensorflow


Fitting model only performs on one epoch and then exits #22

Closed moonj94 closed 3 years ago

moonj94 commented 3 years ago

When I run the ERP.py script (after fixing other issues flagged by previous posts), it runs only one epoch at this line and then exits:

fittedModel = model.fit(X_train, Y_train, batch_size = 16, epochs = 300, verbose = 2, validation_data=(X_validate, Y_validate), callbacks=[checkpointer], class_weight = class_weights)

This is the console output:

Train on 144 samples, validate on 72 samples
Epoch 1/300

Could you please help me fix this issue?

Thanks!

vlawhern commented 3 years ago

Hi there,

The script works for me out of the box. If it runs for only one epoch, it looks like it couldn't save the checkpoint file. Did you change the checkpointPath (https://github.com/vlawhern/arl-eegmodels/blob/master/examples/ERP.py#L158)? That path only exists on Linux-based systems.
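For anyone hitting the same question, here is a minimal sketch of choosing a checkpoint location that exists on Linux, macOS, and Windows. The helper name `writable_checkpoint_path` and the filename `checkpoint.h5` are hypothetical and are not taken from ERP.py; the returned string would be assigned to the script's checkpoint path.

```python
import os
import tempfile


def writable_checkpoint_path(filename="checkpoint.h5", directory=None):
    """Return a checkpoint path inside a directory that exists and is writable.

    Both arguments are illustrative; ERP.py defines its own path.
    """
    # Fall back to the system temp directory, which is defined on every platform.
    directory = directory or tempfile.gettempdir()
    os.makedirs(directory, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"Cannot write checkpoints to {directory!r}")
    return os.path.join(directory, filename)


checkpoint_path = writable_checkpoint_path()
print(checkpoint_path)
```

Failing early with a `PermissionError` makes a bad path obvious before training starts, rather than at the first checkpoint save.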

moonj94 commented 3 years ago

I haven't changed the checkpoint path, no. Can I change it to any path I see fit?

EDIT: I changed it to the path that contains ERP.py and the same situation arises. Only 1 epoch is completed and it exits.

vlawhern commented 3 years ago

Can you provide more of the error trace?

Note that you'll also need to change the checkpointPath line here (https://github.com/vlawhern/arl-eegmodels/blob/master/examples/ERP.py#L182) but your error is occurring before this so this probably isn't the problem here...
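When a script stops without printing a traceback, one way to collect more of the error trace is to run it in a child process and capture its exit code and stderr. The sketch below is only an illustration: `run_and_report` is a hypothetical helper, and the one-line child command stands in for the real script.

```python
import subprocess
import sys


def run_and_report(args):
    """Run a Python command in a child process; return (exit code, stderr text)."""
    # -X faulthandler asks the interpreter to dump a Python traceback on fatal
    # signals such as a segfault, which otherwise end the process silently.
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# Stand-in child command; for the real case, pass ["ERP.py"] from the examples directory.
code, err = run_and_report(["-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"])
print("exit code:", code)
print("stderr:", err)
```

On POSIX systems a negative exit code from `subprocess.run` means the child was killed by a signal, which is a useful hint that the crash happened in native code rather than in Python.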

vlawhern commented 3 years ago

Another thing to try is removing the checkpointer callback from the model.fit call, to check whether checkpointing is the cause:


fittedModel = model.fit(X_train, Y_train, batch_size = 16, epochs = 300, 
                        verbose = 2, validation_data=(X_validate, Y_validate), class_weight = class_weights)

moonj94 commented 3 years ago

This is what I see:

runfile('/Users/jaemoon/Desktop/SPCSEEGanalysis.git/CWT_betagamma/arl-eegmodels/examples/ERP.py', wdir='/Users/jaemoon/Desktop/SPCSEEGanalysis.git/CWT_betagamma/arl-eegmodels/examples')
X_train shape: (144, 1, 60, 151)
144 train samples
72 test samples
2020-12-07 11:49:46.609840: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-07 11:49:46.610221: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.
Train on 144 samples, validate on 72 samples
Epoch 1/300

You're right, I don't think the error has to do with the checkpoints.

Removing the checkpoint caller

vlawhern commented 3 years ago

At this point I'm not sure what's happening. It works fine for me.

How did you install tensorflow, and what version are you using? I'm using Anaconda Python with tensorflow installed through the conda package manager, so not the pip version. I'm using tensorflow-gpu=2.1.0 and on python=3.7.6 although that shouldn't matter (I've tested it on many other python and tensorflow versions without any problems).
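To answer the version question quickly, a short report of the interpreter and installed package versions can be pasted into the thread. This is a rough sketch assuming Python 3.8 or newer (for `importlib.metadata`); the function names are hypothetical, and a conda-installed package may be registered under a different distribution name than the ones listed.

```python
import platform
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(names=("tensorflow", "tensorflow-gpu", "mne", "pyriemann")):
    """Collect the Python version, platform string, and selected package versions."""
    report = {"python": platform.python_version(), "platform": platform.platform()}
    for name in names:
        report[name] = installed_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```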

moonj94 commented 3 years ago

I just ran `conda install tensorflow`. Can you give me detailed steps for how you installed it?

Sorry, I should have clarified that I am fairly rusty at Python.

vlawhern commented 3 years ago

So `conda install tensorflow` installs the CPU version of TensorFlow, which causes some problems because certain operations (AveragePooling2D) are not supported on CPU. If this is your situation there is a workaround.

If you have an Nvidia GPU you can install the GPU version of TensorFlow with `conda install tensorflow-gpu`.

To get the exact same environment I used for testing, you would do the following:

conda create --name test python=3.7 anaconda=2020.07
conda activate test
conda install tensorflow-gpu=2.1.0
pip install mne
pip install pyriemann
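After setting up an environment like the one above, it can help to confirm whether TensorFlow actually sees a GPU. The following is a minimal sketch, not part of the repository; `visible_gpus` is a hypothetical helper, and it assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available.

```python
def visible_gpus():
    """Return the GPU devices TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow will run everything on the CPU.
    return tf.config.list_physical_devices("GPU")


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
else:
    print("No GPU visible; TensorFlow will run on the CPU")
```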

moonj94 commented 3 years ago

I tried `conda install tensorflow-gpu=2.1.0` but I am getting an error saying:

`PackagesNotFoundError: The following packages are not available from current channels: tensorflow-gpu=2.1.0`

Can this only be run on Windows/Linux?

vlawhern commented 3 years ago

I've only tested this on Linux with a GPU, specifically Ubuntu 18.04 although other versions should work just fine. I'm not sure about Windows support as I understand installing the Nvidia drivers can be challenging.

Are you on a Mac? That would force you to use the CPU version, since there is no GPU version for Macs. If so, I'll need to push a new version of this code that works for both CPUs and GPUs (which isn't hard, and you're not the first person to run into the CPU vs GPU problem).

moonj94 commented 3 years ago

Yes, I am on a mac. If you could push the code to support CPU I would very much appreciate it!

vlawhern commented 3 years ago

OK, I pushed a change to the code and did a quick test on both tensorflow-cpu and tensorflow-gpu, and it worked. Pull down the latest commit and try the sample script again. It "should" work as is (but remember to change the checkpointPaths as needed).

moonj94 commented 3 years ago

It worked! Thank you so much for your prompt replies and help.

vlawhern commented 3 years ago

Good to hear. Closing the issue but let me know if other issues arise.