vuptran / cardiac-segmentation

Convolutional Neural Networks for Cardiac Segmentation

How to train the demo? #1

Closed lxy-94 closed 7 years ago

lxy-94 commented 7 years ago

Hello, I have read your paper, and after following your tutorial I tried to train on the Sunnybrook dataset on my computer. However, after 40 epochs the loss, accuracy, and Jaccard coefficient are the same as after 1 epoch. What can I do to increase my accuracy when training on the Sunnybrook dataset? Thank you very much for your paper and tutorial; I look forward to your reply.

vuptran commented 7 years ago

Did you run train_sunnybrook.py as is? Also, check whether the function cv2.fillPoly is correctly reading in the contour files from the Sunnybrook dataset by inserting some print lines. I'm using OpenCV 3.1.
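
Something along these lines should tell you whether the polygons are actually being filled (the contour path and mask size below are placeholders for your local Sunnybrook layout, not values taken from this repo):

```python
import numpy as np
import cv2

# Placeholder path: point this at one of your local Sunnybrook contour files.
contour_path = "path/to/IM-0001-0048-icontour-manual.txt"

# Each line of a contour file is an "x y" coordinate pair.
coords = np.loadtxt(contour_path).round().astype(np.int32)
print("number of contour points:", len(coords))

# Rasterize the polygon into a binary mask (use your actual image shape here).
mask = np.zeros((256, 256), dtype=np.uint8)
cv2.fillPoly(mask, [coords], 255)
print("non-zero mask pixels:", cv2.countNonZero(mask))  # should be > 0
```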

tpapastylianou commented 7 years ago

@vuptran I am having the same / similar problem. The result after the for loop at line 145 (in train_sunnybrook.py) is simply [nan, 0.0, nan, nan].

In response to your question about cv2.fillPoly, yes that seems to work as intended.

Running on Linux Mint 18.2, 64-bit. Some file rearranging/renaming was necessary in the Sunnybrook dataset, but otherwise the code runs fine; it just gives 'nan' results as early as epoch 1.

vuptran commented 7 years ago

I'll take a look at this. It didn't happen in my environment listed in the README.md.

In the meantime, could you reduce the learning rate specified in fcn_model.py to see if nan still appears?
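
For reference, the kind of change I have in mind is just lowering the lr passed to the optimizer where the model is compiled. A minimal sketch with a toy model (the actual optimizer, loss, and metrics in fcn_model.py may differ):

```python
from keras.models import Sequential
from keras.layers import Conv2D
from keras.optimizers import Adam

# Toy stand-in model; in fcn_model.py the real FCN is built instead.
model = Sequential([Conv2D(1, (1, 1), activation='sigmoid',
                           input_shape=(None, None, 1))])

# The point is just the lr argument: try 1e-4, 1e-5, ... and watch whether nan persists.
model.compile(optimizer=Adam(lr=1e-5),
              loss='binary_crossentropy',
              metrics=['accuracy'])
```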

tpapastylianou commented 7 years ago

Yes, it does. I've tried learning rates down to lr = 0.00001. Here's a typical run:

tasos@tasos-VanB ~/Desktop/cardiac-segmentation-master $ ./train_sunnybrook.py i 0 
/usr/local/lib/python2.7/dist-packages/dicom/__init__.py:53: UserWarning: 
This code is using an older version of pydicom, which is no longer 
maintained as of Jan 2017.  You can access the new pydicom features and API 
by installing `pydicom` from PyPI.
See 'Transitioning to pydicom 1.x' section at pydicom.readthedocs.org 
for more information.

  warnings.warn(msg)
Using TensorFlow backend.
Mapping ground truth i contours to images in train...
Shuffling data
Number of examples: 260
Done mapping training set

Building Train dataset ...

Processing 234 images and labels ...

Building Dev dataset ...

Processing 26 images and labels ...

2017-09-04 05:11:25.356507: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-04 05:11:25.356530: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-04 05:11:25.356534: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-04 05:11:25.356537: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-04 05:11:25.356540: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

Main Epoch 1

Learning rate: 0.000010

Train result ['loss', 'acc', 'dice_coef', 'jaccard_coef']:
[        nan  0.02561367         nan         nan]

Evaluating dev set ...
26/26 [==============================] - 4s

Dev set result ['loss', 'acc', 'dice_coef', 'jaccard_coef']:
[ nan   0.  nan  nan]

Saving model weights to model_logs/sunnybrook_i_epoch_1.h5

Main Epoch 2

Learning rate: 0.000010

Train result ['loss', 'acc', 'dice_coef', 'jaccard_coef']:
[ nan   0.  nan  nan]

Evaluating dev set ...
26/26 [==============================] - 3s

Dev set result ['loss', 'acc', 'dice_coef', 'jaccard_coef']:
[ nan   0.  nan  nan]

Saving model weights to model_logs/sunnybrook_i_epoch_2.h5

Main Epoch 3

... etc for 40 epochs.

Environment: Linux Mint 18.2 (64-bit), Python 2.7, Keras with the TensorFlow backend (CPU-only TensorFlow installed via pip), and an older pydicom (see the warning above).

tpapastylianou commented 7 years ago

Dear @vuptran

I have tried again using tensorflow-gpu / cuda / cudnn (instead of the standard TensorFlow installation available through pip), and I can confirm that I now get the intended results. Interestingly, though, the contour is not identical to the one generated from the example weights provided with the code. Am I right in thinking that one simply copies the result from the last epoch into the weights folder and renames it accordingly?

Any idea why the code fails when run on the CPU? Is this a bug, or was your code written exclusively for GPU use? (I haven't spotted anything suggesting this in the code, but admittedly I haven't gone through it in much detail.)

vuptran commented 7 years ago

I'm glad this worked out. Computation on CPU versus GPU should be handled automatically by TensorFlow; there is no explicit GPU declaration in the code. I have noticed that TensorFlow compilation is very environment-specific. For example, TensorFlow compiled on a VM with a K80 GPU does not import properly when copied over to a different VM with a different card.
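
If you want to verify the placement yourself, a quick check like this (plain TF 1.x API, nothing specific to this repo) lists the devices TensorFlow can see and logs which device each op runs on:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see (CPU, and GPU if the build finds one).
print([d.name for d in device_lib.list_local_devices()])

# Log where each op is placed when the graph runs.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0], name='a')
    b = tf.constant([3.0, 4.0], name='b')
    print(sess.run(a + b))
```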

The packaged weights for the Sunnybrook model were produced by training on the entire training set, rather than splitting it into train/val sets. This improves the model slightly, as it learns from more data.
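
For anyone wanting to reproduce that, the idea is simply to fit on all of the mapped examples instead of holding out a dev split. A toy sketch with random data and a stand-in model (not the actual FCN or data pipeline from this repo):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D
from keras.optimizers import Adam

# Toy stand-ins for the 260 mapped Sunnybrook images and masks; the real script
# builds these arrays from the DICOM images and contour files.
images = np.random.rand(260, 32, 32, 1).astype('float32')
masks = (images > 0.5).astype('float32')

# Stand-in for the FCN built in fcn_model.py.
model = Sequential([Conv2D(1, (1, 1), activation='sigmoid', input_shape=(32, 32, 1))])
model.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['accuracy'])

# Fit on all 260 examples rather than carving off a 26-example dev set.
model.fit(images, masks, batch_size=32, epochs=1, verbose=0)
model.save_weights('sunnybrook_i_full_training_set.h5')
```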