upgrade software versions

CarolinaFurtado commented 3 years ago

TensorFlow 2.4 came out on Dec 14, 2020. We should upgrade to CUDA 11, upgrade Ubuntu, etc.

"TensorFlow 2.4 runs with CUDA 11 and cuDNN 8, enabling support for the newly available NVIDIA Ampere GPU architecture. To learn more about CUDA 11 features, check out this NVIDIA developer blog."

CarolinaFurtado commented 3 years ago

install cuda 11.0

https://developer.nvidia.com/cuda-11.0-update1-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

install cudnn 8

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libcudnn8_8.0.2.39-1+cuda11.0_amd64.deb
sudo dpkg -i libcudnn8_8.0.2.39-1+cuda11.0_amd64.deb

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libcudnn8-dev_8.0.2.39-1+cuda11.0_amd64.deb
sudo dpkg -i libcudnn8-dev_8.0.2.39-1+cuda11.0_amd64.deb
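
After the install, it may be worth confirming that the TensorFlow build targets CUDA 11 / cuDNN 8 and that it can see the GPU. A minimal sketch, assuming TensorFlow 2.3 or later (where `tf.sysconfig.get_build_info()` is available). The helper names `meets_minimum`, `check_build_info`, and `report_tensorflow_build` are made up for illustration and are not part of the repo:

```python
def meets_minimum(found, required):
    """Return True if dotted version string `found` is >= `required`."""
    def as_tuple(version):
        return tuple(int(part) for part in str(version).split(".") if part.isdigit())

    found_t, required_t = as_tuple(found), as_tuple(required)
    width = max(len(found_t), len(required_t))
    # Pad with zeros so that "11" and "11.0" compare as equal.
    found_t += (0,) * (width - len(found_t))
    required_t += (0,) * (width - len(required_t))
    return found_t >= required_t


def check_build_info(build_info, min_cuda="11.0", min_cudnn="8"):
    """Return a list of problems; an empty list means the build looks fine."""
    problems = []
    for key, minimum in (("cuda_version", min_cuda), ("cudnn_version", min_cudnn)):
        found = build_info.get(key)
        if found is None:
            problems.append(f"{key} missing (CPU-only build?)")
        elif not meets_minimum(found, minimum):
            problems.append(f"{key} is {found}, expected >= {minimum}")
    return problems


def report_tensorflow_build():
    """Print build info and visible GPUs. Imports TensorFlow lazily."""
    import tensorflow as tf

    print("TensorFlow", tf.__version__)
    print("problems:", check_build_info(tf.sysconfig.get_build_info()) or "none")
    print("GPUs:", tf.config.list_physical_devices("GPU"))
```

Calling `report_tensorflow_build()` on the VM prints the versions TensorFlow was built against, which is not the same as what apt installed. An empty GPU list usually means the installed libraries could not be loaded.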


ubuntu 18.04

Image found with gcloud compute images list (Google Cloud SDK):

projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20201211a

Default python version

# install needed packages
sudo apt-get install -y cmake \
    git \
    python3-setuptools \
    python3-dev \
    python3-pip \
    libopencv-dev \
    htop \
    tmux \
    tree \
    p7zip-full

pip3 install --upgrade pip
pip3 install --upgrade setuptools
pip3 uninstall crcmod -y
pip3 install --no-cache-dir crcmod
pip3 install --upgrade pyasn1
cd necstlab-damage-segmentation && pip3 install -r requirements.txt

or specific version:

# Install requirements
sudo apt-get install -y \
    checkinstall \
    libreadline-gplv2-dev \
    liblzma-dev \
    libncursesw5-dev \
    libssl-dev \
    libsqlite3-dev \
    tk-dev \
    libgdbm-dev \
    libc6-dev \
    libbz2-dev \
    zlib1g-dev \
    openssl \
    libffi-dev \
    python3-dev \
    python3-setuptools \
    wget

# install needed packages
sudo apt-get install -y cmake \
    git \
    libopencv-dev \
    htop \
    tmux \
    tree \
    p7zip-full

cd ~
mkdir -p tmp
cd tmp
wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz
tar zxvf Python-3.7.9.tgz
cd Python-3.7.9
./configure --prefix=$HOME/opt/python-3.7.9
make
make install
cd ~
echo 'export PATH=$HOME/opt/python-3.7.9/bin:$PATH' >> .bash_profile
. ~/.bash_profile
cd ~

pip3 install -U pip
pip3 install --upgrade setuptools
pip3 install --no-cache-dir crcmod
pip3 install --upgrade pyasn1
cd necstlab-damage-segmentation && pip3 install -r requirements.txt

CarolinaFurtado commented 3 years ago

result comparison:

train

generate pretrained model:

python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/config_sandbox/tf_version_debug//train-small-3class_tfversion_debug.yaml

output: segmentation-model-small-3class_tfversion_debug_20201216T181455Z

tf 2.3

python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/config_sandbox/tf_version_debug//train-small-3class_tfversion_debug.yaml --random-module-global-seed 1 --numpy-random-global-seed 1 --tf-random-global-seed 1 --pretrained-model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z

output: segmentation-model-small-3class_tfversion_debug_20201216T183905Z

tf 2.4

python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/config_sandbox/tf_version_debug//train-small-3class_tfversion_debug.yaml --random-module-global-seed 1 --numpy-random-global-seed 1 --tf-random-global-seed 1 --pretrained-model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z

output: segmentation-model-small-3class_tfversion_debug_20201216T183858Z
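
The three `--*-global-seed` flags presumably map onto the seeding calls of the Python `random` module, NumPy, and TensorFlow. A minimal sketch of that seeding under that assumption. The function name `set_global_seeds` is hypothetical and the repo's actual implementation may differ:

```python
import random


def set_global_seeds(random_seed=None, numpy_seed=None, tf_seed=None):
    """Seed each random number generator whose seed was provided."""
    if random_seed is not None:
        random.seed(random_seed)
    if numpy_seed is not None:
        import numpy as np  # imported lazily so the sketch runs without NumPy

        np.random.seed(numpy_seed)
    if tf_seed is not None:
        import tensorflow as tf  # imported lazily so the sketch runs without TF

        tf.random.set_seed(tf_seed)


# Re-seeding with the same value reproduces the same draw.
set_global_seeds(random_seed=1)
first_draw = random.random()
set_global_seeds(random_seed=1)
second_draw = random.random()
print(first_draw == second_draw)  # prints True
```

Seeding fixes the order of random draws only. It does not make GPU kernels deterministic, so small differences in the training loss on GPU are still expected.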

results:

Training loss is not exactly equal because it runs on GPU, but it is very similar. Val loss is very different; any idea why, @rak5216? The pretrained model only has 3 epochs; not sure if that has anything to do with it.
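
To put a number on "very similar" versus "very different", the per-epoch losses of two runs can be compared by their largest relative difference. A minimal sketch, assuming each run's losses have been copied into a plain list. The function names and the default tolerance are illustrative choices, not values from this issue:

```python
def max_relative_difference(run_a, run_b):
    """Largest per-epoch relative difference between two loss histories."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must have the same number of epochs")
    worst = 0.0
    for a, b in zip(run_a, run_b):
        scale = max(abs(a), abs(b))
        if scale == 0:
            continue  # both values are zero, so they agree exactly
        worst = max(worst, abs(a - b) / scale)
    return worst


def histories_match(run_a, run_b, rel_tol=1e-3):
    """True if every epoch agrees within the relative tolerance."""
    return max_relative_difference(run_a, run_b) <= rel_tol
```

Running it separately on the training loss and the validation loss of the tf 2.3 and tf 2.4 runs would show how far apart each pair actually is.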

train_thresholds

tf 2.3

python3 train_segmentation_model_prediction_thresholds.py --gcp-bucket gs://necstlab-sandbox --dataset-directory dataset-small-3class_tfversion_debug/validation --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --optimizing-class-metric iou_score --dataset-downsample-factor 0.1 --numpy-random-global-seed 1 --tf-random-global-seed 1

output: train_thresholds_segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug_iou_score/model_thresholds_20201216T190042Z.yaml

Train Prediction Thresholds Results:
{'class0': 0.9952027292893504, 'class1': 0.9952027292893504, 'class2': 0.9952027292893504}

tf 2.4

python3 train_segmentation_model_prediction_thresholds.py --gcp-bucket gs://necstlab-sandbox --dataset-directory dataset-small-3class_tfversion_debug/validation --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --optimizing-class-metric iou_score --dataset-downsample-factor 0.1 --numpy-random-global-seed 1 --tf-random-global-seed 1

output: train_thresholds_segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug_iou_score/model_thresholds_20201216T190039Z.yaml

Train Prediction Thresholds Results:
{'class0': 0.9952027292893504, 'class1': 0.9952027292893504, 'class2': 0.9952027292893504}

Results: exactly the same
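
The comparison above can also be checked mechanically instead of by eye. A small sketch using the two threshold dictionaries printed above; the function name `diff_thresholds` is made up for illustration:

```python
def diff_thresholds(thresholds_a, thresholds_b):
    """Return {class_name: (a, b)} for every class whose threshold differs."""
    differences = {}
    for name in sorted(set(thresholds_a) | set(thresholds_b)):
        a, b = thresholds_a.get(name), thresholds_b.get(name)
        if a != b:
            differences[name] = (a, b)
    return differences


# Values copied from the tf 2.3 and tf 2.4 outputs above.
tf23 = {'class0': 0.9952027292893504, 'class1': 0.9952027292893504, 'class2': 0.9952027292893504}
tf24 = {'class0': 0.9952027292893504, 'class1': 0.9952027292893504, 'class2': 0.9952027292893504}
print(diff_thresholds(tf23, tf24))  # prints {}
```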

test

tf 2.3

python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small-3class_tfversion_debug --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --trained-thresholds-id model_thresholds_20201216T190039Z.yaml

output: segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug/metrics_20201216T190706Z.csv

tf 2.4

python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small-3class_tfversion_debug --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --trained-thresholds-id model_thresholds_20201216T190039Z.yaml

output: tests/segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug/metrics_20201216T190710Z.csv

results: same
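
For the metrics CSVs, "same" can be verified cell by cell. A minimal sketch, assuming both metrics files have first been copied from the bucket to local paths. It makes no assumption about the column names, and `differing_cells` is a hypothetical helper:

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of rows (each row a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def differing_cells(path_a, path_b):
    """Return (row, column, value_a, value_b) for each cell that differs."""
    rows_a, rows_b = read_rows(path_a), read_rows(path_b)
    differences = []
    for r in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[r] if r < len(rows_a) else []
        row_b = rows_b[r] if r < len(rows_b) else []
        for c in range(max(len(row_a), len(row_b))):
            cell_a = row_a[c] if c < len(row_a) else None
            cell_b = row_b[c] if c < len(row_b) else None
            if cell_a != cell_b:
                differences.append((r, c, cell_a, cell_b))
    return differences
```

An empty list means the two files agree exactly. The cells are compared as strings, so this is a strict check; a numeric tolerance would need the values parsed first.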

CarolinaFurtado commented 3 years ago

@rak5216 the release notes for tf 2.4 don't say anything that seems alarming (https://github.com/tensorflow/tensorflow/releases). Can you take a look?

The results are OK; only the validation loss is different, and I'm not sure the tf version is the cause.

rak5216 commented 3 years ago

I didn't notice any concerning breaking changes, though perhaps the new mixed precision feature could cause more deviation during training when comparing model learning. For such small datasets and limited epochs, it seems feasible to me that even a tiny difference in model weights on the train set can translate to major performance changes on the val set. Did we see this val loss behavior before on tf 2.1 vs tf 2.3 GPU when we had similar train loss? Maybe I'll just go check that issue too.

rak5216 commented 3 years ago

We should verify that tf 2.4 training is repeatable too. It seems that tf 2.1 vs tf 2.3 also wasn't exactly repeatable, even on CPU.

rak5216 commented 3 years ago

If you ran tf 2.4 on CPU using the same setup as the above image, I would hope (and think, based on your results already, that this would indeed happen) that tf 2.4 would look similar. We should also compile these results somewhere, so we only have to run the new tf versions and then add that data to compare against prior versions with an identical setup.

CarolinaFurtado commented 3 years ago

Training is repeatable in tf 2.3 and 2.4 on CPU,

but the validation loss is not equal from tf 2.3 to 2.4 (as it also was not from 2.1 to 2.3).

I have now created a database on Google Drive so we can keep track of this for future tf versions (https://drive.google.com/drive/u/2/folders/1PRFrhADMOc6GtscObqbSNg85rFjtBLPm). I am not 100% sure about the results I saved for 2.1, and I can't repeat them now because some package stopped working with tf 2.1. I don't think it's worth spending too long on this since we are no longer using it anyway. Please disagree if needed, @rak5216.

We have already confirmed that the train_thresholds and test results are the same even on GPU, but for the sake of completeness of the benchmark, I'll run those too.
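
One possible shape for such a benchmark log is a flat CSV with one row per measurement, so that a new tf version only appends rows. This is a sketch only: the column names in `FIELDNAMES` and both function names are hypothetical and do not describe the actual Drive database:

```python
import csv
import os

# Hypothetical columns; adjust to whatever the real benchmark tracks.
FIELDNAMES = ["tf_version", "device", "step", "metric", "value"]


def append_benchmark_row(path, row):
    """Append one measurement, writing the header if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def load_benchmark(path):
    """Load all rows as dictionaries (values are read back as strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

Keeping `device` and `step` as separate columns makes it easy to filter, for example, all CPU training rows across versions when a new release is added.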

mit-quest / necstlab-damage-segmentation