CarolinaFurtado closed this issue 3 years ago
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libcudnn8_8.0.2.39-1+cuda11.0_amd64.deb
sudo dpkg -i libcudnn8_8.0.2.39-1+cuda11.0_amd64.deb
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libcudnn8-dev_8.0.2.39-1+cuda11.0_amd64.deb
sudo dpkg -i libcudnn8-dev_8.0.2.39-1+cuda11.0_amd64.deb
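Before installing, it can help to confirm that the downloaded cuDNN packages were built for the CUDA version being installed. Below is a minimal sketch that reads this from the package filename; the helper names `parse_cudnn_deb` and `matches_cuda` are hypothetical, and the filename pattern is only assumed from the two packages listed above.

```python
import re

# Assumed filename pattern, based only on the two packages downloaded above:
#   libcudnn8[-dev]_<cudnn version>-<revision>+cuda<major.minor>_<arch>.deb
_CUDNN_DEB = re.compile(
    r"libcudnn\d+(?:-dev)?_(?P<cudnn>\d+(?:\.\d+)+)-\d+\+cuda(?P<cuda>\d+\.\d+)_"
)


def parse_cudnn_deb(filename):
    """Return (cudnn_version, cuda_version) parsed from a libcudnn .deb name."""
    match = _CUDNN_DEB.search(filename)
    if match is None:
        raise ValueError(f"not a recognised libcudnn package name: {filename}")
    return match.group("cudnn"), match.group("cuda")


def matches_cuda(filename, cuda_major_minor):
    """True if the package name says it was built for the given CUDA version."""
    return parse_cudnn_deb(filename)[1] == cuda_major_minor


if __name__ == "__main__":
    for deb in (
        "libcudnn8_8.0.2.39-1+cuda11.0_amd64.deb",
        "libcudnn8-dev_8.0.2.39-1+cuda11.0_amd64.deb",
    ):
        print(deb, parse_cudnn_deb(deb), matches_cuda(deb, "11.0"))
```

This only inspects the filename; it does not verify what is actually installed on the machine.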
The base image was found by running the following in the Google Cloud SDK:
gcloud compute images list
Image used: projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20201211a
# install needed packages
sudo apt-get install -y cmake \
git \
python3-setuptools \
python3-dev \
python3-pip \
libopencv-dev \
htop \
tmux \
tree \
p7zip-full
pip3 install --upgrade pip
pip3 install --upgrade setuptools
pip3 uninstall crcmod -y
pip3 install --no-cache-dir crcmod
pip3 install --upgrade pyasn1
cd necstlab-damage-segmentation && pip3 install -r requirements.txt
# Install requirements
sudo apt-get install -y \
checkinstall \
libreadline-gplv2-dev \
liblzma-dev \
libncursesw5-dev \
libssl-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
zlib1g-dev \
openssl \
libffi-dev \
python3-dev \
python3-setuptools \
wget
# install needed packages
sudo apt-get install -y cmake \
git \
libopencv-dev \
htop \
tmux \
tree \
p7zip-full
cd ~
mkdir tmp
cd tmp
wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz
tar zxvf Python-3.7.9.tgz
cd Python-3.7.9
./configure --prefix=$HOME/opt/python-3.7.9
make
make install
cd ~
echo 'export PATH=$HOME/opt/python-3.7.9/bin:$PATH' >> .bash_profile
. ~/.bash_profile
cd ~
pip3 install -U pip
pip3 install --upgrade setuptools
pip3 install --no-cache-dir crcmod
pip3 install --upgrade pyasn1
cd necstlab-damage-segmentation && pip3 install -r requirements.txt
python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/config_sandbox/tf_version_debug//train-small-3class_tfversion_debug.yaml
output: segmentation-model-small-3class_tfversion_debug_20201216T181455Z
python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/config_sandbox/tf_version_debug//train-small-3class_tfversion_debug.yaml --random-module-global-seed 1 --numpy-random-global-seed 1 --tf-random-global-seed 1 --pretrained-model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z
output: segmentation-model-small-3class_tfversion_debug_20201216T183905Z
python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/config_sandbox/tf_version_debug//train-small-3class_tfversion_debug.yaml --random-module-global-seed 1 --numpy-random-global-seed 1 --tf-random-global-seed 1 --pretrained-model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z
output: segmentation-model-small-3class_tfversion_debug_20201216T183858Z
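The two seeded runs above pass three separate seed flags. How train_segmentation_model.py consumes them is not shown in this thread, so the following is only a sketch of the usual pattern, assuming each flag maps to the corresponding library's global seed. The function name `set_global_seeds` is hypothetical; NumPy and TensorFlow are seeded only if they are installed.

```python
import random


def set_global_seeds(random_seed=None, numpy_seed=None, tf_seed=None):
    """Seed the available global RNGs and return the names that were seeded."""
    seeded = []
    if random_seed is not None:
        random.seed(random_seed)  # Python's built-in random module
        seeded.append("random")
    if numpy_seed is not None:
        try:
            import numpy as np
        except ImportError:
            pass  # NumPy not installed; nothing to seed
        else:
            np.random.seed(numpy_seed)  # NumPy's legacy global RNG
            seeded.append("numpy")
    if tf_seed is not None:
        try:
            import tensorflow as tf
        except ImportError:
            pass  # TensorFlow not installed; nothing to seed
        else:
            tf.random.set_seed(tf_seed)  # TensorFlow's global seed (TF 2.x)
            seeded.append("tensorflow")
    return seeded


if __name__ == "__main__":
    print(set_global_seeds(random_seed=1, numpy_seed=1, tf_seed=1))
```

Even with all three seeds fixed, some GPU kernels are non-deterministic, so small differences between GPU runs are still possible.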
The training loss is not exactly equal because the runs are on GPU, but it is very similar. The val loss is very different, any idea why @rak5216? The pretrained model only has 3 epochs, not sure if that has anything to do with it.
tf 2.3
python3 train_segmentation_model_prediction_thresholds.py --gcp-bucket gs://necstlab-sandbox --dataset-directory dataset-small-3class_tfversion_debug/validation --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --optimizing-class-metric iou_score --dataset-downsample-factor 0.1 --numpy-random-global-seed 1 --tf-random-global-seed 1
output: train_thresholds_segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug_iou_score/model_thresholds_20201216T190042Z.yaml
Train Prediction Thresholds Results:
{'class0': 0.9952027292893504, 'class1': 0.9952027292893504, 'class2': 0.9952027292893504}
tf 2.4
python3 train_segmentation_model_prediction_thresholds.py --gcp-bucket gs://necstlab-sandbox --dataset-directory dataset-small-3class_tfversion_debug/validation --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --optimizing-class-metric iou_score --dataset-downsample-factor 0.1 --numpy-random-global-seed 1 --tf-random-global-seed 1
output: train_thresholds_segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug_iou_score/model_thresholds_20201216T190039Z.yaml
Train Prediction Thresholds Results:
{'class0': 0.9952027292893504, 'class1': 0.9952027292893504, 'class2': 0.9952027292893504}
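The two threshold results above can be compared programmatically instead of by eye. This is a minimal sketch using the values printed above; the helper name `thresholds_match` is hypothetical, and the default tolerances of zero mean the values must be exactly equal.

```python
import math


def thresholds_match(a, b, rel_tol=0.0, abs_tol=0.0):
    """True if both dicts have the same classes and matching thresholds."""
    if a.keys() != b.keys():
        return False
    return all(
        math.isclose(a[name], b[name], rel_tol=rel_tol, abs_tol=abs_tol)
        for name in a
    )


# Values copied from the tf 2.3 and tf 2.4 results above.
tf23 = {"class0": 0.9952027292893504, "class1": 0.9952027292893504, "class2": 0.9952027292893504}
tf24 = {"class0": 0.9952027292893504, "class1": 0.9952027292893504, "class2": 0.9952027292893504}

if __name__ == "__main__":
    print(thresholds_match(tf23, tf24))
```

Passing a small `rel_tol` would relax the check if later versions differ only in the last few digits.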
tf 2.3
python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small-3class_tfversion_debug --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --trained-thresholds-id model_thresholds_20201216T190039Z.yaml
output: segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug/metrics_20201216T190706Z.csv
tf 2.4
python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small-3class_tfversion_debug --model-id segmentation-model-small-3class_tfversion_debug_20201216T181455Z --batch-size 16 --trained-thresholds-id model_thresholds_20201216T190039Z.yaml
output: tests/segmentation-model-small-3class_tfversion_debug_20201216T181455Z_dataset-small-3class_tfversion_debug/metrics_20201216T190710Z.csv
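To compare the two metrics files from the test runs above, something like the following could be used. It is only a sketch: the layout of the metrics CSV is not shown in this thread, so the code treats every column generically, and the column names in the demo (`metric`, `value`) are made up for illustration.

```python
import csv
import io
import math


def diff_metrics(csv_a, csv_b, rel_tol=1e-6):
    """Return (row_index, column, value_a, value_b) for every differing cell.

    csv_a and csv_b are open text files (or any iterable of CSV lines).
    Numeric cells are compared with a relative tolerance, others as strings.
    """
    rows_a = list(csv.DictReader(csv_a))
    rows_b = list(csv.DictReader(csv_b))
    if len(rows_a) != len(rows_b):
        raise ValueError("files have different numbers of rows")
    diffs = []
    for index, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        # Union of column names, keeping first-seen order.
        columns = list(dict.fromkeys(list(row_a) + list(row_b)))
        for column in columns:
            value_a, value_b = row_a.get(column), row_b.get(column)
            try:
                same = math.isclose(float(value_a), float(value_b), rel_tol=rel_tol)
            except (TypeError, ValueError):
                same = value_a == value_b  # non-numeric or missing cell
            if not same:
                diffs.append((index, column, value_a, value_b))
    return diffs


if __name__ == "__main__":
    # Made-up contents, only to show the call; not results from this issue.
    run_a = io.StringIO("metric,value\niou_score,0.5\n")
    run_b = io.StringIO("metric,value\niou_score,0.6\n")
    print(diff_metrics(run_a, run_b))
```

An empty list means the two files agree within the tolerance.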
@rak5216 the release notes for tf 2.4 don't mention anything that seems alarming: https://github.com/tensorflow/tensorflow/releases Can you take a look?
And the results are ok; only the validation loss is different, but I'm not sure the tf version is the cause.
I didn't notice any concerning breaking changes, though perhaps the new mixed precision feature can cause more deviation during training when comparing model learning. For such small datasets and limited epochs, it seems feasible to me that even a tiny difference in model weights on the train set can translate to major performance changes on the val set. Did we see this val loss behavior before on tf 2.1 vs tf 2.3 gpu when we had similar train loss? Maybe I'll just go check that issue too.
We should verify that tf 2.4 training is repeatable too. It seems that tf 2.1 vs tf 2.3 also wasn't exactly repeatable, even on cpu.
If you ran tf 2.4 on cpu using the same setup as the above image, I would hope (and, based on your results so far, expect) that tf 2.4 would look similar. We should also compile these results somewhere, so we only have to run the new tf versions and then add that data to compare against prior versions with an identical setup.
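A repeatability check like the one proposed here could be sketched as follows: take the per-epoch loss from two runs with identical seeds and report the largest relative gap. The function names and the tolerance are hypothetical, and the numbers in the demo are placeholders, not results from this issue.

```python
def max_relative_gap(run_a, run_b):
    """Largest relative difference between two equal-length loss histories."""
    if len(run_a) != len(run_b):
        raise ValueError("runs have different numbers of epochs")
    gap = 0.0
    for a, b in zip(run_a, run_b):
        scale = max(abs(a), abs(b))
        if scale > 0:  # both values zero means no difference at this epoch
            gap = max(gap, abs(a - b) / scale)
    return gap


def is_repeatable(run_a, run_b, rel_tol=1e-6):
    """True if every epoch agrees within rel_tol (an assumed tolerance)."""
    return max_relative_gap(run_a, run_b) <= rel_tol


if __name__ == "__main__":
    # Placeholder loss values for illustration only.
    first_run = [0.90, 0.70, 0.55]
    second_run = [0.90, 0.70, 0.55]
    print(max_relative_gap(first_run, second_run), is_repeatable(first_run, second_run))
```

Running the same check separately on train loss and val loss would show which of the two diverges between versions.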
Training is repeatable in tf 2.3 and 2.4 on cpu,
but the validation loss is not equal from tf 2.3 to 2.4 (as it also was not from 2.1 to 2.3).
I have now created a database on Google Drive so we can keep track of this for future tf versions (https://drive.google.com/drive/u/2/folders/1PRFrhADMOc6GtscObqbSNg85rFjtBLPm). I am not 100% sure about the results I saved for 2.1, and I can't repeat them now because some package stopped working with tf 2.1. I don't think it's worth spending too long on this since we are no longer using it anyway. Please disagree if needed, @rak5216.
We have already confirmed that train_thresholds and test give the same results even on gpu, but for the sake of completeness of the benchmark, I'll run those too.
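For anyone who wants a scriptable version of this benchmark record alongside the Google Drive folder, a local CSV log is one option. This is a sketch only: the file layout, the column names in `FIELDS`, and the helper names are all assumptions, not the structure of the database mentioned above.

```python
import csv
import os

# Hypothetical columns; adjust to whatever the benchmark actually records.
FIELDS = ["tf_version", "device", "run_id", "train_loss", "val_loss"]


def append_result(path, row):
    """Append one benchmark row, writing the header if the file is new."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)


def load_results(path, tf_version=None):
    """Load all rows, optionally keeping only one TensorFlow version."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if tf_version is not None:
        rows = [row for row in rows if row["tf_version"] == tf_version]
    return rows
```

With this layout, benchmarking a new tf version only means appending its rows and filtering with `load_results(path, tf_version=...)` to compare against earlier entries.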
TensorFlow 2.4 just came out on Dec 14, 2020. We should upgrade to CUDA 11, upgrade Ubuntu, etc.
"TensorFlow 2.4 runs with CUDA 11 and cuDNN 8, enabling support for the newly available NVIDIA Ampere GPU architecture. To learn more about CUDA 11 features, check out this NVIDIA developer blog."