mrdbourke / tensorflow-deep-learning

All course materials for the Zero to Mastery Deep Learning with TensorFlow course.
https://dbourke.link/ZTMTFcourse
MIT License

Model fitting ETA is 100+ hours for some reason in 07_food_vision_milestone_project. Also found a resolution for the mixed precision float16/float32 error #494

Open Jayms8462 opened 1 year ago

Jayms8462 commented 1 year ago
I'm also running into a problem. After downgrading to TensorFlow 2.4.1, the ETA for fitting the model sits at around 100 hours, even when using the NVIDIA T4. Even running the project directly from GitHub (https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/07_food_vision_milestone_project_1.ipynb) it still shows a 100+ hour ETA.

Saving TensorBoard log files to: training_logs/efficientnetb0_101_classes_all_data_feature_extract/20221227-085937
Epoch 1/3
   3/2368 [..............................] - ETA: 97:10:20 - loss: nan - accuracy: 0.0382

Tue Dec 27 08:59:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P0    31W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
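A quick sanity check worth running here (not part of the original notebook; these are standard TF 2.x calls) is whether TensorFlow itself can see the GPU, since nvidia-smi only proves the driver side:

import tensorflow as tf

# Print the TensorFlow version and any GPUs visible to it.
print("TensorFlow version:", tf.__version__)
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

# If this list is empty, model.fit() silently falls back to the CPU,
# which would explain a 100+ hour ETA even though nvidia-smi shows a T4.

If the list comes back empty, a likely cause is that the downgraded TensorFlow 2.4.1 binary was built against a different CUDA version than the Colab image provides, so the driver sees the T4 but TensorFlow does not.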

Turn on mixed precision training

from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy(policy="mixed_float16") # set global policy to mixed precision

WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16. If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
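The warning above is the real tell: this TensorFlow build is not detecting the GPU at all, so everything (including the fit) runs on the CPU. mixed_float16 needs an NVIDIA GPU with compute capability 7.0 or higher, and the Tesla T4 is 7.5, so it qualifies once detected. A sketch of how to check from inside the runtime, assuming TF 2.5+ (where tf.config.experimental.get_device_details exists):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Returns a dict such as {"device_name": "Tesla T4", "compute_capability": (7, 5)}.
    details = tf.config.experimental.get_device_details(gpus[0])
    print("Compute capability:", details.get("compute_capability"))
else:
    print("No GPU detected - mixed_float16 will run slowly on the CPU.")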

Originally posted by @Jayms8462 in https://github.com/mrdbourke/tensorflow-deep-learning/discussions/82#discussioncomment-4501569

Jayms8462 commented 1 year ago

Copying my comments over from the discussion board, so it's on record that the issue is no longer a problem for me:

I left this running for a few hours in case the ETA was not accurate. I'm running this on Google Colab with the GPU enabled in the runtime (as seen above), on the paid middle tier, and I confirmed the GPU was an NVIDIA T4 right before fitting the model.

Just an update: I was able to run mixed_precision in TensorFlow 2.11 with no issues.

I was also able to get past my problem by running a Jupyter Notebook locally on my system with an RTX 3060 Ti and 64 GB of RAM. This sped up fitting the model on all the data: instead of the fit taking a few hours, the first epoch dropped to about 2 minutes.

The solution to the issue is just updating to TensorFlow 2.11
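In a Colab cell the upgrade might look like the sketch below (the explicit version pin is illustrative; restart the runtime afterwards so the new version gets imported):

# Upgrade TensorFlow inside the Colab runtime, then Runtime > Restart runtime.
!pip install -U tensorflow==2.11.0

# After the restart:
import tensorflow as tf
print(tf.__version__)                          # expect 2.11.0
print(tf.config.list_physical_devices("GPU"))  # the T4 should show up here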

Jayms8462 commented 1 year ago

absl-py==1.3.0
appdirs==1.4.4
astunparse==1.6.3
attrs==21.2.0
backcall==0.2.0
beautifulsoup4==4.10.0
beniget==0.4.1
blinker==1.4
Brotli==1.0.9
cachetools==5.2.0
certifi==2022.12.7
chardet==4.0.0
charset-normalizer==2.1.1
click==8.1.3
comm==0.1.2
command-not-found==0.3
contourpy==1.0.6
cryptography==3.4.8
cycler==0.11.0
dbus-python==1.2.18
debugpy==1.6.4
decorator==4.4.2
dill==0.3.6
distro==1.7.0
distro-info===1.1build1
dm-tree==0.1.8
entrypoints==0.4
etils==0.9.0
flatbuffers==22.12.6
fonttools==4.38.0
fs==2.4.12
gast==0.4.0
google-auth==2.15.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.57.0
grpcio==1.51.1
h5py==3.7.0
html5lib==1.1
httplib2==0.20.2
idna==3.4
importlib-metadata==4.6.4
importlib-resources==5.10.2
ipykernel==6.19.4
ipython==7.31.1
jedi==0.18.0
jeepney==0.7.1
joblib==1.2.0
jupyter_client==7.4.8
jupyter_core==5.1.1
keras==2.11.0
keyring==23.5.0
kiwisolver==1.4.4
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
libclang==14.0.6
lxml==4.8.0
lz4==3.1.3+dfsg
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.6.2
matplotlib-inline==0.1.3
more-itertools==8.10.0
mpmath==0.0.0
nest-asyncio==1.5.6
netifaces==0.11.0
numpy==1.24.1
oauthlib==3.2.0
olefile==0.46
opt-einsum==3.3.0
packaging==22.0
parso==0.8.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
platformdirs==2.6.2
ply==3.11
promise==2.3
prompt-toolkit==3.0.28
protobuf==3.19.6
psutil==5.9.4
ptyprocess==0.7.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.11.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
python-apt==2.3.0+ubuntu2.1
python-dateutil==2.8.2
pythran==0.10.0
pytz==2022.1
PyYAML==5.4.1
pyzmq==24.0.1
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
scikit-learn==1.2.0
scipy==1.9.3
SecretStorage==3.3.1
six==1.16.0
sklearn==0.0.post1
soupsieve==2.3.1
sympy==1.9
systemd-python==234
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.0
tensorflow-datasets==4.8.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.29.0
tensorflow-metadata==1.12.0
termcolor==2.1.1
threadpoolctl==3.1.0
toml==0.10.2
tornado==6.2
tqdm==4.64.1
traitlets==5.8.0
typing_extensions==4.4.0
ubuntu-advantage-tools==27.12
ufoLib2==0.13.1
ufw==0.36.1
unattended-upgrades==0.1
unicodedata2==14.0.0
urllib3==1.26.13
wadllib==1.3.6
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==2.2.2
wget==3.2
wrapt==1.14.1
zipp==1.0.0

sgkouzias commented 1 year ago

Were you able to save and load the model with TensorFlow==2.11.0? I ultimately installed TensorFlow==2.8.1 and everything worked (mixed precision was utilized as well).
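For reference, a minimal save/load round trip to test this, assuming a compiled Keras model named model trained under the mixed_float16 policy (the save path is illustrative):

import tensorflow as tf

# Save the whole model (architecture + weights + optimizer state).
model.save("07_food_vision_mixed_precision_model")

# Load it back and check that the layer dtype policies survived.
loaded_model = tf.keras.models.load_model("07_food_vision_mixed_precision_model")
for layer in loaded_model.layers[:5]:
    print(layer.name, layer.dtype_policy)  # most layers should report "mixed_float16"

If the notebook keeps the final softmax activation in float32 (standard mixed precision practice for numeric stability), that one layer is expected to differ.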