Fiji does not find TF-GPU

tibuch commented 4 years ago

@XarlesSta reported the following issue in N2V:

I am using N2V in Fiji and installed TensorFlow properly. But when I run N2V in Fiji, the PC seems to use the CPU and not the GPU: the CPU is maxed out. The Fiji console indicates that Fiji is using GPU-TF and not CPU-TF (see below), so I do not understand why my CPU is maxed out. Am I missing something here?

Finally, why is TF 1.15.0 GPU not available in Edit > Options > TensorFlow? I had to use TF 1.14.0 GPU because I did not see an option for version 1.15. I am using a PC with Win10.

From the Fiji console:

[INFO] Load TensorFlow..
[INFO] Using native TensorFlow version: TF 1.14.0 GPU (CUDA 10.0, CuDNN >= 7.4.1)
Using 10% of training data for validation
[INFO] Tile training and validation data..
[INFO] Generated 200 tiles of shape [128, 128]
[INFO] Create session..
[INFO] Import graph..
[INFO] Normalizing..
[INFO] mean: 253.40125
[INFO] stdDev: 40.8419
[INFO] Augment tiles..
[INFO] Prepare training batches...
65 blind-spots will be generated per training patch of size [64, 64].
[INFO] Prepare validation batches..
65 blind-spots will be generated per training patch of size [64, 64].
[INFO] Start training..
[INFO] Epoch 1/300
1 / 200 [----------] - loss: 1.836171 mse: 1.836171 abs: 1.108038 lr: 0.000400
2 / 200 [----------] - loss: 1.304834 mse: 1.304835 abs: 0.941930 lr: 0.000400
3 / 200 [----------] - loss: 1.094565 mse: 1.094565 abs: 0.839286 lr: 0.000400
4 / 200 [----------] - loss: 1.069823 mse: 1.069823 abs: 0.795247 lr: 0.000400

Note: Even though the Fiji console was not showing any "error", N2V was not running on the GPU. It began working after I downloaded the proper cuDNN files and copied them into the proper directories: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installwindows
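
The note above suggests that the console can report a GPU build even when the native CUDA and cuDNN libraries are not in place. A minimal sketch of a check for that, assuming the Windows DLL names `cudart64_100.dll` (CUDA 10.0) and `cudnn64_7.dll` (cuDNN 7.x) and assuming they are located through the directories listed in `PATH`. The function name `find_libraries` is hypothetical and not part of Fiji or N2V:

```python
import os


def find_libraries(names, search_dirs):
    """Map each library file name to the first directory containing it, or None."""
    found = {}
    for name in names:
        found[name] = None
        for directory in search_dirs:
            # Skip empty PATH entries; stop at the first directory that has the file.
            if directory and os.path.isfile(os.path.join(directory, name)):
                found[name] = directory
                break
    return found


if __name__ == "__main__":
    # Assumed DLL names for CUDA 10.0 and cuDNN 7.x on Windows.
    wanted = ["cudart64_100.dll", "cudnn64_7.dll"]
    dirs = os.environ.get("PATH", "").split(os.pathsep)
    for name, location in find_libraries(wanted, dirs).items():
        print(name, "->", location or "NOT FOUND on PATH")
```

A "NOT FOUND" line would be a hint that the cuDNN files still need to be copied as described in the NVIDIA documentation; it does not prove that Fiji cannot load them some other way.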

@frauzufall could you have a look?

frauzufall commented 4 years ago

Dear @XarlesSta, thank you for reporting the issues!

If you don't see TensorFlow 1.15.0, that sounds like you might not have the newest version of our tools. The N2V update site is now obsolete; you only need the CSBDeep update site. Can you check if there are any updates available?

Also, did I understand your note right that after correctly moving the cuDNN files as described in the NVIDIA documentation your GPU is now used by N2V?

XarlesSta commented 4 years ago

I am using the CSBDeep update site. Even with this site, I cannot see TF 1.15.0 GPU. The newest version I see is TF 1.14.0 GPU (CUDA 10.0, cuDNN >= 7.4.1). See pic. I was using TF 1.15.0 on my PC but had to downgrade to TF 1.14.0 to make N2V work in Fiji. Maybe something is missing on your end for Windows machines?
Yes. N2V began working well after I moved the files as described in the NVIDIA documentation. The problem was (again) that I had newer NVIDIA drivers (CUDA 11.0 and cuDNN 7.4.6 or cuDNN 8.0.1). I had to downgrade to CUDA 10.0 and cuDNN >= 7.4.1 to make N2V work in Fiji. As I mentioned, the only option I see is CUDA 10.0, cuDNN >= 7.4.1, which are "old" drivers, and I had newer ones. After uninstalling the newer drivers and downloading and moving the old ones as shown in the NVIDIA documentation, N2V began working.

[Screenshots: "TF Fiji GPU CGC_June27-2020"; "DSBDeep Update for TF Fiji GPU CGC_June27-2020"]

frauzufall commented 4 years ago

Thanks @XarlesSta, let's figure out a way where you don't have to downgrade TF. I am asking someone with Windows to check if the version is missing there too, but here's another test we can do: can you please look into Fiji.app/jars and check which imagej-tensorflow version is in there? It should be imagej-tensorflow-1.1.4.jar. In case it's an older version, another update site might be overwriting it (e.g. https://sites.imagej.net/TensorFlow still has an older version; I'll update it soon).
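
The jar check requested above can also be scripted. The following is a sketch only, assuming jar files follow the `<artifact>-<version>.jar` naming seen in this thread (for example imagej-tensorflow-1.1.4.jar); the helper name `installed_versions` is hypothetical:

```python
import re
from pathlib import Path

# Splits "imagej-tensorflow-1.1.4.jar" into artifact "imagej-tensorflow" and version "1.1.4".
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d+(?:\.\d+)*)\.jar$")


def installed_versions(jars_dir, artifact):
    """Return sorted version tuples of every `<artifact>-<version>.jar` in jars_dir."""
    versions = []
    for jar in Path(jars_dir).glob(f"{artifact}-*.jar"):
        match = JAR_PATTERN.match(jar.name)
        # Ignore jars of other artifacts that merely share the prefix.
        if match and match.group("artifact") == artifact:
            versions.append(tuple(int(p) for p in match.group("version").split(".")))
    return sorted(versions)
```

Calling `installed_versions("Fiji.app/jars", "imagej-tensorflow")` would list every numeric version present; more than one entry, or a highest entry below `(1, 1, 4)`, would match the "older version overwrites it" scenario described above.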

XarlesSta commented 4 years ago

This is strange. It says the version is 1.15 (see screenshot), but in the TensorFlow library version manager, the highest I can see is 1.14.0. [Screenshot: tensorflowpjar_screenshot]

frauzufall commented 4 years ago

@XarlesSta I finally got Windows running in my VM and was able to reproduce and hopefully fix both issues you reported. Please update and let me know how it goes!

XarlesSta commented 4 years ago

Deborah,

I upgraded to TensorFlow==1.15.0 and TensorFlow-gpu==1.15.0
Upgraded to CUDA 10.1
Upgraded to cuDNN==7.5.1 (downloaded the files and copied them into the respective folders)
Updated Fiji
Changed the TensorFlow setting in Fiji (Edit > Options > TensorFlow) to TF 1.15.0 (CUDA 10.1, CuDNN >= 7.5.1)

Training does not run.

[INFO] Load TensorFlow..
[INFO] Using native TensorFlow version: TF 1.15.0 GPU (CUDA 10.1, CuDNN >= 7.5.1)
Using 10% of training data for validation
[INFO] Tile training and validation data..
[INFO] Generated 300 tiles of shape [32, 32]
[INFO] Create session..
[INFO] Import graph..
[INFO] Normalizing..
[INFO] mean: 232.38467
[INFO] stdDev: 26.186493
[INFO] Augment tiles..
[INFO] Prepare training batches...
4 blind-spots will be generated per training patch of size [16, 16].
[INFO] Prepare validation batches..
4 blind-spots will be generated per training patch of size [16, 16].
[INFO] Start training..
[INFO] Epoch 1/50
java.util.concurrent.ExecutionException: org.tensorflow.TensorFlowException: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node down_level_0_no_0/convolution}}]]
     [[metrics/n2v_abs/Mean/_41]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node down_level_0_no_0/convolution}}]]
0 successful operations.
0 derived errors ignored.
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at de.csbdresden.n2v.train.N2VTraining.train(N2VTraining.java:187)
    at de.csbdresden.n2v.command.N2VTrainCommand.mainThread(N2VTrainCommand.java:153)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.tensorflow.TensorFlowException: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node down_level_0_no_0/convolution}}]]
     [[metrics/n2v_abs/Mean/_41]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node down_level_0_no_0/convolution}}]]
0 successful operations.
0 derived errors ignored.
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:326)
    at org.tensorflow.Session$Runner.run(Session.java:276)
    at de.csbdresden.n2v.train.N2VTraining.runTrainingOp(N2VTraining.java:419)
    at de.csbdresden.n2v.train.N2VTraining.mainThread(N2VTraining.java:310)
    ... 5 more

While using TF 1.14.0 GPU, I still have the same error only during prediction. Training works fine. I ran this one after upgrading to 1.15/7.5.1/10.1

{name=Mon Jun 29 09:49:24 PDT 2020 lowest loss, description=null, cite=[{text=Krull, A. and Buchholz, T. and Jug, F. Noise2void - learning denoising from single noisy images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), doi=arXiv:1811.10980}], authors=null, documentation=null, tags=[denoising, unet2d], license=null, format_version=0.2.0-csbdeep, language=java, framework=tensorflow, source=n2v, test_input=testinput.tif, test_output=testoutput.tif, inputs=[{name=input, axes=byxc, data_type=float32, data_range=[-inf, inf], halo=[0, 22, 22, 0], shape={min=[1, 16, 16, 1], step=[1, 16, 16, 0]}}], outputs=[{name=activation_11/Identity, axes=byxc, data_type=float32, data_range=[-inf, inf], shape={reference_input=input, scale=[1.0, 1.0, 1.0, 1.0], offset=[0, 0, 0, 0]}}], training={source=de.csbdresden.n2v.train.N2VTraining, kwargs={batchSize=64, learningRate=4.0E-4, trainDimensions=2, neighborhoodRadius=5, numEpochs=50, numStepsPerEpoch=200, patchShape=64, stepsFinished=1200}}, prediction={weights={source=./variables/variables}, preprocess=[{spec=de.csbdresden.n2v.predict.N2VPrediction::preprocess, kwargs={mean=[247.64154], stdDev=[33.277397]}}], postprocess=[{spec=de.csbdresden.n2v.predict.N2VPrediction::postprocess, kwargs={mean=[247.64154], stdDev=[33.277397]}}], dependencies=./dependencies.yaml}}
N2V prediction mean : 247.64154
N2V prediction stdDev: 33.277397
[INFO] Using native TensorFlow version: TF 1.14.0 GPU (CUDA 10.0, CuDNN >= 7.4.1)
[INFO] Loading TensorFlow model Mon Jun 29 10:05:21 PDT 2020 last checkpoint_1593450321754 from source file file:/C:/Users/GARZON~1/AppData/Local/Temp/n2v-3982392754320368095.bioimage.io.zip
[INFO] Caching TensorFlow models to C:\Fiji.app\models
[INFO] Unpacking dependencies.yaml
java.io.FileNotFoundException: C:\Fiji.app\models\Mon Jun 29 10:05:21 PDT 2020 last checkpoint_1593450321754\dependencies.yaml (The filename, directory name, or volume label syntax is incorrect)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at net.imagej.tensorflow.util.UnpackUtil.unZip(UnpackUtil.java:149)
    at net.imagej.tensorflow.util.UnpackUtil.unZip(UnpackUtil.java:131)
    at net.imagej.tensorflow.DefaultTensorFlowService.downloadAndUnpackResource(DefaultTensorFlowService.java:407)
    at net.imagej.tensorflow.DefaultTensorFlowService.modelDir(DefaultTensorFlowService.java:384)
    at net.imagej.tensorflow.DefaultTensorFlowService.loadCachedModel(DefaultTensorFlowService.java:131)
    at net.imagej.modelzoo.consumer.model.tensorflow.TensorFlowModel.loadModelFile(TensorFlowModel.java:145)
    at net.imagej.modelzoo.consumer.model.tensorflow.TensorFlowModel.loadModel(TensorFlowModel.java:110)
    at net.imagej.modelzoo.DefaultModelZooArchive.createModelInstance(DefaultModelZooArchive.java:102)
    at net.imagej.modelzoo.consumer.DefaultModelZooPrediction.loadModel(DefaultModelZooPrediction.java:107)
    at net.imagej.modelzoo.consumer.DefaultModelZooPrediction.run(DefaultModelZooPrediction.java:81)
    at de.csbdresden.n2v.predict.N2VPrediction.run(N2VPrediction.java:89)
    at de.csbdresden.n2v.predict.N2VPrediction.predictPadded(N2VPrediction.java:105)
    at de.csbdresden.n2v.command.N2VTrainPredictCommand.mainThread(N2VTrainPredictCommand.java:213)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[ERROR] Model does not exist or cannot be loaded. Exiting.

frauzufall commented 4 years ago

@XarlesSta thank you for being so incredibly patient. It turns out I was also running TF 1.15.0 on the CPU instead of the GPU, because it is actually not compatible with CUDA 10.1 but with CUDA 10.0; I am not sure where we got the wrong information from. Once I switched back to CUDA 10.0, it ran on the GPU. Are you willing to downgrade again and try that? I'll make sure to update the displayed version in the TensorFlow options command.
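
To make the mismatch concrete, here is a small sketch that reads the "Using native TensorFlow version" console line and compares the CUDA version it displays with the one this thread found to work. The `REQUIRED_CUDA` table contains only what is reported in this thread, and `check_console_line` is a hypothetical helper, not part of Fiji:

```python
import re

# CUDA version each GPU build needs, as reported in this thread only.
REQUIRED_CUDA = {"1.14.0": "10.0", "1.15.0": "10.0"}

# Matches e.g. "TF 1.15.0 GPU (CUDA 10.1, CuDNN >= 7.5.1)".
LINE = re.compile(r"TF (?P<tf>\d+\.\d+\.\d+) GPU \(CUDA (?P<cuda>\d+\.\d+)")


def check_console_line(line):
    """Return (tf_version, shown_cuda, required_cuda), or None if the line does not match."""
    match = LINE.search(line)
    if match is None:
        return None
    tf = match.group("tf")
    return tf, match.group("cuda"), REQUIRED_CUDA.get(tf)


line = "[INFO] Using native TensorFlow version: TF 1.15.0 GPU (CUDA 10.1, CuDNN >= 7.5.1)"
tf, shown, required = check_console_line(line)
if required is not None and shown != required:
    print(f"TF {tf}: console shows CUDA {shown}, but this thread found CUDA {required} is needed")
```

The displayed CUDA version is only a label in the options dialog, so a mismatch here points at the label, not necessarily at the installed toolkit.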

frauzufall commented 4 years ago

Regarding the prediction, I hopefully really fixed that now with imagej-modelzoo-0.1.8, which you can download from here as long as the update sites are down; copy it into Fiji.app/jars and delete the existing imagej-modelzoo-0.1.7.jar file. I get a different prediction result with your example model than the test images in the model suggest, though. Did you replace the training data in the model? Just asking to make sure there is nothing else going wrong.
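
The `FileNotFoundException` in the prediction log reports a path containing the model name "Mon Jun 29 10:05:21 PDT 2020 last checkpoint_1593450321754". Windows does not allow `:` in file or directory names, so the timestamp in that name is one plausible cause. This is an assumption: the thread does not say what imagej-modelzoo-0.1.8 actually changed. A hypothetical helper that makes such a name safe could look like this:

```python
import re

# Characters that Windows rejects in file and directory names.
WINDOWS_FORBIDDEN = re.compile(r'[<>:"/\\|?*]')


def safe_dir_name(name, replacement="-"):
    """Replace characters that are invalid in Windows file names."""
    cleaned = WINDOWS_FORBIDDEN.sub(replacement, name)
    # Windows also rejects names ending in a space or a dot.
    return cleaned.rstrip(" .")


print(safe_dir_name("Mon Jun 29 10:05:21 PDT 2020 last checkpoint_1593450321754"))
# prints: Mon Jun 29 10-05-21 PDT 2020 last checkpoint_1593450321754
```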

XarlesSta commented 4 years ago

Deborah. It is working fine now. Hurray!!!

For reference, I did the following:

Downgraded to CUDA 10.0
Downloaded the cuDNN files (cuDNN v7.6.5.32 for CUDA 10.0, Windows 10 x64) and copied them into place
Deleted imagej-modelzoo-0.1.7.jar from the Fiji.app/jars folder
Downloaded imagej-modelzoo-0.1.8.jar and copied it into the Fiji.app/jars folder
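
The jar swap in the last two steps can be sketched as a script. This is an illustration under assumptions: the helper `replace_jar` is hypothetical, it treats any `<artifact>-<digit>...jar` file as an older version, and Fiji should presumably be closed while it runs:

```python
import shutil
from pathlib import Path


def replace_jar(jars_dir, artifact, new_jar):
    """Delete every `<artifact>-<digit>*.jar` in jars_dir, then copy new_jar in."""
    jars_dir = Path(jars_dir)
    new_jar = Path(new_jar)
    removed = []
    for old in jars_dir.glob(f"{artifact}-[0-9]*.jar"):
        # Never delete the new jar itself if it already sits in jars_dir.
        if old.resolve() != new_jar.resolve():
            old.unlink()
            removed.append(old.name)
    target = jars_dir / new_jar.name
    if not target.exists():
        shutil.copy2(new_jar, target)
    return sorted(removed)
```

For the case in this thread the call would be `replace_jar("Fiji.app/jars", "imagej-modelzoo", "imagej-modelzoo-0.1.8.jar")`, with the second path pointing at wherever the downloaded jar was saved.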

Thanks for your help. It was quite a journey.

Notes & questions:

Is it possible to choose the folder where the model is stored? As it is right now, one has to find the "user/temp/.../" folder and extract the model from there to a convenient location (e.g. the Desktop or the data folder). It would be quite nice if one could just "save as" the model to a convenient location.
The TensorFlow library version manager still shows "TF 1.15.0 GPU (CUDA 10.1 CuDNN>=7.5.1)". I imagine that was changed in imagej-modelzoo-0.1.8.jar but has not been updated in the TensorFlow manager.
If I have a time series rather than a volume, should I assign the dimensions as "XYB" or as "XYZ"?

frauzufall commented 4 years ago

You really helped a lot making everything work on Windows, thank you!! I hope you can now actually benefit from the plugin.

I agree. I'll think of something and improve this.
This is fixed in imagej-tensorflow-1.1.6, which I was not yet able to upload because the updater is currently not working.
Use XYB; in this case each timestep is processed individually. Use XYZ only if you actually trained on volumetric data and the input prediction data is also volumetric. Not sure if one could use the 3D training mode to train on time series and then use XYZ to predict time series, what do you think @tibuch?
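
What "each timestep is processed individually" means for XYB can be illustrated with a toy example. The 3x3 mean filter below is only a stand-in for a 2D denoiser and has nothing to do with the actual N2V network; `smooth_2d` and `predict_xyb` are hypothetical names:

```python
def smooth_2d(frame):
    """3x3 mean filter with edge clamping; a stand-in for a 2D denoiser."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += frame[yy][xx]
            out[y][x] = total / 9.0
    return out


def predict_xyb(stack):
    """Treat the third axis as batch: every frame is processed on its own."""
    return [smooth_2d(frame) for frame in stack]


# Two 3x3 frames; changing frame 1 must not change the result for frame 0.
frames = [[[1.0] * 3 for _ in range(3)], [[5.0] * 3 for _ in range(3)]]
before = predict_xyb(frames)[0]
frames[1][1][1] = 100.0
after = predict_xyb(frames)[0]
print(before == after)  # True
```

With a volumetric (XYZ) model, neighbouring slices would contribute to each output pixel, which is why that mode only makes sense when the third axis really is spatial.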

fadelvalle commented 3 years ago

Hi! My GPU doesn't seem to do any work when training the models (no load on the system monitor), even though I have installed TensorFlow without problems and it shows in the Fiji menu as being used.

Does N2V only work with CUDA 10 drivers? I'm using CUDA 11.3. The cuDNN libraries are also up to date. Thanks in advance!

juglab / N2V_fiji

Fiji does not find TF-GPU #18