uncomplicate / deep-diamond

A fast Clojure Tensor & Deep Learning library
https://aiprobook.com
Eclipse Public License 1.0

mnist test fails with CUDNN_STATUS_NOT_SUPPORTED #18

Closed behrica closed 2 years ago

behrica commented 2 years ago

I have the mnist-classification-test failing:

OS: arch-linux
CUDA: cuda_11.6.1_510.47.03_linux
CUDNN: cudnn-linux-x86_64-8.4.0.27_cuda11.6

The tests in uncomplicate.diamond.functional.mnist.mnist-classification-test all fail with: CUDNN_STATUS_NOT_SUPPORTED

All tests in test/uncomplicate/diamond/internal/cudnn/

Any idea what that could be?

behrica commented 2 years ago

What is weird is that all tensors seem to contain "0" only...

(frequencies train-labels)
;; => {0 60000}

blueberry commented 2 years ago

What is your hardware?

Can you please paste the output of nvidia-smi?

behrica commented 2 years ago

The issue was a misspelling in the filenames of the input files. map-tensor does not fail on that, but reads all zeros...

blueberry commented 2 years ago

These files are full of images of white numbers on black background. Most of the numbers are zeroes (but not all of course). But, if you're only looking at the first 100 numbers or so, these are 0.

behrica commented 2 years ago

No, I had a "wrong spelling" in the file names. All labels were 0 as well. Somehow map-tensor ignores "file not found" errors.

behrica commented 2 years ago

This does not fail, but produces a tensor of all 0:

(def train-images-file (random-access "asdòlsadaòd"))
(def train-images (map-tensor train-images-file [60000 1 28 28] :uint8 :nchw :read 16))

blueberry commented 2 years ago

"asdòlsadaòd" has to be the actual file containing the MNIST images dataset (in the binary matrix form explained on the MNIST dataset site and in the DLFP book).
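
A cheap sanity check follows from that binary layout: the file size must equal the header size plus one byte per element. The sketch below assumes that the last argument to map-tensor in the example above (16) is the byte offset of the header, and that :uint8 means one byte per element. `expected-bytes` and `size-matches?` are hypothetical helper names, not part of Deep Diamond.

```clojure
(import '(java.io File))

(defn expected-bytes
  "Size in bytes of a file holding `offset` header bytes followed by
  one unsigned byte per element of `shape`. Hypothetical helper."
  [shape offset]
  (+ offset (reduce * shape)))

(defn size-matches?
  "True if the file at `path` has exactly the expected size.
  A missing file reports length 0, so it never matches."
  [^String path shape offset]
  (= (.length (File. path)) (expected-bytes shape offset)))

(expected-bytes [60000 1 28 28] 16) ; => 47040016
```

A file that was freshly created by mistake has length 0, so `(size-matches? "asdòlsadaòd" [60000 1 28 28] 16)` returns false before any tensor is mapped.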

behrica commented 2 years ago

Yes. The original issue arose because I had a misspelled filename, so the filename I gave did not exist on disk.

But to my surprise, map-tensor does not fail when given a nonexistent file. It returns tensors of all zeros, which in the end led to [CUDNN_STATUS_NOT_SUPPORTED].

So we can keep this closed, but I would suggest making map-tensor fail on nonexistent files. This could help avoid future confusion.
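
In the meantime, the caller can guard the path before it ever reaches map-tensor. This is a minimal sketch using only java.io; `require-existing-file` is a hypothetical helper name, not something Deep Diamond provides.

```clojure
(import '(java.io File FileNotFoundException))

(defn require-existing-file
  "Returns `path` unchanged if it names an existing regular file,
  otherwise throws FileNotFoundException. Hypothetical helper."
  [^String path]
  (if (.isFile (File. path))
    path
    (throw (FileNotFoundException. (str "No such data file: " path)))))

;; Intended use, with random-access taken from the example in this thread:
;; (def train-images-file
;;   (random-access (require-existing-file "data/mnist/train-images-idx3-ubyte")))
```

With this guard, a misspelled name fails immediately with a clear message instead of surfacing much later as a cuDNN error.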

blueberry commented 2 years ago

I would say it works as intended. Here's why:

  1. map-tensor does not deal with file management per se. It is up to the caller to provide a valid file. And you did provide a valid file. How so, if you mistyped the name?
  2. train-images-file is a RandomAccessFile (standard Java 7). If you check your code, you'll see that the file object exists after random-access returns. How so?
  3. As explained in the standard Java docs, https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html#RandomAccessFile(java.lang.String,%20java.lang.String), the constructor is going to create a file with the provided name if one doesn't exist. Only if it can't create a new file with that name, the exception is going to be thrown: "FileNotFoundException - if the mode is "r" but the given string does not denote an existing regular file, or if the mode begins with "rw" but the given string does not denote an existing, writable regular file and a new regular file of that name cannot be created, or if some other error occurs while opening or creating the file".
  4. You don't have to use the random-access function to grab the file with your data. It is important that you provide a file that can be mapped to. Use any Java/Clojure method that does that in a way that satisfies your constraints and requirements.
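
The behaviour described in point 3 can be checked with plain Java interop, independent of Deep Diamond. This is a minimal sketch, assuming a writable temp directory; `probe-open` and `missing-path` are hypothetical names used only for illustration.

```clojure
(import '(java.io File RandomAccessFile FileNotFoundException))

(defn probe-open
  "Tries to open `path` with the given RandomAccessFile mode (\"r\" or \"rw\").
  Returns :opened on success, :not-found on FileNotFoundException."
  [^String path ^String mode]
  (try
    (with-open [raf (RandomAccessFile. path mode)]
      :opened)
    (catch FileNotFoundException _
      :not-found)))

;; A path that does not exist: create a temp file, then delete it.
(def missing-path
  (let [f (File/createTempFile "no-such-mnist" ".bin")]
    (.delete f)
    (.getPath f)))

(probe-open missing-path "r")   ; read-only mode refuses a missing file
(probe-open missing-path "rw")  ; read-write mode creates an empty file instead
(.length (File. missing-path))  ; the new file exists and holds 0 bytes
(.delete (File. missing-path))  ; clean up the file that "rw" created
```

The second call is the situation from this thread: the misspelled name becomes a new, empty file, so nothing fails at open time.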

behrica commented 2 years ago

Finally I understand the confusion.

This line of the example:

(def train-images-file (random-access "data/mnist/train-images-idx3-ubyte"))

opens the file in mode "rw" (and not in "r", as I would have expected).

This is why the Java IO functions do not fail, even when given a nonexistent file name.

So probably there is nothing wrong anywhere, except that I should open data files in mode "r" to avoid issues in case of wrong file names:

(def train-images-file (random-access "data/mnist/train-images-idx3-ubyte" "r"))

blueberry commented 2 years ago

Theoretically, yes. The mode should be :read in Clojure. However, try it out first, because I think that even that mode is not going to throw any exception.