vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

libjpeg error #694

Closed: cutybug closed this issue 7 years ago

cutybug commented 8 years ago

Hello. I'm testing the example code in cnn_imagenet.m, but I keep getting the following error:

" ...(some image file name)...: error 'libjpeg: Improper call to JPEG library in state 202' "

It appears constantly during training. (Training itself does not terminate, but the message is repeated over and over in the console window.) The only things I have changed in the example code are the path to the ImageNet data and "opts.train.gpus = [1 2 3 4];" (the machine has four Titan X GPUs).

I have compiled Matconvnet as follows:

vl_compilenn('enableGpu', true, 'cudaRoot', '/usr/local/cuda', 'cudaMethod', 'nvcc', 'enableCudnn', true, 'cudnnRoot', '/usr/local/cuda');

My system is: Xubuntu 14.04, cuda/cudnn 7.5, MATLAB R2016a, and the latest MatConvNet. I guess this error has something to do with the libjpeg library on the system. I have been trying for more than a week to find a fix, with no success. (I couldn't find much information about this error on the web.)

If anybody can help me, I will really appreciate it. Thank you in advance.

lenck commented 8 years ago

Hi, I was not able to find much about it either... :/ But a few questions:

This may help to track the issue... I hope :P

cutybug commented 8 years ago

Thanks for providing some directions.

  1. Not all, but for a substantial amount of images. (It seems that most of the images get this message after some point during the training procedure.)
  2. dpkg says that I have the following libraries: libjpeg-dev:amd64, libjpeg-turbo8:amd64, libjpeg-turbo8-dev:amd64, libjpeg8:amd64, libjpeg8-dev:amd64. vl_imreadjpeg.mexa64 seems to be using libjpeg.so.8. (I guess this is libjpeg8?)
  3. Actually, I found out that this happens for all cases (4 GPUs, 1 GPU, or even no GPU = CPU). So this is not a GPU or multi-GPU thing.

It seems that I have multiple copies of libjpeg.so on my system. I think I'll try linking against the other library files. Do you know how I can point vl_compilenn to a specific library file?
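To see which libjpeg a compiled MEX file is linked against, `ldd` can be run on it and the output filtered for JPEG libraries. The following is a minimal sketch, not part of MatConvNet: the helper name `jpeg_deps` and the sample `ldd` lines are hypothetical, and the MEX path in the comment is an assumed build location. MATLAB sets up its own library search path, so the file resolved inside MATLAB may differ from what `ldd` reports in a plain shell.

```shell
#!/bin/sh
# jpeg_deps: hypothetical helper, not part of MatConvNet.
# Reads `ldd` output on stdin and prints only the JPEG-related
# dependencies as "soname => resolved path".
jpeg_deps() {
    awk 'tolower($1) ~ /jpeg/ {
        # ldd prints "name => path (address)" or "name => not found"
        path = ($3 == "not") ? "not found" : $3
        print $1 " => " path
    }'
}

# Illustrative input only. Real output would come from something like:
#   ldd /path/to/matconvnet/matlab/mex/vl_imreadjpeg.mexa64 | jpeg_deps
printf '\tlibjpeg.so.8 => /usr/lib/x86_64-linux-gnu/libjpeg.so.8 (0x0)\n\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0)\n' | jpeg_deps
```

If more than one libjpeg is installed, comparing this result with `ldconfig -p | grep libjpeg` shows which of the installed copies is actually picked up.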

tinalegre commented 8 years ago

Any news about this issue? I have also been trying to train AlexNet on ImageNet on a GPU (Tesla K40). I got the same errors as cutybug right after the first training iteration (around train: epoch 01: 250/5005:) using the current version of MatConvNet (beta 22). As I was having problems with the compilation/installation, I compiled the library with GPU support in the simplest way: vl_compilenn('enableGpu', true). Some of the images for which it fails are:

/scratch/imagenet/images/train/n01797886/n01797886_9854.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n02125311/n02125311_16634.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n02133161/n02133161_3988.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n09246464/n09246464_23298.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n02105641/n02105641_12238.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n02027492/n02027492_2535.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n01667778/n01667778_21162.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n01608432/n01608432_11728.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'

It is worth mentioning that the error message is not shown for all images. Also, training AlexNet on ImageNet with older versions of MatConvNet went fine. I am using:

Linux System: CentOS Linux release 7.2.1511
GCC: version 4.8.5 20150623
Matlab: R2015b
Cuda: 7.5
cuDNN: v4

$ ldconfig -p | grep libjpeg
    libjpeg.so.62 (libc6,x86-64) => /lib64/libjpeg.so.62
    libjpeg.so.62 (libc6) => /lib/libjpeg.so.62
    libjpeg.so (libc6,x86-64) => /lib64/libjpeg.so
    libjpeg.so (libc6) => /lib/libjpeg.so
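
To check whether the same files fail on every run or whether the failures move around, the failing paths can be pulled out of a saved training log. A minimal sketch, assuming the log lines have exactly the format shown above; `failed_images` is a hypothetical helper name, not part of MatConvNet.

```shell
#!/bin/sh
# failed_images: hypothetical helper, not part of MatConvNet.
# Reads a training log on stdin and prints each image path that was
# reported with a libjpeg error, once, in sorted order.
failed_images() {
    sed -n "s/^[[:space:]]*\(.*\): error 'libjpeg:.*/\1/p" | sort -u
}

# Example with two lines from the log above (one repeated) and a progress line.
failed_images <<'EOF'
/scratch/imagenet/images/train/n02125311/n02125311_16634.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
train: epoch 01: 250/5005:
/scratch/imagenet/images/train/n01797886/n01797886_9854.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
/scratch/imagenet/images/train/n02125311/n02125311_16634.JPEG: error 'libjpeg: Improper call to JPEG library in state 202'
EOF
```

The example prints the two distinct paths. If the list differs between two runs over the same data, the files themselves are less likely to be the cause than the reader's internal state.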

tinalegre commented 8 years ago

I decided to train AlexNet on ImageNet using the example provided with MatConvNet, despite the libjpeg errors mentioned above. For reference, below is the graph I obtained after 20 iterations (objective: 6.910 top1err: 0.999 top5err: 0.995). One can clearly see that the training is not working as it should, mainly because the images could not be read. Any ideas on how to solve the libjpeg problem are very welcome! I hope this is not a major issue and that other models can be trained properly.

[Image: AlexnetNew]

Below are the training results of AlexNet on ImageNet after 20 iterations with a previous version of MatConvNet.

[Image: AlexnetOld]
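
The numbers reported for the failing run (objective 6.910, top1err 0.999, top5err 0.995) can be compared against chance level. A 1000-class classifier (ILSVRC has 1000 classes) that outputs a uniform distribution has a cross-entropy of ln(1000) and expected top-1 and top-5 errors of 1 - 1/1000 and 1 - 5/1000. A short sketch of that arithmetic, where `chance_levels` is a hypothetical helper name:

```shell
#!/bin/sh
# chance_levels K: prints the loss and error rates of a K-class classifier
# that predicts a uniform distribution over the classes.
chance_levels() {
    awk -v k="$1" 'BEGIN {
        printf "objective %.3f\n", log(k)    # cross-entropy of a uniform prediction = ln K
        printf "top1err %.3f\n", 1 - 1 / k   # chance of missing with one guess
        printf "top5err %.3f\n", 1 - 5 / k   # chance of missing with five guesses
    }'
}

chance_levels 1000
# prints:
#   objective 6.908
#   top1err 0.999
#   top5err 0.995
```

The reported values sit essentially at these chance levels, which is consistent with the network receiving no usable image data rather than with a slow but working training run.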

lenck commented 8 years ago

We really cannot reproduce this issue, even though we run it on almost exactly the same configuration.

Just one question: is the mount /scratch/ local storage, or some fancier file system? Maybe the new implementation has issues with that...

Also, does the same happen when you pre-process the images to constant size (with utils/preprocess-imagenet.sh)?

tinalegre commented 8 years ago

@lenck Thanks for your reply. '/scratch' is simple local storage, nothing fancy. I downloaded the toolbox again and reinstalled/recompiled everything, but it didn't help:

make ARCH=glnxa64 MATLABROOT=/usr/local/MATLAB/R2015b/ ENABLE_GPU=yes CUDAROOT=/usr/local/cuda-7.5/ CUDAMETHOD=nvcc ENABLE_CUDNN=yes CUDNNROOT=/opt/cuDNN-v5.1/ ENABLE_IMREADJPEG=yes LIBJPEG_INCLUDE=/usr/include/ LIBJPEG_LIB=/usr/lib64/

When I run the cnn_imagenet example without the GPU, it works. So there seems to be a problem with the compilation of vl_imreadjpeg. Any ideas/suggestions?

tinalegre commented 8 years ago

@cutybug were you able to solve the problem?

cutybug commented 8 years ago

Unfortunately, no :(

lenck commented 7 years ago

Hmm, maybe a hacky workaround would be to pre-process the images with utils/preprocess-imagenet.sh (thanks Giorgos). Probably there are some bad JPEG files in the original dataset which break the state of libjpeg, so that it fails on the next image. I will try to do some tests, but thanks to the impending CVPR deadline it may take some time...
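
One cheap way to test the bad-file hypothesis is to scan the dataset for files that do not look like JPEG at the byte level. In libjpeg's `jpegint.h`, state 202 is `DSTATE_READY` (header parsed, decompression not yet started), which would fit a decoder object being reused after an earlier read was abandoned. The sketch below is not part of MatConvNet and all function names are hypothetical. It only checks the SOI marker (FF D8) at the start and the EOI marker (FF D9) at the end, so it cannot catch files that are valid JPEG but in an unusual format.

```shell
#!/bin/sh
# Hypothetical helpers, not part of MatConvNet.

# hex_head FILE N / hex_tail FILE N: first or last N bytes as lowercase hex.
hex_head() { head -c "$2" <"$1" | od -An -v -tx1 | tr -d ' \n'; }
hex_tail() { tail -c "$2" <"$1" | od -An -v -tx1 | tr -d ' \n'; }

# classify_jpeg FILE: prints "ok", "not-jpeg" (no SOI marker FF D8),
# or "no-eoi" (does not end with the EOI marker FF D9).
classify_jpeg() {
    if [ "$(hex_head "$1" 2)" != "ffd8" ]; then
        echo "not-jpeg"
    elif [ "$(hex_tail "$1" 2)" != "ffd9" ]; then
        echo "no-eoi"
    else
        echo "ok"
    fi
}

# scan_jpegs DIR: lists every suspicious *.jpg / *.jpeg file under DIR as
# "status<TAB>path". Assumes file names contain no newlines.
scan_jpegs() {
    find "$1" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) |
    while IFS= read -r f; do
        status=$(classify_jpeg "$f")
        [ "$status" = "ok" ] || printf '%s\t%s\n' "$status" "$f"
    done
}
```

For the layout mentioned in this thread, a call such as `scan_jpegs /scratch/imagenet/images/train > suspicious.txt` would collect the candidates. A "no-eoi" result is only a hint, since some decoders tolerate trailing bytes after the EOI marker; a "not-jpeg" result means the file is not a JPEG at all despite its extension.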

HunterHantao commented 6 years ago

Is there any update on this error? I am running into the same error.