An error appears during the training that may pass in a non-contiguous input.

hellojialee / Improved-Body-Parts

Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation

https://arxiv.org/abs/1911.10529

258 stars, 42 forks

An error appears during the training that may pass in a non-contiguous input. #13

Closed: mengfanShi closed this issue 4 years ago

mengfanShi commented 4 years ago

So glad to see your project. I successfully ran the demo and created the h5 file. But when I try to train the model, an error appears: RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. I really hope to get your help, thank you very much.
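For readers unfamiliar with the "non-contiguous input" wording in that message, here is a minimal sketch of what it refers to in PyTorch. It is only an illustration and is not taken from this repository's training code; the helper name contiguity_demo is hypothetical, and the import is guarded so the snippet also loads where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # PyTorch not installed; the sketch then does nothing
    torch = None


def contiguity_demo():
    """Return (is_contiguous before, is_contiguous after) for a transposed view."""
    if torch is None:
        return None
    x = torch.arange(6).reshape(2, 3)
    y = x.t()               # transposed view, shares storage with x
    fixed = y.contiguous()  # copies the data into a densely packed tensor
    return y.is_contiguous(), fixed.is_contiguous()


print(contiguity_demo())
```

Calling .contiguous() on a tensor before it reaches a cuDNN-backed layer is one thing to try, although, as the replies below suggest, this error message can have other causes.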

hellojialee commented 4 years ago

Hi, I'm glad you are interested in this repo. It seems that many things could cause this error. For example: https://discuss.pytorch.org/t/resolved-batchnorm1d-cudnn-status-not-supported/3049 I use CUDA 10.1 and cuDNN 7.4, and PyTorch 1.0, 1.1, 1.2, 1.3 and 1.4 all work fine. Please first make sure the package versions are appropriate before trying anything else.
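A quick way to check the versions mentioned above is to ask PyTorch what it was built with. This is a small sketch, not part of the repository; collect_versions is a hypothetical helper name, and the import is guarded so it returns a placeholder when PyTorch is missing.

```python
def collect_versions():
    """Gather the PyTorch, CUDA and cuDNN versions visible to this Python environment."""
    info = {}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed here
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda             # CUDA version PyTorch was built with
    info["cuda_available"] = torch.cuda.is_available()  # whether a usable GPU is visible
    info["cudnn"] = torch.backends.cudnn.version()      # integer such as 7603, or None
    return info


for name, value in collect_versions().items():
    print(name, value)
```

Note that these are the versions bundled with or seen by PyTorch, which may differ from a separately installed system CUDA toolkit.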

hellojialee commented 4 years ago

Hi :) I have received your email. I'm using HDF5 1.10, which supports SWMR mode (better multiprocessing handling). https://github.com/jialee93/Improved-Body-Parts/blob/316e71fa93e1dc444b1cfd4fc312c21c13bfe93f/py_cocodata_server/py_data_iterator.py#L42

There is a good discussion here, and I summarized the discussion here.

mengfanShi commented 4 years ago

Thanks for your response : ) I find that even the h5py installed by pip (version 2.10.0) supports SWMR mode. Is it still necessary to install HDF5 and rebuild h5py? I also tested train_parallel.py, and the same error occurs T_T. BTW, I use CUDA 10.1 and cuDNN 10.0, PyTorch 1.4.0
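One way to answer the SWMR question for a given installation is to look at the HDF5 version that h5py was built against. The sketch below assumes the threshold stated in the h5py documentation (HDF5 1.9.178 or later); the names MIN_SWMR_HDF5, supports_swmr and swmr_report are hypothetical helpers, not part of h5py or this repository.

```python
# Assumption: minimum HDF5 version for SWMR, as given in the h5py documentation.
MIN_SWMR_HDF5 = (1, 9, 178)


def supports_swmr(hdf5_version):
    """Return True if an HDF5 version tuple is new enough for SWMR mode."""
    return tuple(hdf5_version) >= MIN_SWMR_HDF5


def swmr_report():
    """Describe the installed h5py build, or return None if h5py is missing."""
    try:
        import h5py
    except ImportError:
        return None
    return {
        "h5py": h5py.version.version,
        "hdf5": h5py.version.hdf5_version,
        "swmr_capable": supports_swmr(h5py.version.hdf5_version_tuple),
    }


print(swmr_report())
```

If swmr_capable is already True for the pip-installed h5py, rebuilding against a separate HDF5 installation is unlikely to change SWMR availability.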

mengfanShi commented 4 years ago

I have rebuilt h5py against HDF5, but the error still occurs. It's hard to locate the problem T_T.

hellojialee commented 4 years ago

Sorry for the trouble you are having😓. I have never met such errors before. I haven't used cuDNN 10.0. What if you set torch.backends.cudnn.enabled = False?
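The suggestion above can be tried without editing the training script permanently. The following is a sketch under the assumption that the failing step can be wrapped in a callable; run_without_cudnn is a hypothetical helper, not something provided by PyTorch or this repository.

```python
try:
    import torch
except ImportError:  # PyTorch not installed; the helper then just calls fn
    torch = None


def run_without_cudnn(fn, *args, **kwargs):
    """Call fn with cuDNN disabled, then restore the previous setting."""
    if torch is None:
        return fn(*args, **kwargs)
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = False  # fall back to PyTorch's native kernels
    try:
        return fn(*args, **kwargs)
    finally:
        torch.backends.cudnn.enabled = previous


# Usage with a stand-in function; in practice fn would be one training step.
print(run_without_cudnn(lambda a, b: a + b, 1, 2))
```

If the error disappears with cuDNN disabled, the problem is likely tied to the cuDNN installation or version; if it persists, the cause probably lies elsewhere.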

mengfanShi commented 4 years ago

I have tried it before, but it did not help.

hellojialee commented 4 years ago

:( Then, I have no idea for now. Feel free to discuss if more information is found.
