voldemortX / pytorch-auto-drive

PytorchAutoDrive: Segmentation models (ERFNet, ENet, DeepLab, FCN...) and Lane detection models (SCNN, RESA, LSTR, LaneATT, BézierLaneNet...) based on PyTorch with fast training, visualization, benchmarking & deployment help
BSD 3-Clause "New" or "Revised" License

Current limit exceeds maximum limit #114

Closed mikeklm closed 1 year ago

mikeklm commented 1 year ago

I'm trying to run validation on the CULane dataset using the SCNN model with a VGG16 backbone, using your pre-trained model: vgg16_scnn_culane_20210309.pt

I followed all of the instructions for setting up the dataset and the configuration file for my use case.

I get an error I don't understand in the first few lines of 'main_landet.py':

Exception has occurred: ValueError

current limit exceeds maximum limit
  File "/home/mkraus/field-apps-eng/pytorch-auto-drive/main_landet.py", line 18, in <module>
    resource.setrlimit(resource.RLIMIT_NOFILE, (8192, rlimit[1]))

The command line I'm using to run is python main_landet.py --val --config=pathtomyconfig

I'm running Linux 4.15.0-192-generic x86_64 with Python 3.7.13 on a 128-core machine with 512 GB of RAM. My accelerators are 8 A100 cards.

This seems to be a very low-level issue that has to do with resource limits. Can someone please help me resolve this issue, and explain why this inference script needs to check the number of file descriptors allowed by my kernel? Thanks

voldemortX commented 1 year ago

@mikeklm This was set as a system-level guarantee for multiprocessing in data loading. The main reason is that most lane detection models are lightweight and have a very low GPU utilization ratio, so the number of workers is relatively high by default. The rlimit is set to 8192 by default since most machines allow such a value.

You could just comment out those two lines for now. I'll think about changing the default code or adding some warnings.
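
For reference, a minimal sketch of what the warning idea could look like (an illustration only, not necessarily the actual change later made in the repo), assuming the limit is raised at import time as the traceback above suggests:

```python
import resource
import warnings

# Try to raise the soft limit on open file descriptors to 8192, but clamp it
# to the hard limit and warn instead of raising if that is not possible.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 8192 if hard == resource.RLIM_INFINITY else min(8192, hard)
if soft < target:
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except ValueError:
        warnings.warn(
            f"Could not raise RLIMIT_NOFILE to {target} (soft={soft}, hard={hard}); "
            "data loading with many workers may hit 'Too many open files'."
        )
```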

mikeklm commented 1 year ago

Thank you very much for replying. I will try commenting out those lines.

I'm not sure I completely follow the point about the lightweight models. When you say you need a large number of workers, are you talking about workers on the device side or the host side?

Do you need to send so many files to the device at a time that a huge number of file descriptors must be available on the host in order to pass the data to the device efficiently? Thanks

voldemortX commented 1 year ago

Data loading and augmentation are both done on the CPU, which is not fast enough compared to the GPU forward/backward pass. So there are usually half a dozen worker processes that use shared memory to load data. Although maybe this only affects the file_system sharing mode; I'm not so sure right now. Do you have more insight on this?
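
For context, a minimal sketch of the kind of multi-worker loading being described, with a hypothetical stand-in dataset (the class name, image size, and loader settings are illustrative, not the repo's actual code):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyLaneDataset(Dataset):
    """Hypothetical stand-in for a CULane-style dataset (illustration only)."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # In the real pipeline this would read an image from disk and run
        # CPU-side augmentation before handing the tensor to the GPU.
        return torch.rand(3, 288, 800), torch.zeros(1)

# Each worker is a separate process; loaded tensors travel back to the main
# process through shared memory, and under the file_descriptor sharing
# strategy every shared tensor keeps a file descriptor open, so a large
# batch size times many workers can approach the per-process FD limit.
loader = DataLoader(DummyLaneDataset(), batch_size=20, num_workers=8,
                    shuffle=True, pin_memory=True)
```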

mikeklm commented 1 year ago

Let me think about this and get back to you. I'm not familiar enough yet with your pre-processing pipeline and your augmentation strategy.

I just started looking at this type of network and your code on Friday.

Have you tried profiling both host and device for one or two iterations?

There are a number of general methods to decrease the latency of pre-processing. For example, I have done augmentation on the device in this type of situation (though I wasn't using a GPU). It all depends on the resources available.

In some cases, if the dataset is small, it may make sense to cache the augmented images during training, or even store them to disk ahead of time, rather than processing them in real time.
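
For instance, a rough sketch of the offline-caching idea, with a hypothetical augment callable and cache directory (not code from this repo):

```python
import os
import pickle

def cache_augmented(samples, augment, cache_dir):
    """Pre-compute augmented samples once and store them to disk.

    samples   -- iterable of (image, label) pairs
    augment   -- the augmentation callable normally applied per iteration
    cache_dir -- directory for the pickled, already-augmented samples
    """
    os.makedirs(cache_dir, exist_ok=True)
    for i, (image, label) in enumerate(samples):
        with open(os.path.join(cache_dir, f"{i}.pkl"), "wb") as f:
            pickle.dump(augment(image, label), f)
```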

Let me give it some thought and we can talk offline.

I'll close this issue. Commenting out those lines of code allowed me to run without any issues. Thanks

voldemortX commented 1 year ago

Sounds great, looking forward to your investigations!

voldemortX commented 1 year ago

@mikeklm I think I recall the reason for this after looking here: https://pytorch.org/docs/1.6.0/multiprocessing.html#sharing-strategies

We might have a lot of open files when the batch size and the number of worker processes are both large; that was probably why I set the 8192 limit when training LSTR.
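
If the limit ever does become a problem, the linked page also describes switching PyTorch's sharing strategy, which trades file descriptors for files in shared memory; a minimal sketch:

```python
import torch.multiprocessing as mp

# The default Linux strategy ("file_descriptor") keeps an FD open for each
# shared tensor; "file_system" uses named files in shared memory instead,
# so it does not consume file descriptors (at the cost of possible leaked
# files if a worker dies without cleanup).
print(mp.get_all_sharing_strategies())   # e.g. {'file_descriptor', 'file_system'}
mp.set_sharing_strategy('file_system')
```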

In #117 I reduced the error to a warning, since probably nothing will go wrong even if we use the default limit on most systems (usually the default is 1024).