Closed mikeklm closed 1 year ago
@mikeklm This was set as a system-level guarantee for multiprocessing in data loading. The main reason: most lane detection models are lightweight and have very low GPU utilization, so the default number of data-loading workers is relatively high. The rlimit is set to 8192 by default since most machines allow that value.
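For reference, this kind of limit can be raised at runtime with Python's standard `resource` module (Linux/macOS only; the 8192 target and the function name are illustrative, not the project's exact code):

```python
import resource

def raise_fd_limit(target=8192):
    """Try to raise the soft open-file limit, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    cap = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if cap > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (cap, hard))
        soft = cap
    return soft
```

Raising the soft limit up to the hard limit needs no privileges, so this is a gentler alternative to requiring users to change system settings.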
You could just comment those two lines for now. I'll think about changing the default code or adding some warnings.
Thank you very much for replying. I will try commenting out those lines.
I'm not sure I completely follow about the lightweight model. When you say you need a large number of workers, are you talking about workers on the device side or the host side?
Do you need to send so many files to the device at a time that you need to have a huge number of file descriptors available on the host, in order to pass the data to the device efficiently? Thanks
The data loading and augmentation are both done on the CPU, which is not fast enough compared to the GPU forward/backward pass. So there are usually half a dozen worker processes that use shared memory to load data. Although maybe this only affects the file-system sharing mode; I'm not so sure right now. Do you have more insight on this?
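A rough back-of-the-envelope helps explain why worker count matters here: under PyTorch's file_descriptor sharing strategy, each shared tensor in flight holds one descriptor. The tensor count and prefetch factor below are assumptions for illustration, not measured values:

```python
def estimated_fds(num_workers, batch_size, tensors_per_sample=2, prefetch_factor=2):
    """Very rough upper bound on descriptors held by in-flight shared tensors."""
    in_flight_batches = num_workers * prefetch_factor
    return in_flight_batches * batch_size * tensors_per_sample

# e.g. 6 workers, batch size 64: 6 * 2 * 64 * 2 = 1536 descriptors,
# already above a typical soft limit of 1024 before counting the
# process's own open files.
```

This is only a sketch of the scaling, but it shows how a modest worker count and batch size can exceed the common 1024 default.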
Let me think about this and get back to you. I'm not familiar enough yet with your pre-processing pipeline and your augmentation strategy.
I just started looking at this type of network and your code on Friday.
Have you tried profiling both host and device for one or two iterations?
There are a number of general methods to decrease the latency of pre-processing. For example, I have done augmentation on device in this type of situation (I wasn't using a GPU however). It all depends on the resources available.
In some cases, if the dataset is small, it may make sense to cache the augmented images during training, or even store them to disk ahead of time, rather than processing them in real time.
Let me give it some thought and we can talk offline.
I'll close this issue. Commenting out those lines allowed me to run without any issues. Thanks
Sounds great, looking forward to your investigations!
@mikeklm I think I recall what this issue was about after looking here: https://pytorch.org/docs/1.6.0/multiprocessing.html#sharing-strategies
We might have a lot of open files when the batch size and the number of worker processes are both large; that was probably the reason I set the 8192 limit when training LSTR.
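One way to check whether a run is actually approaching the limit is to count the process's open descriptors; on Linux they can be listed under `/proc/self/fd` (a Linux-only sketch, not code from the repo):

```python
import os
import resource

def open_fd_count():
    """Number of file descriptors currently open in this process (Linux)."""
    return len(os.listdir("/proc/self/fd"))

def fd_headroom():
    """Open descriptors versus the soft RLIMIT_NOFILE limit."""
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft
```

Logging this once per epoch would show how close the data-loading workers push the process to the soft limit.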
In #117 I reduced the error to a warning, since on most systems probably nothing will happen even with the default limit (usually 1024).
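The check-then-warn pattern described here can be approximated as follows; the threshold, function name, and message are illustrative, not the exact code from #117:

```python
import resource
import warnings

def check_fd_limit(required=8192):
    """Warn, instead of raising, when the soft fd limit is below `required`."""
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < required:
        warnings.warn(
            f"Open-file soft limit ({soft}) is below {required}; "
            "data loading with many workers may hit 'Too many open files'."
        )
        return False
    return True
```

Unlike the original hard error, this lets users with standard 1024-descriptor systems proceed while still flagging the potential failure mode.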
I'm trying to run validation on the culane dataset using the scnn model with vgg16 backbone, using your pre-trained model: vgg16_scnn_culane_20210309.pt
I followed all of the instructions for setting up the dataset and the configuration file for my use case.
I get an error I don't understand in the first few lines of 'main_landet.py':
Exception has occurred: ValueError
The command line I'm using to run is
python main_landet.py --val --config=pathtomyconfig
I'm running Linux 4.15.0-192-generic x86_64 with Python 3.7.13, on a 128-core machine with 512 GB of RAM. My accelerator is 8 A100 cards.
This seems to be a very low-level issue that has to do with resource limits. Can someone please help me resolve it, and explain why this inference script needs to check the number of file descriptors allowed by my kernel? Thanks