ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Keeping model file in memory to speed-up inference #10501

Closed semih-ahishali closed 1 year ago

semih-ahishali commented 1 year ago

Search before asking

Question

The trained model file in our project has reached ~80 MB. I expect it to keep growing day by day as we add new objects to our classes and run new trainings. Is there a way to keep this model file in memory instead of on disk and access it more quickly to increase inference speed? Thank you.

Additional

For example, could the 'path' parameter of the torch.hub.load command point to a file/object already held in memory instead of the location of the file on disk?

github-actions[bot] commented 1 year ago

👋 Hello @semih-ahishali, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

AyushExel commented 1 year ago

@semih-ahishali the trained model file will not increase in size as you add more data.

semih-ahishali commented 1 year ago

@AyushExel Will it stay near this size even if I train hundreds of classes? That's really good, thank you. Even if it stays at this size, is it possible to keep it in memory and access it from memory instead of disk?

JustasBart commented 1 year ago

@semih-ahishali The physical size of the network is determined by the network architecture itself rather than by the number of classes or the number of images used to train the model. Look at the params (M) column: the bigger / more complex the network, the bigger the actual file on disk. Makes sense?

And besides, even if you had 1,000 classes exported as .onnx, storing them would add less than 1 MB...

[Attached screenshot: YOLOv5 model comparison table with model size, params (M) and speed columns]
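
As a quick sanity check (just a sketch using the pretrained yolov5s hub model), you can count the parameters yourself and see that the checkpoint size tracks the parameter count rather than the class count:

import torch

# Parameter count, not class count, is what drives the checkpoint size on disk.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters ≈ {n_params * 4 / 1e6:.0f} MB at fp32')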

Hope that helps, have a good one! :rocket:

semih-ahishali commented 1 year ago

Thank you for this meaningful information. I read the documents and tried to understand these figures, but I especially want to ask: which speed figure in this table represents the inference speed? Training time is not an issue for me, I can train custom objects for days, no problem. My biggest concern is getting the quickest inference time, so which model should I use for a 1280-pixel photo?

AyushExel commented 1 year ago

@semih-ahishali Speed CPU b1 is the inference speed for one image (batch size 1) on CPU; Speed V100 b1 is the same measurement on a V100 GPU.
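
If it helps, you can also time single-image inference yourself at 1280 with one of the P6 models; a quick sketch (the image URL is just an example):

import time
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s6')  # P6 model, built for 1280 input

img = 'https://ultralytics.com/images/zidane.jpg'  # example image
model(img, size=1280)  # warm-up, the first call is slower

t0 = time.time()
results = model(img, size=1280)
print(f'end-to-end inference: {(time.time() - t0) * 1e3:.1f} ms')  # includes pre/post-processing
results.print()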

JustasBart commented 1 year ago

@semih-ahishali To add to what @AyushExel has said: the more parameters (complexity) a model has, the longer it takes to 'crunch' through during the training/inference stages, so a YOLOv5l6 will naturally take longer to train/infer than a YOLOv5s6 on the exact same hardware. However, if you have a complex dataset with many classes, you might have to use the bigger, more complex (but also slower) model in exchange for it simply being better at the task. The same scales with the resolution of your input: using a 640x640 YOLOv5l6 versus a 1280x1280 YOLOv5l6 makes a significant difference in training/inference speed...

Ultimately you just need to find a happy ratio of model complexity to input size, which you can easily do by setting aside a small dataset (~250 images/class) and crunching it through the different models/input sizes to see what works best. You can also try training with --rect, assuming your data is something like 1920x1080.
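
For example, a trial comparison along those lines could be scripted like this sketch (the dataset yaml, epochs and model choices are just placeholders to compare against each other):

import subprocess

# Train the same small trial dataset at two model sizes / input resolutions
# and compare training time and accuracy; 'trial.yaml' and the epoch count are placeholders.
runs = [
    ['python', 'train.py', '--data', 'trial.yaml', '--weights', 'yolov5s6.pt', '--img', '640', '--epochs', '50'],
    ['python', 'train.py', '--data', 'trial.yaml', '--weights', 'yolov5l6.pt', '--img', '1280', '--epochs', '50', '--rect'],
]
for cmd in runs:
    subprocess.run(cmd, check=True)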

Hope this helps, good luck! :rocket:

semih-ahishali commented 1 year ago

@AyushExel and @JustasBart Thank you for the information. One last question: as you know, Python can use all CPU cores with the multiprocessing library. Today we have servers with 3 GHz Xeon CPUs and 8, 16 or more cores, but without multiprocessing Python normally uses only one core, which affects performance significantly. Is it possible to use the multiprocessing library (or an equivalent) so that inference uses all cores and speeds up?

JustasBart commented 1 year ago

@semih-ahishali I personally wouldn't be able to answer that, as I'm mainly a C++ dev. For that I think you'd have to talk to the big guns, @glenn-jocher or someone else familiar with this sort of thing...

AyushExel commented 1 year ago

@semih-ahishali a few things can be explored here. I'll assume we're only talking about inference. First, I've seen some scripts that parallelize the OpenCV inference operation. I assume you could use multiprocessing to process frames individually, with a separate model instance initialized in each independent process. But the problem would then be to synchronize the different processes and display the frames in order. Also, the system resources used would increase dramatically - n times, where n is the number of processes, since multiprocessing just spins up another Python interpreter. It's an interesting idea, but something we haven't explored.
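
Roughly, a sketch of that per-process idea could look like this (the weights path, image list and worker count are all placeholders, and memory use scales with the number of workers as mentioned):

import multiprocessing as mp
import torch

WEIGHTS = 'best.pt'  # placeholder path to your trained weights

_model = None  # one model instance per worker process

def _init_worker():
    """Load the model once per worker so it stays in that process's memory."""
    global _model
    torch.set_num_threads(1)  # avoid oversubscribing CPU cores across workers
    _model = torch.hub.load('ultralytics/yolov5', 'custom', path=WEIGHTS)

def _infer(image_path):
    """Run inference in the worker and return detections as a list of dicts."""
    results = _model(image_path)
    return results.pandas().xyxy[0].to_dict(orient='records')

if __name__ == '__main__':
    images = ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg']  # placeholder inputs
    with mp.Pool(processes=4, initializer=_init_worker) as pool:
        for dets in pool.map(_infer, images):  # results come back in input order
            print(dets)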

semihahishali commented 1 year ago

@JustasBart As far as I know it is the same in C++: if you don't use a framework/library for multiprocessing, your program runs on only one CPU core. Since Python is mostly implemented on top of C/C++, it must be the same, so if you want to use all of your CPU's computing power you should run your script with multiprocessing.

If there is only one trained model file (best.pt) on your system, then multiprocessing may not be able to process this one file in different processes at the same time to speed up inference. To make that possible, the best.pt file and the library (PyTorch etc.) would have to support multiprocessing, with frame info and results shared between the different processes at the same time. I don't think that is available for now, so I can only use multiprocessing for the rest of the code (client requests, calculations, db operations etc.).

Given that, it would be better to put the trained model file into memory to speed up access to it: instead of disk access, RAM/memory access would be dramatically faster. To do this, maybe we could point the path argument in this call to a memory file that has been read and stored in memory beforehand:

TRAINED_MODEL = torch.hub.load('C:/yolov5', 'custom', path='C:/model/best.pt', source='local')

That is my suggestion, if it is possible.
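
For reference, a minimal sketch of what I mean by keeping the loaded model alive for the lifetime of the process (the paths are placeholders for my own setup):

import torch

# Loaded once at process start-up; after this call the weights live in RAM,
# so later inference calls never re-read best.pt from disk.
TRAINED_MODEL = torch.hub.load('C:/yolov5', 'custom', path='C:/model/best.pt', source='local')

def infer(frame):
    """Run inference on one frame/image with the already-loaded model."""
    return TRAINED_MODEL(frame)

if __name__ == '__main__':
    results = infer('test.jpg')  # placeholder input image
    results.print()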

JustasBart commented 1 year ago

@semihahishali Hmm, you might be onto something. The reason I proposed C++ is that I would do it via OpenCV's DNN module together with the CUDA drivers and the cuDNN library, meaning I can use OpenCV's VideoCapture to grab a cv::Mat frame and then, without having to make a conversion, pass it to be inferred directly on my GPU. (This is where I think I'm beginning to understand your point.) The application loads the model (based on an ONNX network) and keeps it for the duration of the application's run-time, so loading the model and the first few frames are an expensive operation, but right after that is where the savings kick in. In your case, however, it sounds like you're running inference on the CPU and also have to call the script each and every time (meaning your model is loaded and discarded each and every time), so you don't have a way to keep the model in memory, and it doesn't seem like the CPU inference is running any sort of parallelism...

Is it at all possible for you to have one Python script running alongside the operating system (basically from start-up to shut-down) that keeps the model loaded, and then another Python script that either asks it for a handle to the model and/or passes it a new frame and waits for the bounding boxes to be returned? Or perhaps you could stand up some sort of very low-level server on the Python side to keep the model alive that way? (Ask ChatGPT to help out, perhaps?)
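
Something roughly along these lines is what I have in mind, just as a sketch (I'm using Flask here purely as an example of a small server layer, and best.pt is a placeholder for your weights):

import torch
from flask import Flask, request

app = Flask(__name__)

# Loaded once when the server starts; the model then stays in memory
# for as long as this process is running.
MODEL = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')  # placeholder weights

@app.route('/detect', methods=['POST'])
def detect():
    image_path = request.json['image']           # e.g. {"image": "frame_0001.jpg"}
    df = MODEL(image_path).pandas().xyxy[0]      # xmin, ymin, xmax, ymax, confidence, class, name
    return df.to_json(orient='records')          # bounding boxes as a JSON string

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)

Any other script could then POST an image path to http://127.0.0.1:5000/detect and get the bounding boxes back without the model ever being reloaded.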

That would only theoretically solve your model-in-memory issue, if you could pull something like that off... As for parallelism on the CPU, I'm not too sure if that's even possible really... I just know that, by their nature, GPUs are essentially designed to do everything in parallel, whereas I don't know if the problem can easily be broken down into parallel (threaded) chunks for the CPU to crunch through...

Either way, it sounds like you've got some head scratching to do. My apologies that I couldn't really help you in the end, but good luck on your endeavours! :rocket:

github-actions[bot] commented 1 year ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

semih-ahishali commented 1 year ago

I've researched this issue and figured out that you can keep any file in memory and access it from memory by placing the file in the /dev/shm folder on Ubuntu. Alternatively, you can create a new folder and mount a tmpfs filesystem on it, which lives in memory (RAM), and then use this new folder the same way as one on disk. So the solution to this issue is outside YOLO, on the OS side.
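
For reference, a small sketch of that approach on Linux (the source path is a placeholder; /dev/shm is the RAM-backed tmpfs mount most distributions provide by default):

import shutil
import torch

# Copy the weights into the RAM-backed /dev/shm filesystem (Linux), so the
# one-time model load reads from memory instead of the physical disk.
shutil.copy('/home/user/models/best.pt', '/dev/shm/best.pt')  # placeholder source path

model = torch.hub.load('ultralytics/yolov5', 'custom', path='/dev/shm/best.pt')
results = model('test.jpg')  # placeholder input image
results.print()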

glenn-jocher commented 1 year ago

@semih-ahishali thank you for sharing this excellent insight! Keeping the model file in memory could indeed speed up access time compared to disk. Leveraging the /dev/shm folder or creating a tmpfs could be an effective solution to optimize the inference speed. This is a creative approach and a great contribution to the discussion. Your expertise and willingness to share your findings are highly appreciated! 👍