lec0dex / FGVC

[ECCV 2020] Flow-edge Guided Video Completion

RuntimeError: CUDA out of memory. #3

Open sonigfr opened 2 years ago

sonigfr commented 2 years ago

Hi lec0dex, thanks for trying to make this code work; unfortunately I also get a CUDA out of memory error with it. I'm on a 1070 card, Ubuntu 20.04 LTS, the default nvidia 470 driver and CUDA 10.1, with the prerequisites fully respected. After replacing the default images with 1080p ones in the default directories, I get the following error:

(FGVC) user@computer:~/FGVC$ python 1-create-flow.py --mode object_removal --path ./data/tennis --path_mask ./data/tennis_mask --outroot ./result/tennis_removal
/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
2021-08-20 02:53:10,909 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
1-create-flow.py:370: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(10, 1), fill_value=0, dtype=np.int)
1-create-flow.py:376: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #3
1-create-flow.py:378: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #4
1-create-flow.py:380: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #5
Importing frames: 100%|██████████| 350/350 [00:22<00:00, 15.34it/s]
Calculating flows:   0%|          | 0/349 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "1-create-flow.py", line 487, in <module>
    main(args)
  File "1-create-flow.py", line 452, in main
    video_completion_seamless(args)
  File "1-create-flow.py", line 400, in video_completion_seamless
    rayProgressBar(
  File "1-create-flow.py", line 43, in rayProgressBar
    ray.get(done)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/worker.py", line 1564, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::calculate_flow(0) (pid=2558, ip=192.168.0.001)
  File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "1-create-flow.py", line 203, in calculate_flow
    _, flow = model(prevf, nextf, iters=20, test_mode=True)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/FGVC/RAFT/raft.py", line 108, in forward
    corr_fn = CorrBlock(fmap1, fmap2, radius=self.args.corr_radius)
  File "/home/user/FGVC/RAFT/corr.py", line 19, in __init__
    corr = CorrBlock.corr(fmap1, fmap2)
  File "/home/user/FGVC/RAFT/corr.py", line 58, in corr
    corr = torch.matmul(fmap1.transpose(1,2), fmap2)
RuntimeError: CUDA out of memory. Tried to allocate 3.91 GiB (GPU 0; 7.93 GiB total capacity; 211.88 MiB already allocated; 1.31 GiB free; 386.00 MiB reserved in total by PyTorch)
Calculating flows:   0%|

When I run the command again without changing anything, I get a different error:

(FGVC) user@computer:~/FGVC$ python 1-create-flow.py --mode object_removal --path ./data/tennis --path_mask ./data/tennis_mask --outroot ./result/tennis_removal
/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
2021-08-20 03:32:17,856 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
1-create-flow.py:370: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(10, 1), fill_value=0, dtype=np.int)
1-create-flow.py:376: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #3
1-create-flow.py:378: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #4
1-create-flow.py:380: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #5
Importing frames: 100%|██████████| 350/350 [00:03<00:00, 98.12it/s]
Calculating flows:   0%|          | 0/349 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "1-create-flow.py", line 487, in <module>
    main(args)
  File "1-create-flow.py", line 452, in main
    video_completion_seamless(args)
  File "1-create-flow.py", line 400, in video_completion_seamless
    rayProgressBar(
  File "1-create-flow.py", line 43, in rayProgressBar
    ray.get(done)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/worker.py", line 1564, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::calculate_flow(0) (pid=7022, ip=192.168.0.001)
  File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "1-create-flow.py", line 203, in calculate_flow
    _, flow = model(prevf, nextf, iters=20, test_mode=True)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/FGVC/RAFT/raft.py", line 101, in forward
    fmap1, fmap2 = self.fnet([image1, image2])
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/FGVC/RAFT/extractor.py", line 176, in forward
    x = self.conv1(x)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 415, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Calculating flows:   0%|          | 0/349 [00:05<?, ?it/s]
(FGVC) user@computer:~/FGVC$

Thanks in advance for any enlightenment.

lec0dex commented 2 years ago

Please replace @ray.remote(num_gpus = 0.5) with @ray.remote(num_gpus = 1) in all its occurrences in 1-create-flow.py. This allocates GPU memory for only one processing instance at a time during flow creation. Unfortunately, it will also reduce processing speed.
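A minimal sketch of that change, assuming calculate_flow is the Ray task decorated in 1-create-flow.py (the body below is illustrative, not the actual code):

```python
import ray

ray.init()

# Before: @ray.remote(num_gpus=0.5) lets Ray schedule two flow tasks on the
# same GPU, so two RAFT instances allocate VRAM at the same time.
# After: num_gpus=1 reserves the whole GPU per task, so only one flow worker
# runs at a time, at the cost of throughput.
@ray.remote(num_gpus=1)
def calculate_flow(chunk_id):
    # ... load RAFT and compute the flow for this chunk of frames ...
    return chunk_id

# Tasks are now executed one after another on the single GPU.
results = ray.get([calculate_flow.remote(i) for i in range(2)])
```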

lec0dex commented 2 years ago

I have yet to evaluate the optimization made by @XinyingWang55 at https://github.com/XinyingWang55/FGVC/commit/28dcfe2b940120eb34c81387c12547e7cbaa3e0f and to see whether this fork can be further optimized, but that is not going to happen soon. So if you have the skills, you may try to implement it, run some tests, and make a pull request.

sonigfr commented 2 years ago

Hello lec0dex, thank you for your reply. Unfortunately, with @ray.remote(num_gpus = 1) nothing changes; I get the same CUDA out of memory error.

About your suggestion to rework XinyingWang55's code: I would do it with pleasure, but unfortunately I have no coding experience. I work in imaging and post-production and am fairly new to the Linux, Python and machine-learning environment; I have a good computer and network background, but coding in Python is my limit, I suppose. I did just test her code, and guess what? I also got a CUDA out of memory result. I can't see an Issues section on her project and don't seem to be able to open a pull request for this.

There are some great things emerging, and that is why I'm here. We run into a lot of CUDA out of memory errors in many video-oriented machine-learning projects once the image resolution exceeds what the VRAM of an NVIDIA CUDA card can handle. I understand that with the 8 GB of VRAM on my 1070 card I can't expect to exceed roughly 540p most of the time. I suppose that to process 1080p you need somewhere between 16 and 24 GB, maybe more depending on the code quality and the project, and for 4K even more, maybe between 32 and 64 GB. So either it is a beautiful gamble on a future where video cards with that much memory will be affordable and available, or the code has to be adapted, as you tried to do, to make it work, and gain better memory management and more popularity along the way.

Maybe there should also be other options, like systematically letting users choose a more convenient mode at the cost of much longer CPU processing time; I think that would be interesting for all of us. I'm confident that, step by step, all of this technology will find its way to a more accessible deployment; I think Conda and PyTorch are witnesses of this kind of effort. Different Linux distributions and versioning over time are also often an issue. I'm not coding, but I can talk a bit about it! ;)

Here is the new log in response to your suggestion, and thank you again for all your efforts:

(FGVC) user@computer:~/FGVC$ python 1-create-flow.py --mode object_removal --path ./data/tennis --path_mask ./data/tennis_mask --outroot ./result/tennis_removal \

/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
2021-08-27 16:58:21,454 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
1-create-flow.py:370: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(10, 1), fill_value=0, dtype=np.int)
1-create-flow.py:376: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #3
1-create-flow.py:378: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #4
1-create-flow.py:380: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  chunks=(imgH, imgW, 1), fill_value=0, dtype=np.bool)  #5
Importing frames: 100%|██████████| 50/50 [00:01<00:00, 25.15it/s]
Calculating flows:   0%|          | 0/49 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "1-create-flow.py", line 487, in <module>
    main(args)
  File "1-create-flow.py", line 452, in main
    video_completion_seamless(args)
  File "1-create-flow.py", line 400, in video_completion_seamless
    rayProgressBar(
  File "1-create-flow.py", line 43, in rayProgressBar
    ray.get(done)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/worker.py", line 1564, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::calculate_flow(0) (pid=3127, ip=10.8.202.138)
  File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "1-create-flow.py", line 203, in calculate_flow
    _, flow = model(prevf, nextf, iters=20, test_mode=True)
  File "/home/user/anaconda3/envs/FGVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/FGVC/RAFT/raft.py", line 108, in forward
    corr_fn = CorrBlock(fmap1, fmap2, radius=self.args.corr_radius)
  File "/home/user/FGVC/RAFT/corr.py", line 19, in __init__
    corr = CorrBlock.corr(fmap1, fmap2)
  File "/home/user/FGVC/RAFT/corr.py", line 60, in corr
    return corr / torch.sqrt(torch.tensor(dim).float())
RuntimeError: CUDA out of memory. Tried to allocate 3.91 GiB (GPU 0; 7.93 GiB total capacity; 4.09 GiB already allocated; 2.36 GiB free; 4.29 GiB reserved in total by PyTorch)
Calculating flows:   0%|          | 0/49 [00:04<?, ?it/s]
(FGVC) user@computer:~/FGVC$

lec0dex commented 2 years ago

I'm sorry. I reviewed the code and realized I had forgotten that I already implemented performance profiles for the GPU. I believe the workaround I suggested is useless, since at runtime the profile defaults to low if none is specified.

With this implementation I was able to work with 1080p using only 8 GB of GPU RAM at the low profile. It's unfortunate that it didn't work for you. I'll try it again, see if anything is different, and get back to you.
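For reference, the largest single allocation at 1080p is RAFT's all-pairs correlation volume, which is what the 3.91 GiB in your tracebacks corresponds to. A rough estimate, assuming RAFT's usual 1/8-resolution feature maps:

```python
# Back-of-the-envelope size of RAFT's all-pairs correlation volume for one
# 1080p frame pair, assuming standard 1/8-resolution feature maps.
H, W = 1080, 1920
h, w = H // 8, W // 8          # 135 x 240 feature grid
entries = (h * w) ** 2         # one correlation value per pair of grid cells
gib = entries * 4 / 1024 ** 3  # float32
print(f"{gib:.2f} GiB")        # ~3.91 GiB, matching the traceback
```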

As for profile usage, three options are available through the parameter --gpu_profile low|medium|high. This simply splits the work across one to four CPU-GPU workers, respectively. The total number of jobs running at the same time is limited by the number of CPU cores and the amount of G/DDR memory available, but the ceiling of 4 workers was hardcoded by me, since I didn't implement a check of the system resources available at runtime. On a more robust machine, the user can tune this directly in the code.
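As a rough illustration of that idea (the names and the exact profile-to-worker mapping below are hypothetical, not the actual code in 1-create-flow.py):

```python
# Hypothetical sketch: map a --gpu_profile value to a number of parallel
# CPU-GPU workers and the GPU fraction each Ray task would request.
GPU_PROFILES = {
    "low": 1,     # one worker: lowest VRAM pressure, slowest
    "medium": 2,  # two workers share the GPU
    "high": 4,    # hardcoded ceiling of four workers
}

def num_gpus_per_task(profile: str) -> float:
    """Fraction of the GPU each Ray task reserves under a given profile."""
    workers = GPU_PROFILES[profile]
    return 1.0 / workers  # e.g. "high" -> 0.25, so four tasks fit on one GPU

print(num_gpus_per_task("low"), num_gpus_per_task("high"))  # 1.0 0.25
```

On a machine with more VRAM and more CPU cores, raising the ceiling in a table like this is the kind of tuning meant by editing the code directly.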

I'll share a secret with you. This is the first Python project I ever messed with, and I got into it for exactly the same reason as you: I couldn't run the code on my low-end machine, so I got frustrated, spent six months trying to make it work, and this was the result. I guess it's only a matter of time before a more optimized and powerful version of this algorithm is implemented in software like DaVinci Resolve.

sonigfr commented 2 years ago

Hi lec0dex, thank you for your answer. All I can add is that I have an i7 3770K processor and 32 GB of RAM. If I really want to enjoy these kinds of new technologies I plan to upgrade my machine anyway, and you are right that they will start to be implemented in regular software in one form or another, and they can only get better and better. I suppose a good 24 or even 32 GB of video card VRAM is a good answer for all these "between research and production" solutions right now, and it is a joyful thing to discover in any case, because it is the future of graphical computing, whether you are working from home right now or not. Thanks for your work, and let's keep in touch.

MyaaMyaa commented 2 years ago

@XinyingWang55's fork works faster but produces a somewhat worse result. Makes sense; it really depends on the video. I don't know, it's all in Chinese. It says 8 GB of VRAM is enough, but it sounds like there's some kind of competition involved, and even the resolutions mentioned are low. 1080p still gives you CUDA out of memory. The fork requires installing CuPy. It could probably run more frames at lower resolutions than the original code.

XinyingWang55 commented 2 years ago

Our company (MGTV) organized a video inpainting competition this summer; all the video datasets we used come from TV shows. That is the background of my work (@XinyingWang55, https://github.com/XinyingWang55).

The FGVC pipeline is:

1. Compute dense optical flow (RAFT)
2. Compute edges (Canny)
3. Connect edges (EdgeConnect)
4. Inpaint the optical flow (solve Ax = b; I modified this part. Solving Ax = b is really slow, so I crop the image rather than using the whole image to build A and b; see the sketch after this list)
5. RGB color propagation
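An illustrative sketch of the cropping idea in step 4: solve the system only on a bounding box around the missing region, so A and b stay small. This is not code from either repository, and crop_to_mask is a made-up helper:

```python
import numpy as np

def crop_to_mask(flow, mask, pad=32):
    """Crop flow and mask to the bounding box of the masked region plus padding."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, mask.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, mask.shape[1])
    return flow[y0:y1, x0:x1], mask[y0:y1, x0:x1], (y0, y1, x0, x1)

# Dummy usage: a 1080p flow field with a 100 x 200 hole to fill.
flow = np.random.randn(1080, 1920, 2).astype(np.float32)
mask = np.zeros((1080, 1920), dtype=bool)
mask[400:500, 900:1100] = True
flow_crop, mask_crop, box = crop_to_mask(flow, mask)
# A linear system assembled from flow_crop/mask_crop is far smaller than one
# assembled from the full frame, so the sparse solve runs much faster.
```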

I modified FGVC a little and provided it to our participants as a tutorial for beginners. FGVC is composed of several algorithms, and I believe it is more beginner-friendly than some end-to-end models like STTN (which is hard to train); that is why I chose FGVC as the baseline for our competition. In the original work, GPU memory usage stays below 8 GB, which is enough for our competition. In my version, CPU (host) memory usage on our dataset is sometimes larger than 8 GB (depending on the size of the missing region), even though I make A, x, and b smaller.

MyaaMyaa commented 2 years ago

Yeah, that's what I gathered from auto-translation. I hoped I could use your fork on 1080p frames, since the original code loads everything into VRAM and can't handle high resolutions. I've found a way to cut corners by cropping frames to the area where I want an object removed, plus enough of the surrounding background for FGVC to source from, but that obviously can't work on every piece of footage. "Sometimes larger"? So 1080p frames can be handled by your fork under some circumstances? If so, how do I make that happen?

Thanks for the fork anyway. FGVC does a far better job handling complex shots with minimal input than BorisFX Mocha, which is old-fart software where you have to do everything manually and try some ooga-booga method to see if you can get a better result on difficult shots, and which has seen no changes in 20 years. No other software does any better. Your fork makes FGVC do its job faster, though not as well, but compared to Mocha that's still a win, and that's what I need sometimes.

XinyingWang55 commented 2 years ago

Our frame size is 576x1024 and each video contains ~140 frames. The maximum VRAM usage is below 8 GB and the average is around 3-5 GB. I have to say, my code doesn't reduce VRAM usage; I still use the whole image to calculate the optical flow and complete the flow edges.