tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Bigger-than-memory ops should automatically fall back to RAM and/or disk in tf-gpu #29840

Closed bionicles closed 4 months ago

bionicles commented 5 years ago

System information

Describe the feature and the current behavior/state. We built an Atomic Transformer with Keras, and if we try to simulate too many atoms, like a big antibody or enzyme with 50,000 atoms, or a large metabolic network in systems biology, it crashes because the GPU memory of the 1070 Ti fills up and there is no backup plan for when that happens.

I believe this is an issue for many advanced uses of TF, because we sometimes need to process "big data" larger than the VRAM, and Dask is the only available option. Dask is often extremely slow and still crashes, plus we have to convert tensors into Dask arrays, which is a huge hassle. Just imagine if TF-GPU intelligently predicted memory usage and used CPU/RAM/drive resources as needed.

Obviously this should print performance warnings in the console and would be slower, but at least there should be some option for tf-gpu to keep working when the workload doesn't fit in VRAM.

Will this change the current API? How? TF-GPU would either automatically scale beyond GPU/TPU memory, or there would be some way to tell it to do this in a Python script, e.g. tf.enable_larger_than_memory()

Who will benefit from this feature? Everyone who pushes TF-GPU to the breaking point

Any other info. Ubuntu 16.04, NVIDIA GTX 1070 Ti

smatzek commented 5 years ago

Training or inferencing with neural networks on GPUs is difficult when the neural network, batch size, or data is large. There is more than just data on the GPU during processing: the neural network, the GPU kernels, the data, and the input and output tensors of each operation all need to fit in GPU memory. Even when focusing on a single GPU operation, both the input and output tensors need to fit, and this alone can sometimes be a problem.

A Python module exists that could possibly help your case: TensorFlow Large Model Support (TFLMS). TFLMS provides an approach to training large models and data that cannot normally fit into GPU memory. It takes a computational graph defined by the user and automatically adds swap-in and swap-out nodes for transferring tensors from the GPU to the host and vice versa. During training and inference, this makes graph execution behave like operating system memory paging: system memory is effectively treated as a paging cache for GPU memory, and tensors are swapped back and forth between the two.
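
For intuition, the swap-in/swap-out mechanism can be sketched by hand in TF 1.x graph mode: an identity op pinned to the CPU copies a tensor out to host RAM, and referencing it from a later GPU op copies it back. This is only an illustration of the idea, not the TFLMS API itself, which inserts such nodes automatically:

import tensorflow as tf  # TF 1.x, graph mode

a = tf.random.normal([8192, 8192])  # resident in GPU memory

# "Swap out": an identity op placed on the CPU copies the tensor to host RAM.
with tf.device('/CPU:0'):
    a_host = tf.identity(a)

# ... other GPU work can run here while the copy is not resident on the GPU ...

# "Swap in": consuming the host tensor from a GPU op copies it back.
with tf.device('/GPU:0'):
    b = tf.matmul(a_host, a_host)

with tf.Session() as sess:
    sess.run(b)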

TFLMS is included in the IBM Watson Machine Learning Community Edition, which is free to use, and can be installed as a conda package on both the x86 and ppc64le platforms. The conda channel URL and other documentation for TFLMS can be found here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_tflmsv2.html

More information about TFLMS, including the papers, videos, and blogs can be found here: https://developer.ibm.com/linuxonpower/2019/06/11/tensorflow-large-model-support-resources/

aaroey commented 5 years ago

@bionicles does setting per_process_gpu_memory_fraction to a value > 1.0 work for you?

Also @jaingaurav

bionicles commented 5 years ago

Thank you for the advice, we will test these ideas and report back

jaingaurav commented 5 years ago

@aaroey: We haven't exposed the unified_memory option in the tf.config API namespace yet. Seems like this is needed for 2.0?
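
For reference, the TF 1.x-style config already carries this knob on the experimental GPU options; a minimal sketch (field name as in tensorflow/core/protobuf/config.proto; availability in a given release is an assumption to verify):

import tensorflow.compat.v1 as tf1

config = tf1.ConfigProto()
# Explicitly request CUDA unified memory so allocations can exceed physical VRAM.
config.gpu_options.experimental.use_unified_memory = True
# Setting the fraction above 1.0 also implies unified memory.
config.gpu_options.per_process_gpu_memory_fraction = 2.0
sess = tf1.Session(config=config)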

bionicles commented 5 years ago

@jaingaurav @aaroey @smatzek @ymodak @gadagashwini

Yes! How do you set this in TF 2.0? We're maxing out on memory and can't find the per_process_gpu_memory_fraction option since we switched from 1.14 to 2.0.0b1.

bionicles commented 5 years ago

Sometimes we can do 30,000 atoms. If there were GPU support for bfloat16 and we could unify memory, we could push to bigger molecules.

bionicles commented 5 years ago

Actually, I just tested it; it looks like we can indeed use bfloat16 on GPU...

bionicles commented 5 years ago

Never mind, there are issues with it: there's no "squared difference" kernel for bfloat16 on GPU.
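
A possible workaround until a fused kernel exists, assuming the elementwise subtract and multiply kernels do have bfloat16 GPU support: compute the squared difference from its definition, or cast just this op to float32:

import tensorflow as tf

x = tf.cast(tf.random.normal([1024]), tf.bfloat16)
y = tf.cast(tf.random.normal([1024]), tf.bfloat16)

# Unfused equivalent of tf.math.squared_difference(x, y).
d = x - y
sq = d * d

# Or pay for two casts and use the fused float32 kernel.
sq32 = tf.math.squared_difference(tf.cast(x, tf.float32), tf.cast(y, tf.float32))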

bionicles commented 5 years ago

30k atoms works with tf.float16 though!
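
For anyone trying the same with Keras layers rather than manual casts: newer TF releases (2.4+) expose a global mixed-precision policy, and earlier 2.x versions had a similar API under tf.keras.mixed_precision.experimental. A sketch against the 2.4+ API:

import tensorflow as tf

# Compute in float16, keep variables in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    # Keep the final layer's output in float32 to avoid overflow in the loss.
    tf.keras.layers.Dense(1, dtype='float32'),
])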

aaroey commented 4 years ago

@sanjoy could you help to find an owner of this issue? Thanks

sanjoy commented 4 years ago

Assigned to Gaurav since it looks like the core issue is https://github.com/tensorflow/tensorflow/issues/29840#issuecomment-507066169?

JF-D commented 4 years ago

I came here from #43049 and followed https://stackoverflow.com/questions/58025069/how-to-enable-cuda-unified-memory-in-tensorflow-v2. With the following configuration, I can use unified memory for single-GPU training (my machine has 8 GPUs).

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
# A fraction > 1.0 oversubscribes GPU memory by enabling CUDA unified memory.
config.gpu_options.per_process_gpu_memory_fraction = 2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

But for multi-GPU training, it reports the following error (screenshot omitted). I use the Horovod synthetic benchmark: https://github.com/horovod/horovod/blob/master/examples/tensorflow2_synthetic_benchmark.py. Can you offer some help? Thanks a lot.
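
For context, the benchmark pins one GPU per process in the usual Horovod fashion; combining that pinning with the unified-memory config looks roughly like the sketch below (untested, and not a confirmed fix for the error above):

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf1

hvd.init()

config = tf1.ConfigProto()
# Each process sees exactly one GPU, selected by its local rank.
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Oversubscribe via unified memory, as in the single-GPU case.
config.gpu_options.per_process_gpu_memory_fraction = 2.0
session = tf1.InteractiveSession(config=config)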

GF-Huang commented 3 years ago

Does TFLMS support TPU training?

smatzek commented 3 years ago

The latest TFLMS versions, which support TensorFlow 2, do not support TPUs, as they are built directly into the memory allocator code for GPUs. TFLMSv2, "version 2" of the Python graph-modification LMS, could possibly be made to work with TPUs, but it only supports graph mode with TensorFlow 1.x.

https://github.com/IBM/tensorflow-large-model-support

GF-Huang commented 3 years ago

> The latest TFLMS versions, which support TensorFlow 2, do not support TPUs, as they are built directly into the memory allocator code for GPUs. TFLMSv2, "version 2" of the Python graph-modification LMS, could possibly be made to work with TPUs, but it only supports graph mode with TensorFlow 1.x.

And does LMS support tf-2.3.x?

smatzek commented 3 years ago

@GF-Huang We open sourced all versions of TensorFlow LMS, including the version for 2.2.0. Unfortunately, the dev team that worked on it has no current plans to release versions for newer TensorFlow releases. Having another community member pick it up, work on it, and potentially get it merged into the main TensorFlow repository would be welcomed.

chaithyagr commented 1 year ago

> @GF-Huang We open sourced all versions of TensorFlow LMS, including the version for 2.2.0. Unfortunately, the dev team that worked on it has no current plans to release versions for newer TensorFlow releases. Having another community member pick it up, work on it, and potentially get it merged into the main TensorFlow repository would be welcomed.

@smatzek thank you for your code. I am actively setting this up on my side now and hope everything runs smoothly.

I actively work with large models and would be interested in merging this in some form. Can you please point me to the repository? (Not just the patch files, but perhaps a repository with the modified tensorflow-gpu code?)

smatzek commented 1 year ago

@chaithyagr, I don't have a GitHub repository with the patches applied. To get a full git repo with the code, you could clone the tensorflow repository, check out the 2.2.0 tag, and then apply the patch. That should let you build tensorflow-gpu at the 2.2.0 level with this functionality.

chaithyagr commented 1 year ago

@smatzek Yes indeed, that's probably the best way forward, but then it gets really hard to move on from there, right? I expect a lot of changes have happened to these files in the meantime.

The thing is, I really want to use this, but most of my code (even some code under GOOGLE_CUDA) is based on tensorflow-2.9, so it would mean recompiling it all against TensorFlow 2.2 with these patches.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 1 year.