tissue-forge / tissue-forge

Interactive, particle-based physics, chemistry and biology modeling and simulation environment

GPU-Acceleration Configuration #70

Open dulloa21 opened 1 month ago

dulloa21 commented 1 month ago

Hello! I am currently working with a version of Tissue Forge installed with GPU acceleration (tf.has_cuda = 1); however, the system monitor shows GPU usage at 0% while a simulation is running. I tried running cuda_config_bonds = tf.Simulator.getCUDAConfig().bonds from the documentation, but I receive the following error: 'SimulatorInterface' object has no attribute 'getCUDAConfig'. How can I resolve this issue and begin offloading work to the GPU? I am currently running a 100x100 area simulation with about 150 objects (excluding bonds).

Any help is greatly appreciated. Thank you!!

tjsego commented 1 month ago

Try replacing tf.Simulator.getCUDAConfig() with tf.Simulator.cuda_config.
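A minimal sketch of the suggested access pattern (`cuda_config` is an attribute, not a method; this assumes a CUDA-enabled Tissue Forge build, and the import guard lets the snippet run where the package is absent):

```python
# Hedged sketch: access the runtime CUDA configuration via the attribute
# suggested above, instead of the nonexistent getCUDAConfig() method.
try:
    import tissue_forge as tf
except ImportError:
    tf = None  # Tissue Forge not installed in this environment

cuda_config_bonds = None
if tf is not None and getattr(tf, "has_cuda", 0):
    cuda_config_bonds = tf.Simulator.cuda_config.bonds  # not getCUDAConfig().bonds
```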

Also, depending on how many bonds are in your simulation, you may not see performance improvement from GPU acceleration. Typically bonded interactions don't benefit much from GPU acceleration when there aren't many bonds.

dulloa21 commented 1 month ago

Hi, thank you for your help. I used tf.Simulator.cuda_config and that works well. However, tf.Simulator.cuda_config.bonds says there is no such attribute, and I do not see it in the attribute list for cuda_config. Is there another way to access bonds? I actually plan on having possibly several hundred bonds, so this acceleration would be helpful.

tjsego commented 1 month ago

What is the output of print(type(tf.Simulator.cuda_config), dir(tf.Simulator.cuda_config))?

dulloa21 commented 1 month ago

The output is <class 'NoneType'> ['__bool__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

tjsego commented 1 month ago

This result strongly suggests that CUDA support wasn't built. What is the value of tf.has_cuda?

dulloa21 commented 1 month ago

The value is 1.


tjsego commented 1 month ago

Thanks. Can you confirm that the example cell_sorting_cuda.py utilizes the GPU? I'm trying to determine whether this issue is particular to the build environment, Python API, or something else entirely.

dulloa21 commented 1 month ago

Hi. I ran the example code and the simulation itself runs fine; however, there is an error with the benchmark. Running the benchmark repeatedly prints the following before the kernel dies:

```
** Benchmarking

Sending engine to GPU... -2147467259
```

It does not reach any of the other benchmarks.

tjsego commented 1 month ago

Thanks, that's helpful. Seems like there's an issue with the build, but let's confirm with the following. Right after the call to tf.init, add tf.Logger.enableFileLogging('tf_log.txt', tf.Logger.TRACE) and then repeat the demo. The demo will create a file tf_log.txt in the current working directory that contains information about what was executed and (hopefully) what went wrong. Can you complete these steps and upload the contents of the created file?
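The logging setup described above could look like the following (a hedged sketch; the domain size passed to tf.init is an illustrative placeholder, not from this thread, and the import guard makes the snippet harmless without a Tissue Forge install):

```python
# Hedged sketch: enable TRACE-level file logging immediately after tf.init,
# so execution details are written to tf_log.txt in the working directory.
try:
    import tissue_forge as tf
except ImportError:
    tf = None  # Tissue Forge not installed in this environment

if tf is not None:
    tf.init(dim=[10.0, 10.0, 10.0])  # example domain size, not from the thread
    tf.Logger.enableFileLogging('tf_log.txt', tf.Logger.TRACE)
```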

dulloa21 commented 1 month ago

I have attached the log file below. tf_log (1).txt

tjsego commented 4 weeks ago

Thanks. The following errors were reported:

```
ERROR: Code: -2147467259, Msg: system has unsupported display driver / cuda driver combination, File: /opt/conda/tissue-forge/source/mdcore/src/tfEngine_cuda.cu, Line: 0, Function: HRESULT TissueForge::cuda::engine_cuda_setdevice(TissueForge::engine*, int), func: HRESULT TissueForge::errSet(HRESULT, const char*, int, const char*, const char*), file:/opt/conda/tissue-forge/source/tfError.cpp,lineno:73

ERROR: Code: -2147467259, Msg: Failed to set device., File: /opt/conda/tissue-forge/source/cuda/tfEngineConfig.cpp, Line: 0, Function: HRESULT TissueForge::cuda::EngineConfig::setDevice(int), func: HRESULT TissueForge::errSet(HRESULT, const char*, int, const char*, const char*), file:/opt/conda/tissue-forge/source/tfError.cpp,lineno:73
```

It looks like you may need to update your drivers or find some compatible versions. This one is pretty tough for me to help with in general, but one easy fix may be available: what GPU are you using, and what did you set CUDAARCHS to for the build?

dulloa21 commented 3 weeks ago

There are two possible GPUs, as Tissue Forge is running on a cluster: an NVIDIA L40 (driver version 550.90.07) or an NVIDIA A100 (driver version 550.76). Tissue Forge was built with CUDAARCHS set to 80.

tjsego commented 3 weeks ago

OK, so Tissue Forge was built to target the A100 (compute capability 8.0). If the L40 is the default device (device 0), then you may need to target the A100 manually.

Unfortunately, the online docs are populated from the API without CUDA support, so the CUDA interface doesn't have supporting API docs for the Python interface. But here are the API features to try this fix:

- tf.cuda.getNumDevices returns the number of available GPUs; for you, it will likely return 2.
- tf.cuda.getDeviceName returns the name corresponding to a passed device id integer. The default device has id 0, so the second device (hopefully the A100) likely has device id 1.
- tf.Simulator.cuda_config.bonds and tf.Simulator.cuda_config.engine both have the following methods:

My guess is that you can verify the device id of the A100 with tf.cuda.getDeviceName and pass it to setDevice of the module you want to run on the A100. You can also adjust your build to target both GPUs by adding the compute capability of the L40 (8.9).
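For the build-side fix, the architecture list could look like the following (CUDAARCHS is the variable named earlier in this thread; 8.0 is the A100's compute capability and 8.9 the L40's):

```shell
# Target both GPUs at build time: A100 (compute capability 8.0) and L40 (8.9).
export CUDAARCHS="80;89"
```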

Let me know how that goes.
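Putting the steps above together, a hedged sketch of manual device selection (getNumDevices and getDeviceName come from this thread; a setDevice method on the cuda_config sub-modules is an assumption about the CUDA-enabled API):

```python
# Hedged sketch: find a GPU by name and route CUDA work to it. The import
# guard keeps the snippet runnable where Tissue Forge is not installed.
try:
    import tissue_forge as tf
except ImportError:
    tf = None  # Tissue Forge not installed in this environment

def pick_device(target):
    """Return the id of the first device whose name contains `target`, else None."""
    if tf is None:
        return None
    for dev_id in range(tf.cuda.getNumDevices()):
        if target in tf.cuda.getDeviceName(dev_id):
            return dev_id
    return None

if tf is not None:
    dev = pick_device("A100")
    if dev is not None:
        # Assumed API: direct bonded interactions and the engine to that GPU.
        tf.Simulator.cuda_config.bonds.setDevice(dev)
        tf.Simulator.cuda_config.engine.setDevice(dev)
```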

dulloa21 commented 3 weeks ago

Hi, thank you for the help. I went ahead and ran the getNumDevices function and, interestingly enough, it returned 0. The getDeviceName function also returned \x06 when checking the default device.

tjsego commented 3 weeks ago

OK, would it be possible to package and share the build log? If you're uncomfortable with uploading it here, direct email would work: timothy (dot) sego [at] medicine (dot) ufl (dot) edu. There are a number of ways you could get this result (especially since you're building on HPC) that might be best determined by reviewing the details of the build.