CUDA problems with v1 - Githubissues

lhnguyen102 / cuTAGI

CUDA implementation of Tractable Approximate Gaussian Inference

MIT License


CUDA problems with v1 #48

Closed miquelflorensa closed 6 months ago

miquelflorensa commented 7 months ago

@lhnguyen102, I'm having trouble with the Diffusion model algorithm in CUDA on version 1. I haven't really dug into my code to see if there are any mistakes, but I wanted to mention the issue here so we have a record of it.

You can find the code I'm using in my forked version of cuTAGI: diffuser_v1.py. To run it, just use this script: diffusion_v1.py.

The code works fine on the CPU (MSE about 0.3), but the results with CUDA don't make sense (MSE stuck around 1.0). The code is exactly the same; the only change is: self.network.to_device("cuda")

I might have made some mistakes, so don't worry too much about it for now. I'll keep you posted on whether I figure it out or not.

lhnguyen102 commented 7 months ago

@miquelflorensa I am really glad that you tried out the newest version :) I am currently debugging this version as well. You're right, the results should not differ that much. I'll take a look at this problem ASAP. In the meantime, let me know if you find anything.

lhnguyen102 commented 6 months ago

@miquelflorensa I found the bug. Could you please replace this line with the following?

cu_obs->to_device();

// Reset delta to zero
cu_delta_states->reset_zeros();

The problem was that cu_output_states->to_device(); copied the initial host values of output_states onto the device. The forward-pass results live on the device and are never copied back to the host, so that upload overwrote them. I've tested your code, and it should work now.
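To make the overwrite concrete, here is a minimal sketch of the failure mode in plain Python. It is only an illustration: MirroredBuffer, forward_on_device, and update_step are hypothetical names, not cuTAGI classes or functions, and the "device" is just a second Python list.

```python
class MirroredBuffer:
    """Toy buffer with a host copy and a device copy (hypothetical, not cuTAGI)."""

    def __init__(self, size):
        self.host = [0.0] * size    # initial host data: zeros
        self.device = [0.0] * size  # stand-in for GPU memory

    def to_device(self):
        # Host -> device copy: replaces whatever is currently on the device.
        self.device = list(self.host)


def forward_on_device(buf):
    # Pretend a kernel wrote predictions directly into device memory.
    # The host copy is NOT updated, mirroring the situation in the issue.
    buf.device = [0.5, 1.5, 2.5]


def update_step(upload_output_states):
    """Return the MSE between device predictions and device observations."""
    output_states = MirroredBuffer(3)
    obs = MirroredBuffer(3)
    obs.host = [1.0, 2.0, 3.0]

    forward_on_device(output_states)

    if upload_output_states:
        # Buggy path: stale host zeros overwrite the forward-pass results.
        output_states.to_device()

    # Observations come from the host, so they do need to be uploaded.
    obs.to_device()

    errors = [(p - y) ** 2 for p, y in zip(output_states.device, obs.device)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    print("without extra upload:", update_step(False))
    print("with extra upload:   ", update_step(True))
```

In this toy setup the error is small when only the observations are uploaded and much larger when the output states are uploaded too, because the predictions are replaced by the untouched host values.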

miquelflorensa commented 6 months ago

@lhnguyen102 I still get the same result. Did you try running my diffusion_v1.py code? Do you remember whether its output was around 0.34? I changed that line and recompiled, but I'm still getting a result around 1, which is not correct.

lhnguyen102 commented 6 months ago

Yes, I debugged on diffusion_v1.py, and yes, that is what I got: the MSE is around 0.34. I will run it again and post a screenshot here.

lhnguyen102 commented 6 months ago

@miquelflorensa I confirmed that it reached an MSE of around 0.3007. Make sure you recompile after changing the code. Here is the command I used to run your code:

cd pytagi/pytagi_v1
python diffusion_v1.py

Screenshot from 2024-03-10 08-27-12

miquelflorensa commented 6 months ago

Thank you @lhnguyen102! It works now. I was having some trouble with my current conda environment, but I created a new one and now it works :)
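When a rebuilt extension seems to have no effect, one quick check is to see which file the active interpreter would actually import. The sketch below uses only the standard library; module_origin is a hypothetical helper name, and the package name "pytagi" is an assumption taken from the directory path in this thread.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # "pytagi" is assumed to be the installed package name.
    # A path inside an old environment suggests a stale build is being imported.
    print(module_origin("pytagi"))
```

Running this inside each conda environment shows whether the environment picks up the freshly compiled build or an older installed copy.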