lhnguyen102 / cuTAGI

CUDA implementation of Tractable Approximate Gaussian Inference
MIT License

cannot use "cuda" with the python codes #46

Closed bhargobdeka closed 4 months ago

bhargobdeka commented 5 months ago
[Screenshot, 2024-02-07: error message]

It says that the data could not be transferred to the device. Let me know if there is a workaround for this. Thanks!

bhargobdeka commented 5 months ago

I will try to create the same for the new pyTAGI, as it seems to run well on both CPU and CUDA.

lhnguyen102 commented 5 months ago

@miquelflorensa and @jamesgoulet, are you facing the same issue? I am off the grid for a couple of days; I will address the issue when I am back if it isn't resolved by then.

jamesgoulet commented 5 months ago

@lhnguyen102 Yes, I have tried and I run into the same issue. I can run on GPU with both the C++ code and pyTAGI_v1. If you plan to deprecate the old version once pyTAGI_v1 is finalized, we can wait for the new one...

lhnguyen102 commented 4 months ago

I will take a look at it for a quick fix. In general, you don't really need CUDA for two fully-connected layers of 50 hidden units.

jamesgoulet commented 4 months ago

πŸ‘ŒπŸ™. I think that @bhargobdeka idea was to create config files for the large's UCI regression dataset so that we can post a repeatable baseline.

bhargobdeka commented 4 months ago

Hi Ha,

Hoping I am not disturbing you in the middle of your vacation :-)

It would be nice to have all the benchmark results available in the pyTAGI package for both CPU and CUDA. I will also try to add the other methods, e.g., deep ensembles, MC dropout, PBP, Laplace, etc., to the Python package so that we have a holistic comparison on the regression datasets, both small and large. The large ones need to run with CUDA, as they would take forever otherwise. I will also try to fix the CUDA problem together with Miquel.

Regards,
Bhargob Deka
Postdoctoral Researcher, Department of Civil, Geological and Mining Engineering, Polytechnique Montréal, QC
LinkedIn: https://www.linkedin.com/in/bhargob-deka-ph-d-87aba7117/

lhnguyen102 commented 4 months ago

No worries, and understood! @jamesgoulet, how is CUDA installed on the Ubuntu server? It seems that CUDA is not installed, at least for my account:

(pytagi) lhn@MLCIVS1:~/lhn_tagi/cuTAGI$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

I could install it if needed, but good practice is to install a single version for all users, because a CUDA installation is large. In addition, I found these instructions to be the best way to install CUDA on Ubuntu 22.04; I have done that before for @miquelflorensa on my desktop. Note that the CUDA version should be 12.2 because of the server's current NVIDIA driver:

nvidia-smi
Tue Feb 13 04:01:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
lhnguyen102 commented 4 months ago

Nevermind! I figured it out

lhnguyen102 commented 4 months ago

@lhnguyen102 Yes, I have tried and I run into the same issue. I can run on GPU with both the C++ code and pyTAGI_v1. If you plan to deprecate the old version once pyTAGI_v1 is finalized, we can wait for the new one...

@jamesgoulet, I couldn't even run the newest version on GPU on the Ubuntu server from the main branch. It gave some weird error... If I recall correctly, everything works just fine on my desktop (I don't have access to it right now for testing). Could you please test it again? Thanks. Here is my terminal command:

build/main test_fc_mnist

Please make sure to uncomment this line in order to run the test on GPU.

lhnguyen102 commented 4 months ago

I might have a hint. Here is the error

Program hit cudaErrorUnsupportedPtxVersion (error 222) due to "the provided PTX was compiled with an unsupported toolchain." on CUDA API call to cudaGetLastError.

Here is my cuda version when compiling the code

...
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- DEVICE -> CUDA
...

There might be a mismatch between the current CUDA version and the driver version... my guess is that the current CUDA version is 12.3, but the current driver on the server is most likely only compatible with CUDA up to 12.2. If your GPU test works, then my assumption above is wrong.
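To make the compatibility rule concrete, here is a minimal sketch (not cuTAGI code; the function name is made up for illustration): PTX compiled by a toolkit newer than the CUDA version the driver supports triggers error 222 (`cudaErrorUnsupportedPtxVersion`), while an equal or older toolkit is fine.

```python
def is_ptx_compatible(toolkit: str, driver_cuda: str) -> bool:
    """True if PTX built with `toolkit` should load under a driver
    supporting CUDA `driver_cuda` (compare major.minor numerically)."""
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return as_tuple(toolkit) <= as_tuple(driver_cuda)

print(is_ptx_compatible("12.3", "12.2"))  # False: the mismatch on the server
print(is_ptx_compatible("12.1", "12.2"))  # True: older toolkit is fine
```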

jamesgoulet commented 4 months ago

@lhnguyen102 Yes, I have tried and I run into the same issue. I can run on GPU with both the C++ code and pyTAGI_v1. If you plan to deprecate the old version once pyTAGI_v1 is finalized, we can wait for the new one...

@jamesgoulet, I couldn't even run the newest version on GPU on the Ubuntu server from the main branch. It gave some weird error... If I recall correctly, everything works just fine on my desktop (I don't have access to it right now for testing). Could you please test it again? Thanks. Here is my terminal command:

build/main test_fc_mnist

Please make sure to uncomment this line in order to run the test on GPU.

[Screenshot, 2024-02-13: successful GPU test run]

It works for me through both the Python API and C++.

jamesgoulet commented 4 months ago

No worries, and understood! @jamesgoulet, how is CUDA installed on the Ubuntu server? It seems that CUDA is not installed, at least for my account:

(pytagi) lhn@MLCIVS1:~/lhn_tagi/cuTAGI$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

I could install it if needed, but good practice is to install a single version for all users, because a CUDA installation is large. In addition, I found these instructions to be the best way to install CUDA on Ubuntu 22.04; I have done that before for @miquelflorensa on my desktop. Note that the CUDA version should be 12.2 because of the server's current NVIDIA driver:

nvidia-smi
Tue Feb 13 04:01:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |

Initially, @miquelflorensa installed it, but it seems it was only installed locally for each user... I had to reinstall it as well, following the same instructions you posted.

miquelflorensa commented 4 months ago

In theory, the CUDA toolkit should be installed globally on the server; it is just that in the binding you need to specify that the version is 12.1. And as you mentioned, there could be a mismatch between the CUDA driver version, which is 12.2, and the CUDA toolkit version, which is 12.1.

lhnguyen102 commented 4 months ago

@miquelflorensa CUDA 12.1 works just fine (see @jamesgoulet's comment). The issue is that the default CUDA version on the server is 12.3, which is not compatible with the current driver. The rule of thumb is that the toolkit must either match 12.2 or be a bit lower. Version 12.3 appears in my case because I added the following exports to ~/.bashrc; I haven't installed any new CUDA version on the server.

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

I will find a workaround to fix it, but the long-term solution is to have a single working CUDA version for all users. In this case, it could be 12.1 or 12.2 (I'd prefer 12.2). The other versions should be removed to avoid any further issues.
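One way to avoid silently picking up an unintended default is to pin the exports to an explicit version rather than the `/usr/local/cuda` symlink. This is a sketch, assuming the versioned toolkits live under `/usr/local/cuda-X.Y` as the standard Linux installers place them:

```shell
# Pin the toolkit explicitly instead of relying on the /usr/local/cuda
# symlink, which may point at a newer, driver-incompatible version.
export PATH=/usr/local/cuda-12.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
```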

miquelflorensa commented 4 months ago

@lhnguyen102 I see. I removed every installed CUDA version except 12.1 for now.

lhnguyen102 commented 4 months ago

@miquelflorensa Thank you! It all works now

lhnguyen102 commented 4 months ago

@bhargobdeka I found a quick workaround for your problem. Please replace this line with the following code:

ud_idx_batch = np.zeros((batch_size, 1), dtype=np.int32)

The reason is that the original code creates ud_idx_batch as an empty array, which is not handled properly in the CUDA backend. I will fix this issue in my upcoming PR. Let me know if it works.
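For reference, the failing and working setups can be sketched like this (ud_idx_batch is the update-index buffer passed to the backend; the other names and the batch size are illustrative, not from the cuTAGI source):

```python
import numpy as np

batch_size = 16  # illustrative value

# What the original line effectively produced: a zero-size array. The CUDA
# backend then has no valid buffer to copy to the device.
ud_idx_empty = np.empty((0,), dtype=np.int32)

# Workaround: a zero-filled (batch_size, 1) int32 array, so the device
# transfer gets a buffer with a well-defined, non-zero size.
ud_idx_batch = np.zeros((batch_size, 1), dtype=np.int32)

print(ud_idx_empty.size)   # 0
print(ud_idx_batch.shape)  # (16, 1)
```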

lhnguyen102 commented 4 months ago

@bhargobdeka it is fine if you want to stick with the old version (I will transfer your code to the new version later). It would be great if you could use the newest version, which is better organized and has a PyTorch-like API, but the catch is that there might be bugs here and there.

bhargobdeka commented 4 months ago

I will give it a try tomorrow. Thanks! I created the regression benchmark with the new pytagi version. It seems to work on both CPU and CUDA without error, but the results are not there yet. I will check whether I made any mistakes; otherwise I will discuss with you what I did.

Bhargob


bhargobdeka commented 4 months ago

Yes, and I am not good at C++ at all :/

If the old code works, then I can generate all the benchmarks really quick with cuda.


lhnguyen102 commented 4 months ago

Both the old code and the new one have a Python API. For the benchmark, you won't need to touch the C++/CUDA part (see the link). Again, it is up to you.

bhargobdeka commented 4 months ago

Yes, I got that part. I will let you know tomorrow morning whether your suggestion worked.


bhargobdeka commented 4 months ago

Hi Ha,

The previous pytagi version works on both CPU and CUDA without errors, so first of all, kudos for that. However, I do not see an advantage in training time with CUDA. See the screenshots below: the CPU run (1.04 s) is faster than the CUDA run (1.23 s). Am I doing something wrong?

Let me know if there are additional things that I missed.

Thanks, Bhargob


bhargobdeka commented 4 months ago

However, it is much faster with CUDA for larger networks. Here I tried 2 layers with 1000 hidden units each and can clearly see the difference it makes.
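A rough back-of-envelope sketch of why the crossover happens (the batch size here is an assumed value, not taken from the benchmark config): the GPU pays a roughly fixed launch-and-transfer cost per batch, so only layers doing enough arithmetic amortize it.

```python
def fc_layer_flops(n_in: int, n_out: int, batch: int) -> int:
    """Multiply-accumulate count for one fully-connected forward pass."""
    return 2 * n_in * n_out * batch

small = fc_layer_flops(50, 50, 16)      # ~ the 50-unit benchmark layers
large = fc_layer_flops(1000, 1000, 16)  # ~ the 1000-unit layers
print(large // small)  # 400: the wide net does 400x more work per layer,
                       # enough to hide the fixed GPU overhead
```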
