Closed: bhargobdeka closed this issue 4 months ago.
Will try to create the same for the new pyTAGI, since it seems to be running well on both CPU and CUDA.
@miquelflorensa and @jamesgoulet, are you facing the same issue? I am off the grid for a couple of days; I will address the issue when I am back if it hasn't been resolved by then.
@lhnguyen102 Yes, I have tried and I ran into the same issue. I can run on GPU with both the C++ code and pyTAGI_v1. If you plan to deprecate the old version once pyTAGI_v1 is finalized, we can wait for the new one...
I will take a look at it for a quick fix. In general, you don't really need CUDA for two fully-connected layers of 50 hidden units each.
👍👍. I think that @bhargobdeka's idea was to create config files for the large UCI regression datasets so that we can post a repeatable baseline.
Hi Ha,
Hoping I am not disturbing you in the middle of your vacation :-)
It will be nice to have all the benchmark results available in the pyTAGI package with both CPU and CUDA. I will also try to add the other methods, e.g., deep ensembles, MC dropout, PBP, Laplace, etc., to the Python package so that we have a holistic comparison for the regression datasets, both small and large. The large ones need to run with CUDA, as it would take forever otherwise. I will also try to see how to fix the CUDA problem together with Miquel.
Regards, Bhargob Deka, Postdoctoral Researcher, Department of Civil, Geological and Mining Engineering, Polytechnique Montréal, QC. LinkedIn: https://www.linkedin.com/in/bhargob-deka-ph-d-87aba7117/
No worries and understood! @jamesgoulet, how is CUDA installed on the Ubuntu server? It seems to me that CUDA has not been installed, at least for me:
(pytagi) lhn@MLCIVS1:~/lhn_tagi/cuTAGI$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
I could install it if needed, but good practice is to install one version for all users, because a CUDA installation is large. In addition, I found these instructions to be the best for installing CUDA on Ubuntu 22.04; I have done that before for @miquelflorensa on my desktop. Note that the CUDA version should be 12.2 because of the server's current NVIDIA driver:
nvidia-smi
Tue Feb 13 04:01:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
Nevermind! I figured it out
@jamesgoulet, I couldn't even run the newest version on GPU on the Ubuntu server from the main branch. It gave a weird error... If I recall correctly, everything works just fine on my desktop (I don't have access to it right now for testing). Could you please test it again? Thanks. Here is my command in the terminal:
build/main test_fc_mnist
Please make sure to uncomment this line in order to run the test on GPU.
I might have a hint. Here is the error
Program hit cudaErrorUnsupportedPtxVersion (error 222) due to "the provided PTX was compiled with an unsupported toolchain." on CUDA API call to cudaGetLastError.
Here is my CUDA version when compiling the code:
...
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- DEVICE -> CUDA
...
There might be a mismatch between the current CUDA version and the driver version... My guess is that the current CUDA toolkit is 12.3, but the current driver on the server is most likely only compatible with CUDA up to 12.2. If your test on GPU works, the assumption above is wrong.
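If it helps, that mismatch can be checked mechanically by comparing the two version strings. A minimal sketch: the `nvidia-smi` sample string below is taken from the output quoted earlier in this thread, and the `nvcc` sample follows the standard `nvcc --version` format for the 12.3.107 toolkit identified by CMake above.

```python
import re

def parse_smi_cuda_version(smi_output):
    """Extract the max CUDA version the driver supports from `nvidia-smi` output."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

def parse_nvcc_version(nvcc_output):
    """Extract the toolkit release from `nvcc --version` output."""
    m = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

# Sample strings matching the outputs discussed in this thread
smi = "| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2 |"
nvcc = "Cuda compilation tools, release 12.3, V12.3.107"

driver_max = parse_smi_cuda_version(smi)   # (12, 2)
toolkit = parse_nvcc_version(nvcc)         # (12, 3)
if toolkit > driver_max:
    print("Toolkit is newer than the driver supports: "
          "expect cudaErrorUnsupportedPtxVersion (error 222).")
```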
It works for me both through the Python API and C++.
Initially, @miquelflorensa installed it, but it seems that it was only installed locally for each user... I had to reinstall it as well, following the same instructions you posted.
In theory, the CUDA toolkit should be installed globally on the server; it is just that in the bindings you need to specify that the version is 12.1. And as you mentioned, there could be a mismatch between the CUDA driver version, which is 12.2, and the CUDA toolkit version, which is 12.1.
@miquelflorensa CUDA 12.1 works just fine (see @jamesgoulet's comment). The issue is that the default CUDA version on the server is 12.3, which is not compatible with the current driver. The rule of thumb is that the toolkit must either match the driver's CUDA version (12.2) or be a bit lower. Version 12.3 appears in my case because I added the following exports to ~/.bashrc; I haven't installed any new CUDA version on the server myself:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
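For what it's worth, the reason these exports change the picture is plain PATH ordering: /usr/local/cuda is typically a symlink to the most recently installed toolkit, and prepending its bin directory shadows any older nvcc. A toy sketch of the lookup (the directory layout and versions are illustrative assumptions, not read from the server):

```python
def first_nvcc(path_dirs, installed):
    """Return the toolkit version of the first PATH directory containing nvcc.
    `installed` maps directory -> toolkit version (a stand-in for the filesystem)."""
    for d in path_dirs:
        if d in installed:
            return installed[d]
    return None

# Assume /usr/local/cuda is a symlink pointing at the 12.3 install.
installed = {"/usr/local/cuda/bin": "12.3", "/usr/local/cuda-12.1/bin": "12.1"}

# With the export, /usr/local/cuda/bin comes first, so 12.3 wins:
print(first_nvcc(["/usr/local/cuda/bin", "/usr/local/cuda-12.1/bin"], installed))  # 12.3
# Without it, the older toolkit would be picked up:
print(first_nvcc(["/usr/local/cuda-12.1/bin", "/usr/bin"], installed))             # 12.1
```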
I will find a workaround to fix it, but the long-term solution is to have a single working CUDA version for all users. In this case, it could be 12.1 or 12.2 (I would prefer 12.2). The rest should be removed in order to avoid any other issues.
@lhnguyen102 I see. I removed every CUDA version installed except for 12.1 for now.
@miquelflorensa Thank you! It all works now
@bhargobdeka I found a quick workaround for your problem. Please replace this line with the following code:
ud_idx_batch = np.zeros((batch_size, 1), dtype=np.int32)
The reason is that the original code creates ud_idx_batch as an empty array, which is not handled properly in the CUDA backend. I will fix this issue in my upcoming PR. Let me know if it works.
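For context, a minimal numpy sketch of the difference (the variable name comes from the snippet above; what is shown is just numpy behavior, not the cuTAGI backend itself):

```python
import numpy as np

batch_size = 10

# What the original code effectively produced: a zero-element array.
# The CUDA backend has nothing to copy to the device for it.
empty_batch = np.array([], dtype=np.int32)
print(empty_batch.size)    # 0

# The workaround: a properly shaped placeholder with one index per sample.
ud_idx_batch = np.zeros((batch_size, 1), dtype=np.int32)
print(ud_idx_batch.shape)  # (10, 1)
print(ud_idx_batch.size)   # 10
```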
@bhargobdeka it is fine if you want to stick to the old version (I will transfer your code to the new version later). It would be great if you could use the newest version, which is well organized and has a PyTorch-like API, but the catch is that there might be bugs here and there.
I will give it a try tomorrow. Thanks! I created the regression benchmark with the new pytagi version. It seems to work on both CPU and CUDA without errors, but the results are not there yet. I will check whether I made any mistakes; otherwise I will discuss with you what I did.
Bhargob
Yes and I am not good in C++ at all :/
If the old code works, then I can generate all the benchmarks really quickly with CUDA.
Both the old code and the new one have a Python API. For the benchmarks, you won't need to touch the C++/CUDA part (see the link). Again, it is up to you.
Yes, I got that part. I will let you know tomorrow morning whether your suggestion worked.
Hi Ha,
The previous pytagi version works with both CPU and CUDA without errors, so first of all, kudos for that. However, I do not see an advantage in training time with CUDA. See the screenshots below: the CPU run (1.04 s) is faster than the CUDA run (1.23 s). Am I doing something wrong?
Let me know if there are additional things that I missed.
Thanks, Bhargob
However, it is much faster with CUDA for larger networks. Here I tried 2 layers with 1000 hidden units and can clearly see the difference it makes.
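One way to see why: the cost of a fully-connected layer grows with the product of its input and output sizes, so going from 50 to 1000 hidden units is roughly 400x more work per layer, enough to amortize CUDA's kernel-launch and host-device transfer overhead. A back-of-the-envelope sketch only; real timings depend on batch size and hardware:

```python
def layer_macs(n_in, n_out, batch=1):
    """Multiply-accumulate count for one fully-connected layer."""
    return n_in * n_out * batch

small = layer_macs(50, 50)      # 2,500 MACs per sample
large = layer_macs(1000, 1000)  # 1,000,000 MACs per sample
print(large // small)           # 400
```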
It says that the data could not be transferred to the device. Let me know if there is a fix for this. Thanks!