state-spaces / s4

Structured state space sequence models
Apache License 2.0

Error after installing CUDA extension for Cauchy multiplication #62

Open gitbooo opened 2 years ago

gitbooo commented 2 years ago

I'm trying to reproduce the experiments, but the code is returning a KeyError: 'nvrtc', and the warning [src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found still appears.

Otherwise, I'm getting this error: Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS

albertfgu commented 2 years ago

Can you elaborate on the 'nvrtc' error? Can you uninstall and reinstall the extension (pip uninstall cauchy-mult and cd extensions/cauchy && python setup.py install) and copy what it prints?

Does the code run if you completely uninstall the extension? What about if you install pykeops?

gitbooo commented 2 years ago

After doing multiple tests, I realized that the Cauchy extension is not the problem (although it is strange that even after installing the extension, the code still returns "CUDA extension for cauchy multiplication not found"); rather, it is this second error that I cannot resolve:

Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f8411eaaf06]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f8411ea28e5]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f8411dc7e09]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f8411dc5948]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f8411d80b46]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f84117e546a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f84893fb161]
/lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f84893fb25a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f84893d9bfe]
python(+0x2125d4) [0x564a6c3b15d4]
/var/spool/slurmd/job2192901/slurm_script: line 18: 17719 Aborted

albertfgu commented 2 years ago

I haven't seen this error before. Just to confirm, this happens even with the extension uninstalled? Does your environment work with other codebases? Outside of the extension, there is nothing fancy with requirements for this repository.

gitbooo commented 2 years ago

Yeah, the extension is not installed. However, I'm getting this error at the end of training, after epoch 9 finishes.

danassou commented 2 years ago

Hi, I'm also getting the same error at the end of training (running python -m train experiment=forecasting/s4-informer-{etth,ettm,ecl,weather}):

Epoch 9: 100%|█▉| 1510/1511 [00:24<00:00, 62.27it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0242, train/loss=0.0242Epoch 9, global step 4809: 'val/loss' was not in top 1                                                                                                                             
Epoch 9: 100%|██| 1511/1511 [00:24<00:00, 62.14it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0231, train/loss=0.0231]
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fbaccc77f06]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fbaccc6f8e5]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fbaccb94e09]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fbaccb92948]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fbaccb4db46]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fbacc5b246a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7fbb3f73f031]
/lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7fbb3f73f12a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7fbb3f71dc8e]
python(+0x2010a0) [0x56320cea20a0]
Aborted (core dumped)

So it does train, but it ends with this strange assertion error. After looking around, it seems to be an error that many people have run into with aws-sdk-cpp; for example, see https://github.com/huggingface/datasets/issues/3310

albertfgu commented 2 years ago

Thanks for the additional info! Does this error occur if you uninstall the datasets package then? Does it only happen with AWS?

danassou commented 2 years ago

I can't run the code without the datasets library since it's required - I get a ModuleNotFoundError if I do. To clarify, I'm not running my code on AWS, I'm using my university's cluster (I don't really understand why AWS-related errors pop up, to be honest!)

albertfgu commented 2 years ago

You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py
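
If it helps, here is a minimal sketch of that change. The exact contents of src/dataloaders/__init__.py may differ and the module names here are assumptions; the idea is just to make the import optional so the rest of the dataloaders work without the HuggingFace datasets package:

```python
# src/dataloaders/__init__.py (sketch; exact import names are assumptions)
# Make the LRA dataloaders optional so a missing `datasets` package
# does not break the other tasks.
try:
    from . import lra  # pulls in HuggingFace `datasets`; only needed for LRA
except ModuleNotFoundError:
    lra = None  # LRA dataloaders unavailable; other experiments still work
```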

gitbooo commented 2 years ago

> You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py

The code seems to work on CPU without errors. However, I'm getting a KeyError: 'nvrtc' with pykeops installed. Can you tell us which pykeops version you are using?

albertfgu commented 2 years ago
  1. Does it run when pykeops is uninstalled?
  2. Are you able to install the CUDA extension instead?
  3. Can you try pip install pykeops==1.5? Later versions of pykeops sometimes cause installation errors for me.
  4. What happens if you follow the instructions on the pykeops page for testing the installation? A minimal version of that check is sketched below.
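
For reference, here is roughly what that installation check looks like (a sketch from memory of the pykeops docs; the function names below exist in the 1.x releases but may differ in other versions):

```python
# Sketch of the pykeops installation check, assuming pykeops ~1.5.
# If these tests fail or error out during compilation, the KeyError: 'nvrtc'
# is most likely an installation / CUDA toolkit issue rather than a bug in this repo.
import pykeops

pykeops.clean_pykeops()        # clear any stale compiled binaries
pykeops.test_numpy_bindings()  # compiles and runs a small NumPy test kernel
pykeops.test_torch_bindings()  # same check through the PyTorch bindings
```
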
farshchian commented 2 years ago

I am also facing the exact same issue. @gitbooo have you found a solution?

albertfgu commented 2 years ago
  1. Without pykeops, the code should still run on GPU. Is there a reason you can only use CPU?
  2. I don't know why the extension isn't working. One note is that it has to be installed for every environment (e.g. for each different GPU, CUDA version, etc.). For example, it won't work if different machines share a conda environment; you would need to create a separate conda environment for each machine type and install the extension in each one. A quick import check is sketched after this list.
  3. I've seen that message several times in the past, and I think it was always caused by an improper install. Installing into a fresh environment along with the latest version of cmake was the solution (pip install pykeops==1.5 cmake).
  4. Were you able to comment out the datasets dependency? It should involve changing one line of code in src/dataloaders/__init__.py
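
Regarding point 2, a quick way to check whether the compiled extension is actually visible from a given environment is something like the sketch below (the module name cauchy_mult is an assumption based on the cauchy-mult package name earlier in this thread):

```python
# Hypothetical check: run inside each conda environment / on each machine type
# to see whether the compiled Cauchy CUDA extension is importable there.
try:
    import cauchy_mult  # name assumed from the `cauchy-mult` package above
    print("Cauchy CUDA extension found at:", cauchy_mult.__file__)
except ImportError as err:
    print("Cauchy CUDA extension not importable in this environment:", err)
```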