rahulseetharaman opened this issue 1 year ago
Could you re-run with the environment variable COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True set? It's possible that you need to erase your torch extensions cache to enable the Torch extension code to compile.
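For example (a sketch of both steps; ~/.cache/torch_extensions is PyTorch's default extensions root, so your path may differ):
export COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True   # show the full JIT build output
rm -rf ~/.cache/torch_extensions                   # force the extensions to recompile from scratch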
Hi, I also get stuck here. I am using the same script. Below is the error:
[May 04, 01:51:32] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Process Process-2:
Traceback (most recent call last):
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/infra/launcher.py", line 115, in setup_new_process
return_val = callee(config, *args)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 33, in encode
encoder.run(shared_lists)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 67, in run
self.train(shared_lists) # Trains centroids from selected passages
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 225, in train
bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 302, in _compute_avg_residual
compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 24, in __init__
ResidualCodec.try_load_torch_extensions(self.use_gpu)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
decompress_residuals_cpp = load(
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/zzh/.cache/torch_extensions/py38_cu113/decompress_residuals_cpp/decompress_residuals_cpp.so: cannot open shared object file: No such file or directory
Is there anyone who could help me solve this bug? I have been stuck here for three days.
What infrastructure are you using to run this? And have you tried setting COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True?
Hi okhat, thanks for your response!
I run this on a single RTX 2080 GPU on Linux. I am not sure what other information you need, so leave me a message if you want more.
I built the environment following conda_env.yml, but my machine is on CUDA 11.3, so I am using torch-1.12.1+cu113-cp38 and other packages that may not strictly match the versions in conda_env.yml.
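As a sanity check that the torch build matches the system CUDA (a sketch, not part of the original report):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvcc --version   # the JIT build of the extensions also needs a system nvcc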
Below is the same error with COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True set:
[May 04, 03:18:56] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Using /home/zzh/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/zzh/.cache/torch_extensions/py38_cu113/decompress_residuals_cpp/build.ninja...
Building extension module decompress_residuals_cpp...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.11.1.git.kitware.jobserver-1
Loading extension module decompress_residuals_cpp...
Process Process-2:
Traceback (most recent call last):
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/infra/launcher.py", line 115, in setup_new_process
return_val = callee(config, *args)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 33, in encode
encoder.run(shared_lists)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 67, in run
self.train(shared_lists) # Trains centroids from selected passages
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 225, in train
bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 302, in _compute_avg_residual
compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 24, in __init__
ResidualCodec.try_load_torch_extensions(self.use_gpu)
File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
decompress_residuals_cpp = load(
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/zzh/.cache/torch_extensions/py38_cu113/decompress_residuals_cpp/decompress_residuals_cpp.so: cannot open shared object file: No such file or directory
Thank you again!
Can you try removing this folder and running again?
/home/zzh/.cache/torch_extensions/py38_cu113
Let us know if this helps
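That is (one command, using the path above):
rm -rf /home/zzh/.cache/torch_extensions/py38_cu113   # delete the cached extension builds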
Thank you for your advice.
I had actually tried removing the folder /home/zzh/.cache/torch_extensions/py38_cu113 and running again several times before asking for your help. Unfortunately, this does not solve the error I am facing.
Best wishes!
I also get stuck here. Have you solved it yet?
Try reducing the number of passages you're indexing? This code has been working for a long time, so we should look into whether some of the dependencies have broken somehow.
I ran into the same issue yesterday (regardless of the number of passages), but removing py38_cu113 from .cache solved it for me.
I tested with a small (10K) and a large (0.6M) number of passages and both worked fine. The issue appeared to originate from interrupting the process during an initial test run. This left some IO hanging in the attempt to load the output files, which produced no errors even with debugging turned on. My setup is an RTX 2080 on Ubuntu 20.04 LTS, so quite comparable to past reported issues. I didn't have to re-do the conda environment, but that might also help in some cases, as might trying CPU-only.
> Try reducing the number of passages you're indexing? This code has been working for a long time, so we should look into whether some of the dependencies have broken somehow.
I also faced this problem and have tried removing the torch_extensions cache, but I am still stuck here. Do you have any solutions? Best wishes!
This is my error log:
[Jun 11, 13:37:02] [0] # of sampled PIDs = 3633 sampled_pids[:3] = [1706, 3001, 41]
[Jun 11, 13:37:02] [0] #> Encoding 3633 passages..
[Jun 11, 13:37:10] [0] avg_doclen_est = 234.9237518310547 len(local_sample) = 3,633
[Jun 11, 13:37:10] [0] Creaing 8,192 partitions.
[Jun 11, 13:37:10] [0] Estimated 853,477 embeddings.
[Jun 11, 13:37:10] [0] #> Saving the indexing plan to /home/zhangtong/ColBERT/experiments/nfcorpus/indexes/nfcorpus.2bits/plan.json ..
Clustering 810805 points in 128D to 8192 clusters, redo 1 times, 20 iterations
Preprocessing in 0.10 s
Iteration 19 (15.56 s, search 15.10 s): objective=183529 imbalance=1.361 nsplit=0
[Jun 11, 13:37:26] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Using /home/zhangtong/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Process Process-2:
Traceback (most recent call last):
File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/zhangtong/ColBERT/colbert/infra/launcher.py", line 115, in setup_new_process
return_val = callee(config, *args)
File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode
encoder.run(shared_lists)
File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 67, in run
self.train(shared_lists) # Trains centroids from selected passages
File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 225, in train
bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 302, in _compute_avg_residual
compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
File "/home/zhangtong/ColBERT/colbert/indexing/codecs/residual.py", line 24, in init
ResidualCodec.try_load_torch_extensions(self.use_gpu)
File "/home/zhangtong/ColBERT/colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
decompress_residuals_cpp = load(
File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1592, in _write_ninja_file_and_build_library
verify_ninja_availability()
File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1648, in verify_ninja_availability
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
Does it mean the version of Ninja is not suitable?
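For reference, PyTorch raises this exact error when it cannot find the ninja executable at all (see verify_ninja_availability in the traceback), not when the version is wrong. A quick check-and-install sketch (assuming pip and apt are available):
ninja --version        # is the build tool on PATH at all?
pip install ninja      # or: sudo apt-get install ninja-build
python -c "from torch.utils.cpp_extension import verify_ninja_availability; verify_ninja_availability()"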
Hi! Facing the same issue as above: RuntimeError: Ninja is required to load C++ extensions
I tried all the methods but couldn't solve this error; it will only run on the CPU.
Hi @zt991211 @palm2333 @zzhheloise, posting a comment here with what worked for me; if you're still facing the issue I was, maybe this can help.
My issue turned out to be an environmental one, which is in line with other people not facing such issues. I tried this on a couple of environments and am currently on WSL2.
Basically, my problem was with CUDA. As I understand it, the CUDA toolkit set up alongside PyTorch in the conda environment provided with this repo comes with the needed runtime libraries, but JIT-compiling the extensions requires nvcc as well. So what worked for me (in WSL2 on Windows 10) was installing cudatoolkit-dev via conda-forge:
conda install -c conda-forge cudatoolkit-dev
Again, this is just what seemed to have worked for me.
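A quick way to confirm nvcc is now visible (assuming the conda environment is active):
which nvcc       # should resolve inside the conda environment
nvcc --version   # the reported CUDA release should be compatible with your torch build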
Hope this is okay @okhat
Thanks!
Folks who are still facing issues can use this Google Colab: https://colab.research.google.com/github/stanford-futuredata/ColBERT/blob/main/docs/intro2new.ipynb
I also faced this issue, consistent with the case described above: the sub-processes crash and are left in a zombie state requiring manual killing. Deleting the cache fixed it for me.
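A rough cleanup sketch for that case (the process pattern is hypothetical; match it to your own launch command):
pkill -9 -f my_indexing_script.py   # hypothetical name; kill any zombie worker processes
rm -rf ~/.cache/torch_extensions    # then clear the stale extension cache before re-running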
I recently encountered a similar issue. I checked that my GPU build of torch and the other packages matched my CUDA version, and I upgraded my GCC version.
I tried several methods, and I'm not certain which one resolved it:
sudo apt update
sudo apt install build-essential
sudo apt-get install ninja-build
conda install -c conda-forge cudatoolkit-dev
rm -rf /root/.cache/torch_extensions/py38_cu113
The cache files are generated at run time, so delete them before each run.
I hope this helps anyone facing the same problem.
@okhat Hi, I am running ColBERT with the following configuration on a single GPU.
Following is the script I am using. I just wanted to see if I could run a quick indexing job end to end.
However, the indexing seems to be stuck at this point.
I am running on a machine with a Quadro RTX 8000 (49GB) and 128GB of RAM.