stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Model stuck in Loading decompress_residuals_cpp extension #195

Open rahulseetharaman opened 1 year ago

rahulseetharaman commented 1 year ago

@okhat Hi, I am running ColBERT with the following configuration on a single GPU.

Below is the script I am using. I just wanted to see if I could run a quick indexing job end to end.

import os
import sys
sys.path.insert(0, '../')

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

if __name__ == '__main__':
    dataroot = 'downloads/lotte'
    dataset = 'lifestyle'
    datasplit = 'dev'

    queries = os.path.join(dataroot, dataset, datasplit, 'questions.search.tsv')
    collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')

    queries = Queries(path=queries)
    collection = Collection(path=collection)

    print(f'Loaded {len(queries)} queries and {len(collection):,} passages')

    print(queries[24])
    print()
    print(collection[89852])
    print()

    nbits = 2   # encode each dimension with 2 bits
    doc_maxlen = 300   # truncate passages at 300 tokens

    checkpoint = 'downloads/colbertv2.0'
    index_name = f'{dataset}.{datasplit}.{nbits}bits'

    with Run().context(RunConfig(nranks=1, experiment='msmarco')):  # nranks specifies the number of GPUs to use.
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)

        indexer = Indexer(checkpoint=checkpoint, config=config)
        indexer.index(name=index_name, collection=collection[:20], overwrite=True)

    print(indexer.get_index()) # You can get the absolute path of the index, if needed.

However, the indexing seems to be stuck at this point.

WARNING clustering 2687 points to 512 centroids: please provide at least 19968 training points
Clustering 2687 points in 128D to 512 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (1.10 s, search 0.23 s): objective=463.491 imbalance=1.616 nsplit=0
[Apr 22, 02:32:04] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

I am running on a machine with a Quadro RTX 8000 (49GB) and 128GB of RAM.

santhnm2 commented 1 year ago

Could you re-run and set the environment variable COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True? It's possible that you need to erase your torch extensions cache to enable the Torch extension code to compile.
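
For reference, a minimal sketch of both suggestions (assuming the default torch extension cache location under ~/.cache/torch_extensions), placed at the very top of the indexing script before ColBERT is imported:

import os
import shutil

# Print the full torch-extension build output instead of only the terse "Loading ..." line.
os.environ['COLBERT_LOAD_TORCH_EXTENSION_VERBOSE'] = 'True'

# Wipe the JIT-extension cache so decompress_residuals_cpp is rebuilt from scratch.
# The exact subfolder (e.g. py38_cu113) depends on the Python/CUDA build of torch.
cache_dir = os.path.expanduser('~/.cache/torch_extensions')
shutil.rmtree(cache_dir, ignore_errors=True)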

zzhheloise commented 1 year ago

Hi, I also get stuck here. I am using the same script. Below is the error:

[May 04, 01:51:32] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Process Process-2:
Traceback (most recent call last):
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 67, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 225, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 302, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 24, in __init__
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/zzh/.cache/torch_extensions/py38_cu113/decompress_residuals_cpp/decompress_residuals_cpp.so: cannot open shared object file: No such file or directory

Is there anyone who could help me solve this bug? I have been stuck here for three days.

okhat commented 1 year ago

What infrastructure are you using to run this?

okhat commented 1 year ago

And have you tried setting COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True?

zzhheloise commented 1 year ago

Hi, okhat. Thanks for your response!

I run this on a single RTX 2080 GPU on Linux. I am not sure what other information you need, so leave me a message if you want more.

I built an environment following conda_env.yml, but my machine's CUDA version is 11.3, so I am using torch-1.12.1+cu113-cp38 and other packages that may not strictly match the versions in conda_env.yml.

Below is the same error when setting COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True.

[May 04, 03:18:56] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Using /home/zzh/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/zzh/.cache/torch_extensions/py38_cu113/decompress_residuals_cpp/build.ninja...
Building extension module decompress_residuals_cpp...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.11.1.git.kitware.jobserver-1
Loading extension module decompress_residuals_cpp...
Process Process-2:
Traceback (most recent call last):
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 67, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 225, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/collection_indexer.py", line 302, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 24, in __init__
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "/mntnfs/med_data5/zzh/Dream-of-red-chamber/codes/ColBERT/docs/../colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/mntnfs/med_data5/zzh/anaconda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/zzh/.cache/torch_extensions/py38_cu113/decompress_residuals_cpp/decompress_residuals_cpp.so: cannot open shared object file: No such file or directory

Thank you again!
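
As a quick environment sanity check, here is a minimal sketch using only standard PyTorch APIs to confirm that the installed torch build and the machine's CUDA setup agree (the cu113 build mentioned above is just the example from this thread, not a requirement):

import torch

# The CUDA version torch was compiled against must be usable on this machine
# for the JIT build of decompress_residuals_cpp to succeed.
print('torch version:', torch.__version__)
print('torch built with CUDA:', torch.version.cuda)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('device:', torch.cuda.get_device_name(0))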

santhnm2 commented 1 year ago

Can you try removing this folder and running again? /home/zzh/.cache/torch_extensions/py38_cu113

okhat commented 1 year ago

Let us know if this helps

zzhheloise commented 1 year ago

Thank you for your advice.

I had actually tried removing the folder /home/zzh/.cache/torch_extensions/py38_cu113 and running again several times before asking for your help. Unfortunately, this does not solve the error I am facing.

Best wishes!

palm2333 commented 1 year ago

I also get stuck here. Have you solved it yet?

okhat commented 1 year ago

Try reducing the number of passages you're indexing? This code has been working for a long time, so we should look into whether some of the dependencies have broken somehow.

nstylia commented 1 year ago

I ran into the same issue yesterday (regardless of the number of passages), but removing py38_cu113 from .cache solved it for me.

I tested with both a small (10K) and a large (0.6M) number of passages, and both worked fine. The issue appeared to originate from interrupting the process during an initial test run. This left some I/O hanging in the attempt to load the output files, which produced no errors even with debugging turned on. My setup is an RTX 2080 on Ubuntu 20.04 LTS, so it is quite comparable to previously reported issues. I didn't have to recreate the conda environment, but that might also help in some cases, as might trying CPU-only.

zt991211 commented 1 year ago

Try reducing the number of passages you're indexing? This code has been working for a long time, so we should look into whether some of the dependencies have broken somehow.

I also faced this problem and have tried removing the torch_extensions cache, but I still get stuck here. Do you have any solutions? Best wishes!

This is my error log:

[Jun 11, 13:37:02] [0]           # of sampled PIDs = 3633        sampled_pids[:3] = [1706, 3001, 41]
[Jun 11, 13:37:02] [0]           #> Encoding 3633 passages..
[Jun 11, 13:37:10] [0]           avg_doclen_est = 234.9237518310547      len(local_sample) = 3,633
[Jun 11, 13:37:10] [0]           Creaing 8,192 partitions.
[Jun 11, 13:37:10] [0]           Estimated 853,477 embeddings.
[Jun 11, 13:37:10] [0]           #> Saving the indexing plan to /home/zhangtong/ColBERT/experiments/nfcorpus/indexes/nfcorpus.2bits/plan.json ..
Clustering 810805 points in 128D to 8192 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.10 s
  Iteration 19 (15.56 s, search 15.10 s): objective=183529 imbalance=1.361 nsplit=0
[Jun 11, 13:37:26] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Using /home/zhangtong/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Process Process-2:
Traceback (most recent call last):
  File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhangtong/ColBERT/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
  File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 67, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 225, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/home/zhangtong/ColBERT/colbert/indexing/collection_indexer.py", line 302, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
  File "/home/zhangtong/ColBERT/colbert/indexing/codecs/residual.py", line 24, in __init__
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "/home/zhangtong/ColBERT/colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
  File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1592, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/zhangtong/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1648, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

zt991211 commented 1 year ago

Does it mean the version of Ninja is not suitable?
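
The RuntimeError above is raised when PyTorch cannot find the ninja executable at all, so it points to a missing build tool rather than an unsuitable version. A minimal check (assuming ninja can be added to the active environment, e.g. with pip install ninja or conda install ninja):

from torch.utils.cpp_extension import is_ninja_available, verify_ninja_availability

print('ninja on PATH:', is_ninja_available())

# Raises the same "Ninja is required to load C++ extensions" error if ninja is missing.
verify_ninja_availability()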

AbhimanyuSethi-98 commented 1 year ago

Hi! Facing the same issue as above: RuntimeError: Ninja is required to load C++ extensions

palm2333 commented 1 year ago

Hi! Facing the same issue as above: RuntimeError: Ninja is required to load C++ extensions

I tried all the methods but couldn't solve this error; it only runs on the CPU.

AbhimanyuSethi-98 commented 1 year ago

Hi @zt991211 @palm2333 @zzhheloise, putting a comment here with what worked for me, so if you're still facing the issue I was, maybe this can help. My issue turned out to be an environmental one, which is in line with other people not facing such issues. I tried this on a couple of environments and am currently on WSL2.

Basically, my problem was with CUDA. As I understand it, the CUDA toolkit set up with pytorch in the conda environment provided with this repo comes with the needed runtime libraries, but the code requires nvcc as well. So what worked for me (in WSL2 on Windows 10) was installing cudatoolkit-dev via conda-forge:

conda install -c conda-forge cudatoolkit-dev

Again, this is just what seemed to have worked for me. Hope this is okay @okhat

Thanks!
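
Along the same lines, a small diagnostic sketch (it only reports whether nvcc and a CUDA toolkit are visible to torch, which is what the JIT build of decompress_residuals_cpp needs; it does not change anything):

import shutil

from torch.utils.cpp_extension import CUDA_HOME

# The extension is compiled with nvcc at load time, so a full CUDA toolkit
# (e.g. the conda-forge cudatoolkit-dev package mentioned above) must be present,
# not just the runtime libraries bundled with the conda pytorch package.
print('nvcc on PATH:', shutil.which('nvcc') or 'not found')
print('CUDA_HOME seen by torch:', CUDA_HOME)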

okhat commented 1 year ago

Folks who are still facing issues can use this Google Colab: https://colab.research.google.com/github/stanford-futuredata/ColBERT/blob/main/docs/intro2new.ipynb

andrenatal commented 11 months ago

I also faced this same issue, consistent with the case where the sub-processes crash and are left in a zombie state that requires manual killing. Deleting the cache fixed it for me.

Cookiesukaze commented 3 months ago

I recently encountered a similar issue. I checked that the GPU build of torch and the other packages matched my CUDA version, and upgraded my GCC version. I then tried several methods, and I'm not certain which one resolved it:

sudo apt update
sudo apt install build-essential
sudo apt-get install ninja-build
conda install -c conda-forge cudatoolkit-dev
rm -rf /root/.cache/torch_extensions/py38_cu113

The cache files are regenerated on every run, so delete them before running. I hope this helps anyone facing the same problem.