stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

FAILED: decompress_residuals.cuda.o, ninja: build stopped: subcommand failed (from: ColBERT/colbert/indexing/codecs/decompress_residuals.cu) #287

Closed · hollstein closed this 9 months ago

hollstein commented 9 months ago

Running ColBERT like this:

from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

nbits = 2          # encode each dimension with 2 bits
# doc_maxlen = 512
doc_maxlen = 200   # maximum document length in tokens
checkpoint = 'colbert-ir/colbertv2.0'
index_name = f'{nbits}bits'

# Index data
with Run().context(RunConfig(nranks=1, experiment='test')):  # nranks specifies the number of GPUs to use
    Indexer(
        checkpoint=checkpoint, 
        config=ColBERTConfig(
            # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
            doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4
        )
    ).index(
        name=index_name, collection=collection,  # `collection` is the list of passages to index, prepared earlier
        #overwrite='resume',
        overwrite=True
    )
    searcher = Searcher(index=index_name)
print("EEooFF")

Gives me this error:

Process Process-6:
Traceback (most recent call last):
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/andre_hollstein/colbert/ColBERT/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
  File "/home/andre_hollstein/colbert/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/home/andre_hollstein/colbert/ColBERT/colbert/indexing/collection_indexer.py", line 67, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/home/andre_hollstein/colbert/ColBERT/colbert/indexing/collection_indexer.py", line 225, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/home/andre_hollstein/colbert/ColBERT/colbert/indexing/collection_indexer.py", line 305, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
  File "/home/andre_hollstein/colbert/ColBERT/colbert/indexing/codecs/residual.py", line 24, in __init__
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "/home/andre_hollstein/colbert/ColBERT/colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'decompress_residuals_cpp': [1/2] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/TH -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/THC -isystem /home/andre_hollstein/.conda/envs/colbert/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++14 -c /home/andre_hollstein/colbert/ColBERT/colbert/indexing/codecs/decompress_residuals.cu -o decompress_residuals.cuda.o 
FAILED: decompress_residuals.cuda.o 
/usr/bin/nvcc  -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/TH -isystem /home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/THC -isystem /home/andre_hollstein/.conda/envs/colbert/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++14 -c /home/andre_hollstein/colbert/ColBERT/colbert/indexing/codecs/decompress_residuals.cu -o decompress_residuals.cuda.o 
/home/andre_hollstein/.conda/envs/colbert/lib/python3.8/site-packages/torch/include/pybind11/detail/common.h(1040): error: a constexpr variable declaration must be a definition

1 error detected in the compilation of "/tmp/tmpxft_00007c8a_00000000-6_decompress_residuals.cpp1.ii".
ninja: build stopped: subcommand failed.

My Python environment is built from conda_env.yml, and the code is kept up to date like this:

!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
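
For completeness, the environment itself was created from that file roughly like this (a sketch, assuming conda_env.yml at the ColBERT repo root; the env name colbert matches the conda paths in the traceback below):

conda env create -f ColBERT/conda_env.yml
conda activate colbert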

The root cause seems to be this file:

ColBERT/colbert/indexing/codecs/decompress_residuals.cu

which fails with: error: a constexpr variable declaration must be a definition
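
If more build detail would help, I can re-run with the verbose extension-build flag that ColBERT's loader mentions in its log output:

export COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True   # asks ColBERT to pass verbose=True to torch's extension loader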

Any idea how to solve this?

Kind regards, André

paul7Junior commented 9 months ago

Hey Andre,

It looks most likely like an environment issue: some misalignment between your torch virtual environment, your local CUDA installation, and/or your local compiler setup, such as a version mismatch or an environment variable that is not set properly.
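
A quick sanity check (nothing ColBERT-specific, just a diagnostic sketch) is to compare the CUDA version your torch build expects against the local nvcc and g++ that the JIT extension build will pick up:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvcc --version    # should roughly match torch.version.cuda
g++ --version     # must be new enough for the C++ standard flags torch passes to the compilers
echo $CUDA_HOME   # torch.utils.cpp_extension uses this to locate nvcc and the CUDA libraries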

hollstein commented 9 months ago

Thanks, @paul7Junior. Any idea what to check, test, or change? The error message a constexpr variable declaration must be a definition seems specific, but I haven't found a way to fix it.

okhat commented 9 months ago

My best guess is that your g++ compiler is too old. Can you check the version of g++?
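
For example:

g++ --version   # torch's JIT extension build uses the system C++ compiler unless CXX is set
gcc --version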

hollstein commented 9 months ago

Thanks @okhat! I was indeed on old compiler versions (gcc 7 and g++ 7). I switched over to a SageMaker instance to get newer versions of them:

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

When I run this:

nbits = 2          # encode each dimension with 2 bits
# doc_maxlen = 512
doc_maxlen = 200   # maximum document length in tokens
checkpoint = 'colbert-ir/colbertv2.0'
index_name = f'dimi.{nbits}bits'

# Index data
with Run().context(RunConfig(nranks=1, experiment='test')):  # nranks specifies the number of GPUs to use
    Indexer(
        checkpoint=checkpoint, 
        config=ColBERTConfig(
            # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
            doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4
        )
    ).index(
        name=index_name, collection=collection, 
        #overwrite='resume',
        overwrite=True
    )
    searcher = Searcher(index=index_name)
print("EEooFF")

I still get an error, although a different one:

[Jan 04, 07:38:31] #> Creating directory /root/experiments/test/indexes/dimi.2bits 

#> Starting...
nranks = 1   num_gpus = 1    device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 4,
    "resume": false,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 1e-5,
    "maxsteps": 400000,
    "save_every": null,
    "warmup": 20000,
    "warmup_bert": null,
    "relu": false,
    "nway": 64,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": null,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 200,
    "mask_punctuation": true,
    "checkpoint": "colbert-ir\/colbertv2.0",
    "triples": "\/future\/u\/okhattab\/root\/unit\/experiments\/2021.10\/downstream.distillation.round2.2_score\/round2.nway6.cosine.ib\/examples.64.json",
    "collection": [
        "list with 149565 elements starting with...",
        [
            "Replacement Request",
            "Dr requests information on whether there is evidence for the use of Vericiguat in patients with right heart dysfunction or valve disease and heart failure. I will be grateful if you include me in copy in the answer.",
            "Are ovarian cysts common in women taking pure progestin pills?"
        ]
    ],
    "queries": "\/future\/u\/okhattab\/data\/MSMARCO\/queries.train.tsv",
    "index_name": "dimi.2bits",
    "overwrite": false,
    "root": "\/root\/experiments",
    "experiment": "test",
    "index_root": null,
    "name": "2024-01\/04\/07.31.46",
    "rank": 0,
    "nranks": 1,
    "amp": true,
    "gpus": 1
}
config.json: 100%|██████████| 743/743 [00:00<00:00, 5.55MB/s]
pytorch_model.bin: 100%|██████████| 438M/438M [00:01<00:00, 398MB/s] 
tokenizer_config.json: 100%|██████████| 405/405 [00:00<00:00, 2.30MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.28MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.72MB/s]
special_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 628kB/s]
[Jan 04, 07:38:49] [0]       # of sampled PIDs = 67784   sampled_pids[:3] = [109214, 2665, 78286]
[Jan 04, 07:38:49] [0]       #> Encoding 67784 passages..
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.16.2+cuda11.8
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1

pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] nccl_net_ofi_init:1444 NCCL WARN NET/OFI Only EFA provider is supported

pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] nccl_net_ofi_init:1483 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/IB : No device found.
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]veth-app0-2:169.255.255.2<0>
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Using network Socket
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 00/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 01/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 02/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 03/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 04/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 05/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 06/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 07/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 08/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 09/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 10/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 11/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 12/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 13/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 14/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 15/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 16/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 17/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 18/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 19/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 20/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 21/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 22/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 23/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 24/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 25/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 26/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 27/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 28/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 29/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 30/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 31/32 :    0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO P2P Chunksize set to 131072
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Connected all rings
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Connected all trees
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO comm 0x55cf448d56e0 rank 0 nranks 1 cudaDev 0 busId 1e0 commId 0x610d18cb08558a16 - Init COMPLETE
[Jan 04, 07:42:21] [0]       avg_doclen_est = 48.567626953125    len(local_sample) = 67,784
[Jan 04, 07:42:30] [0]       Creating 32,768 partitions.
[Jan 04, 07:42:30] [0]       *Estimated* 7,264,017 embeddings.
[Jan 04, 07:42:30] [0]       #> Saving the indexing plan to /root/experiments/test/indexes/dimi.2bits/plan.json ..
Clustering 3242108 points in 128D to 32768 clusters, redo 1 times, 4 iterations
  Preprocessing in 0.42 s
  Iteration 3 (36.36 s, search 35.39 s): objective=727518 imbalance=1.371 nsplit=0           
[Jan 04, 07:43:09] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Process Process-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/ColBERT/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
  File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 229, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 307, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
  File "/root/ColBERT/colbert/indexing/codecs/residual.py", line 24, in __init__
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "/root/ColBERT/colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'decompress_residuals_cpp': [1/3] c++ -MMD -MF decompress_residuals.o.d -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /root/ColBERT/colbert/indexing/codecs/decompress_residuals.cpp -o decompress_residuals.o 
[2/3] /opt/conda/bin/nvcc  -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_37,code=sm_37 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -std=c++17 -c /root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu -o decompress_residuals.cuda.o 
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu: In function ‘at::Tensor decompress_residuals_cuda(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, int, int)’:
/root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu:61:127: warning: ‘T* at::Tensor::data() const [with T = unsigned char]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   61 |     decompress_residuals_kernel<<<blocks, threads>>>(
      |                                                                                                                               ^
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:244:1: note: declared here
  244 |   T * data() const {
      | ^ ~~
/root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu:61:593: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   61 |     decompress_residuals_kernel<<<blocks, threads>>>(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:244:1: note: declared here
  244 |   T * data() const {
      | ^ ~~
[3/3] c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
FAILED: decompress_residuals_cpp.so 
c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

The main error seems to be:

[3/3] c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
FAILED: decompress_residuals_cpp.so 
c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status

Because of my SageMaker image I'm on Python 3.10 rather than the 3.8 given in conda_env.yml, but I suspect this is not Python-version related.

Checking my environment further, $CUDA_HOME is /opt/conda/, which looks fine to me:

(base) root@pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:~# ls -al $CUDA_HOME/lib/libcuda*
-rw-rw-r-- 1 root root 1021860 Sep 21  2022 /opt/conda//lib/libcudadevrt.a
lrwxrwxrwx 1 root root      20 May 11  2023 /opt/conda//lib/libcudart.so -> libcudart.so.11.8.89
lrwxrwxrwx 1 root root      20 May 11  2023 /opt/conda//lib/libcudart.so.11.0 -> libcudart.so.11.8.89
-rwxrwxr-x 1 root root  695712 Sep 21  2022 /opt/conda//lib/libcudart.so.11.8.89
-rw-rw-r-- 1 root root 1198880 Sep 21  2022 /opt/conda//lib/libcudart_static.a

Looking at LD_LIBRARY_PATH:

/opt/conda/lib/python3.10/site-packages/smdistributed/dataparallel/lib:/opt/amazon/openmpi/lib/:/opt/amazon/efa/lib/:/opt/conda/lib:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib
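
One guess from reading the failing link line: it only searches -L/opt/conda/lib64, while the listing above shows libcudart.so under /opt/conda/lib, so ld has nowhere to find -lcudart. An untested workaround sketch would be to expose that directory to the link step before launching Python:

export LIBRARY_PATH=$CUDA_HOME/lib:$LIBRARY_PATH   # g++ consults LIBRARY_PATH when resolving -l flags at link time
# or, if /opt/conda/lib64 does not already exist, point it at lib:
# ln -s /opt/conda/lib /opt/conda/lib64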

Any further idea? Thanks!

hollstein commented 9 months ago

Solved it: using the conda_env.yml on SageMaker worked in the end. A good lesson for me. Thanks @okhat and @paul7Junior.

heshanxiu commented 3 months ago

Solved it: using the conda_env.yml on SageMaker worked in the end. A good lesson for me. Thanks @okhat and @paul7Junior.

Hi! I wonder how you resolved the problem in the end? I think I'm hitting the exact same problem. Thank you so much!

ysunbp commented 2 months ago

Solved it: using the conda_env.yml on SageMaker worked in the end. A good lesson for me. Thanks @okhat and @paul7Junior.

The same issue happened to me, and using the yaml file does not work. Any ideas?