Closed hollstein closed 9 months ago
Hey Andre,
This most likely looks like an environment issue: some misalignment between your torch virtual env config, your local CUDA config, and/or your local compiler config, such as a version mismatch or an environment variable that is not set properly.
Thanks, @paul7Junior, any idea what to check, test or change? The error message "a constexpr variable declaration must be a definition" seems specific, but I found no way of fixing it.
My best guess is that your g++ compiler is too old. Can you check the version of g++?
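For anyone following along, here is a minimal sketch of how the compiler version could be checked from Python, in the same environment that runs ColBERT. The helper names `parse_gcc_version` and `installed_gxx_version` are made up for this illustration and are not part of ColBERT or PyTorch. The thread does not state a minimum required version, so the script only reports what it finds.

```python
import re
import shutil
import subprocess


def parse_gcc_version(banner: str):
    """Extract (major, minor, patch) from the first line of `g++ --version` output."""
    first_line = banner.splitlines()[0] if banner else ""
    # The version number is the last dotted triple on the first line.
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\s*$", first_line)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_gxx_version(compiler: str = "g++"):
    """Return the version tuple of `compiler`, or None if it is not on PATH."""
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        [compiler, "--version"], capture_output=True, text=True, check=False
    )
    return parse_gcc_version(result.stdout)


if __name__ == "__main__":
    print("g++ version:", installed_gxx_version())
```

Running this inside the conda environment matters, because the `c++` that ninja picks up may differ from the one on the login shell's PATH.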
Thanks @okhat! I was indeed on old compiler versions (gcc 7 and g++ 7). I switched over to a SageMaker instance to get newer versions of them:
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
When I run this:
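from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

nbits = 2  # encode each dimension with 2 bits
doc_maxlen = 200  # an earlier doc_maxlen = 512 assignment was immediately overwritten by this value
checkpoint = 'colbert-ir/colbertv2.0'
index_name = f'dimi.{nbits}bits'

# `collection` (the list of passages) is defined elsewhere and not shown in this post.

# Index data
with Run().context(RunConfig(nranks=1, experiment='test')):  # nranks specifies the number of GPUs to use
    Indexer(
        checkpoint=checkpoint,
        config=ColBERTConfig(
            # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
            doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4
        )
    ).index(
        name=index_name, collection=collection,
        # overwrite='resume',
        overwrite=True
    )

    # Created inside the same Run context so the index resolves under the 'test' experiment.
    searcher = Searcher(index=index_name)
print("EEooFF")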
I still get an error, although a different one:
[Jan 04, 07:38:31] #> Creating directory /root/experiments/test/indexes/dimi.2bits
#> Starting...
nranks = 1 num_gpus = 1 device=0
{
"query_token_id": "[unused0]",
"doc_token_id": "[unused1]",
"query_token": "[Q]",
"doc_token": "[D]",
"ncells": null,
"centroid_score_threshold": null,
"ndocs": null,
"load_index_with_mmap": false,
"index_path": null,
"nbits": 2,
"kmeans_niters": 4,
"resume": false,
"similarity": "cosine",
"bsize": 64,
"accumsteps": 1,
"lr": 1e-5,
"maxsteps": 400000,
"save_every": null,
"warmup": 20000,
"warmup_bert": null,
"relu": false,
"nway": 64,
"use_ib_negatives": true,
"reranker": false,
"distillation_alpha": 1.0,
"ignore_scores": false,
"model_name": null,
"query_maxlen": 32,
"attend_to_mask_tokens": false,
"interaction": "colbert",
"dim": 128,
"doc_maxlen": 200,
"mask_punctuation": true,
"checkpoint": "colbert-ir\/colbertv2.0",
"triples": "\/future\/u\/okhattab\/root\/unit\/experiments\/2021.10\/downstream.distillation.round2.2_score\/round2.nway6.cosine.ib\/examples.64.json",
"collection": [
"list with 149565 elements starting with...",
[
"Replacement Request",
"Dr requests information on whether there is evidence for the use of Vericiguat in patients with right heart dysfunction or valve disease and heart failure. I will be grateful if you include me in copy in the answer.",
"Are ovarian cysts common in women taking pure progestin pills?"
]
],
"queries": "\/future\/u\/okhattab\/data\/MSMARCO\/queries.train.tsv",
"index_name": "dimi.2bits",
"overwrite": false,
"root": "\/root\/experiments",
"experiment": "test",
"index_root": null,
"name": "2024-01\/04\/07.31.46",
"rank": 0,
"nranks": 1,
"amp": true,
"gpus": 1
}
config.json: 100%|██████████| 743/743 [00:00<00:00, 5.55MB/s]
pytorch_model.bin: 100%|██████████| 438M/438M [00:01<00:00, 398MB/s]
tokenizer_config.json: 100%|██████████| 405/405 [00:00<00:00, 2.30MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.28MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.72MB/s]
special_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 628kB/s]
[Jan 04, 07:38:49] [0] # of sampled PIDs = 67784 sampled_pids[:3] = [109214, 2665, 78286]
[Jan 04, 07:38:49] [0] #> Encoding 67784 passages..
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:174 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.16.2+cuda11.8
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] nccl_net_ofi_init:1444 NCCL WARN NET/OFI Only EFA provider is supported
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] nccl_net_ofi_init:1483 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/IB : No device found.
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]veth-app0-2:169.255.255.2<0>
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Using network Socket
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 00/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 01/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 02/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 03/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 04/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 05/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 06/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 07/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 08/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 09/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 10/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 11/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 12/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 13/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 14/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 15/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 16/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 17/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 18/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 19/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 20/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 21/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 22/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 23/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 24/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 25/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 26/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 27/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 28/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 29/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 30/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Channel 31/32 : 0
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO P2P Chunksize set to 131072
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Connected all rings
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO Connected all trees
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:174:270 [0] NCCL INFO comm 0x55cf448d56e0 rank 0 nranks 1 cudaDev 0 busId 1e0 commId 0x610d18cb08558a16 - Init COMPLETE
[Jan 04, 07:42:21] [0] avg_doclen_est = 48.567626953125 len(local_sample) = 67,784
[Jan 04, 07:42:30] [0] Creating 32,768 partitions.
[Jan 04, 07:42:30] [0] *Estimated* 7,264,017 embeddings.
[Jan 04, 07:42:30] [0] #> Saving the indexing plan to /root/experiments/test/indexes/dimi.2bits/plan.json ..
Clustering 3242108 points in 128D to 32768 clusters, redo 1 times, 4 iterations
Preprocessing in 0.42 s
Iteration 3 (36.36 s, search 35.39 s): objective=727518 imbalance=1.371 nsplit=0
[Jan 04, 07:43:09] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Process Process-2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/ColBERT/colbert/infra/launcher.py", line 115, in setup_new_process
return_val = callee(config, *args)
File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode
encoder.run(shared_lists)
File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 68, in run
self.train(shared_lists) # Trains centroids from selected passages
File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 229, in train
bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
File "/root/ColBERT/colbert/indexing/collection_indexer.py", line 307, in _compute_avg_residual
compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
File "/root/ColBERT/colbert/indexing/codecs/residual.py", line 24, in __init__
ResidualCodec.try_load_torch_extensions(self.use_gpu)
File "/root/ColBERT/colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
decompress_residuals_cpp = load(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'decompress_residuals_cpp': [1/3] c++ -MMD -MF decompress_residuals.o.d -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /root/ColBERT/colbert/indexing/codecs/decompress_residuals.cpp -o decompress_residuals.o
[2/3] /opt/conda/bin/nvcc -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_37,code=sm_37 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -std=c++17 -c /root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu -o decompress_residuals.cuda.o
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu: In function ‘at::Tensor decompress_residuals_cuda(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, int, int)’:
/root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu:61:127: warning: ‘T* at::Tensor::data() const [with T = unsigned char]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
61 | decompress_residuals_kernel<<<blocks, threads>>>(
| ^
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:244:1: note: declared here
244 | T * data() const {
| ^ ~~
/root/ColBERT/colbert/indexing/codecs/decompress_residuals.cu:61:593: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
61 | decompress_residuals_kernel<<<blocks, threads>>>(
| ^
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:244:1: note: declared here
244 | T * data() const {
| ^ ~~
[3/3] c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
FAILED: decompress_residuals_cpp.so
c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
The main error seems to be:
[3/3] c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
FAILED: decompress_residuals_cpp.so
c++ decompress_residuals.o decompress_residuals.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o decompress_residuals_cpp.so
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
Due to my SageMaker image I'm on Python 3.10 and not 3.8 as given in the conda_env.yml; however, I suspect this is not Python version related.
Further checking my environment, $CUDA_HOME is /opt/conda/, which looks fine to me:
(base) root@pytorch-2-0-0-gpu-p-ml-g4dn-xlarge-833749c2ea3c2ae27eceeea34f79:~# ls -al $CUDA_HOME/lib/libcuda*
-rw-rw-r-- 1 root root 1021860 Sep 21 2022 /opt/conda//lib/libcudadevrt.a
lrwxrwxrwx 1 root root 20 May 11 2023 /opt/conda//lib/libcudart.so -> libcudart.so.11.8.89
lrwxrwxrwx 1 root root 20 May 11 2023 /opt/conda//lib/libcudart.so.11.0 -> libcudart.so.11.8.89
-rwxrwxr-x 1 root root 695712 Sep 21 2022 /opt/conda//lib/libcudart.so.11.8.89
-rw-rw-r-- 1 root root 1198880 Sep 21 2022 /opt/conda//lib/libcudart_static.a
Looking at LD_LIBRARY_PATH:
/opt/conda/lib/python3.10/site-packages/smdistributed/dataparallel/lib:/opt/amazon/openmpi/lib/:/opt/amazon/efa/lib/:/opt/conda/lib:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib
Any further idea? Thanks!
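One detail stands out in the output above: the failing link step passes `-L/opt/conda/lib64 -lcudart`, while the `ls` listing shows `libcudart.so` under `/opt/conda/lib`. Note also that `LD_LIBRARY_PATH` is consulted by the runtime loader, not by `ld` when it resolves `-l` flags at link time. The sketch below only checks which of the two directories actually contains the library. The helper name `find_cudart` is hypothetical, and the `/opt/conda` fallback is simply the value reported in this post.

```python
import os
from pathlib import Path


def find_cudart(cuda_home: str):
    """Map each candidate library directory under cuda_home to whether it holds libcudart.so."""
    home = Path(cuda_home)
    # lib64 is the directory the failing link command searched; lib is where `ls` found the file.
    return {name: (home / name / "libcudart.so").exists() for name in ("lib64", "lib")}


if __name__ == "__main__":
    cuda_home = os.environ.get("CUDA_HOME", "/opt/conda")
    for name, present in find_cudart(cuda_home).items():
        status = "found" if present else "missing"
        print(f"{os.path.join(cuda_home, name, 'libcudart.so')}: {status}")
```

If the library only shows up under `lib`, one possible workaround (not tested in this thread) is to add that directory to the `LIBRARY_PATH` environment variable, which GCC consults at link time, before re-running the indexing.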
Solved it: using the conda_env.yml on SageMaker worked in the end. Good learning for me. Thanks @okhat and @paul7Junior.
Hi! How did you resolve the problem in the end? I think I have run into exactly the same problem. Thank you so much!
Same issue happened to me. Using the yaml file does not work. Any idea?
Running ColBERT like this:
Gives me this error:
My Python environment is built from conda_env.yml and the code is up to date like this:
Root cause seems to be this file:
With this error:
error: a constexpr variable declaration must be a definition
Any idea how to solve this?
Kind regards, André