sbintuitions / JMTEB

The evaluation scripts of JMTEB (Japanese Massive Text Embedding Benchmark)
Creative Commons Attribution Share Alike 4.0 International

[bug] CUDA OOM occurs in the retriever task when using multi-GPU #50

Closed akiFQC closed 1 month ago

akiFQC commented 1 month ago

The following error occurs when evaluating Mr. TyDi. Conditions

[rank7]:     metrics = evaluator(text_embedder, cache_dir=cache_dir, overwrite_cache=overwrite_cache)
[rank7]:   File "/app/src/jmteb/evaluators/retrieval/evaluator.py", line 118, in __call__
[rank7]:     val_results[dist_name], _ = self._compute_metrics(
[rank7]:   File "/app/src/jmteb/evaluators/retrieval/evaluator.py", line 164, in _compute_metrics
[rank7]:     similarity = dist_func(query_embeddings, doc_embeddings_chunk)
[rank7]:   File "/app/src/jmteb/evaluators/retrieval/evaluator.py", line 301, in euclidean_distance
[rank7]:     return 100 / (torch.cdist(e1, e2) + 1e-4)
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/functional.py", line 1335, in cdist
[rank7]:     return _VF.cdist(x1, x2, p, None)  # type: ignore[attr-defined]
[rank7]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.27 GiB. GPU ^G has a total capacity of 79.15 GiB of which 14.70 GiB is free. Including non-PyTorch memory, this process has 64.46 GiB memory in use. Of the allocated memory 31.60 GiB is allocated by PyTorch, and 30.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
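
For reference, the failing allocation is the dense query-by-document matrix that `torch.cdist(e1, e2)` returns. Below is a rough sketch of how its size scales and how large a document chunk would fit in a given budget. The helper names and the example sizes are hypothetical; they are not taken from JMTEB or from this run. The sketch assumes float32 scores (4 bytes per element). The expression in the traceback, `100 / (torch.cdist(e1, e2) + 1e-4)`, also creates temporaries of the same shape, so the real peak is likely higher than this estimate.

```python
# Illustrative only: these helpers are not part of JMTEB or PyTorch.
GIB = 1024 ** 3


def pairwise_matrix_bytes(num_queries: int, num_docs: int,
                          bytes_per_element: int = 4) -> int:
    """Bytes needed for a dense (num_queries x num_docs) score matrix."""
    return num_queries * num_docs * bytes_per_element


def max_docs_per_chunk(num_queries: int, budget_bytes: int,
                       bytes_per_element: int = 4) -> int:
    """Largest document chunk whose score matrix fits in budget_bytes."""
    return budget_bytes // (num_queries * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical sizes, NOT the actual query/document counts from this issue.
    num_queries = 4_000
    num_docs = 1_000_000
    budget = 8 * GIB  # assumed memory budget for one score matrix

    full = pairwise_matrix_bytes(num_queries, num_docs)
    print(f"full matrix: {full / GIB:.2f} GiB")
    print(f"docs per chunk within budget: {max_docs_per_chunk(num_queries, budget)}")
```

If every rank holds the full set of query embeddings, lowering the document chunk size (or also splitting the queries) is one way to keep this matrix within the free memory reported in the error.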