terrierteam / pyterrier_colbert


Performance on Robust04 #60

Open andreabac3 opened 1 year ago

andreabac3 commented 1 year ago

Hi @cmacdonald,

After the pull request, the code runs perfectly. However, I have some performance issues.

Requirements: python-terrier==0.9.1 faiss-gpu==1.6.5 pyterrier-colbert==0.0.1
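
These can be installed with pip, assuming all three packages are available from PyPI at those versions:

pip install python-terrier==0.9.1 faiss-gpu==1.6.5 pyterrier-colbert==0.0.1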

To create the index I am executing the following code:


import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

pt.init()  # required before use in PyTerrier 0.9.x

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
indexer = ColBERTIndexer(checkpoint, "./index_robust04", "my_index/", chunksize=3, skip_empty_docs=True)
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index(dataset.get_corpus_iter())

This produced the following output:

#> Sample has shape (4352839, 128)
[feb 10, 15:32:26] Preparing resources for 1 GPUs.
[feb 10, 15:32:26] #> Training with the vectors...
[feb 10, 15:32:26] #> Training now (using 1 GPUs)...
0.06014108657836914
11.042617559432983
0.0002636909484863281
[feb 10, 15:32:37] Done training!

[feb 10, 15:32:37] #> Indexing the vectors...
[feb 10, 15:32:37] #> Loading ('./index_robust04/my_index/0.pt', './index_robust04/my_index/1.pt', './index_robust04/my_index/2.pt') (from queue)...
[feb 10, 15:32:43] #> Processing a sub_collection with shape (36038509, 128)
[feb 10, 15:32:43] Add data with shape (36038509, 128) (offset = 0)..
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
33488896/36038509 (25.997 s)   Flush indexes to CPU
35979264/36038509 (28.914 s)   Flush indexes to CPU
add(.) time: 29.045 s           --               index.ntotal = 36038509
[feb 10, 15:33:12] #> Loading ('./index_robust04/my_index/3.pt', './index_robust04/my_index/4.pt', './index_robust04/my_index/5.pt') (from queue)...
[feb 10, 15:33:13] #> Processing a sub_collection with shape (33680999, 128)
[feb 10, 15:33:13] Add data with shape (33680999, 128) (offset = 36038509)..
33488896/33680999 (25.242 s)   Flush indexes to CPU
33619968/33680999 (26.493 s)   Flush indexes to CPU
add(.) time: 26.553 s           --               index.ntotal = 69719508
[feb 10, 15:33:39] #> Loading ('./index_robust04/my_index/6.pt', './index_robust04/my_index/7.pt', None) (from queue)...
[feb 10, 15:33:40] #> Processing a sub_collection with shape (17337319, 128)
[feb 10, 15:33:40] Add data with shape (17337319, 128) (offset = 69719508)..
17301504/17337319 (12.993 s)   Flush indexes to CPU
add(.) time: 13.636 s           --               index.ntotal = 87056827
[feb 10, 15:33:54] Done indexing!
[feb 10, 15:33:54] Writing index to ./index_robust04/my_index/ivfpq.100.faiss ...
[feb 10, 15:33:55]

Done! All complete (for slice #1 of 1)!
#> Faiss encoding complete
#> Indexing complete, Time elapsed 1143.59 seconds

Then I renamed ivfpq.100.faiss to ivfpq.faiss; otherwise the codebase crashes.
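
For reference, a minimal sketch of that workaround (the numeric suffix, here 100, comes from this particular run and may differ):

import os

index_dir = "./index_robust04/my_index/"
# The indexer writes ivfpq.<N>.faiss, but retrieval looks for ivfpq.faiss
os.rename(os.path.join(index_dir, "ivfpq.100.faiss"),
          os.path.join(index_dir, "ivfpq.faiss"))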

I then tried to run some experiments with the following code:

import pyterrier as pt
from pyterrier_colbert.ranking import ColBERTFactory

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
pytcolbert = ColBERTFactory(checkpoint, "./robust", "my_index/")
dense_e2e = pytcolbert.end_to_end()

# BM25, queries and qrels are defined earlier in my notebook
res = pt.Experiment(
    [BM25, dense_e2e],
    queries,
    qrels,
    eval_metrics=["ndcg_cut_10", "recall"],
    names=["BM25", "Dense ColBERT"],
)

However, the Dense ColBERT results are as follows.

            name  ndcg_cut_10       R@5      R@10     R@15      R@20      R@30     R@100     R@200     R@500    R@1000
0           BM25     0.434104  0.086331  0.140303  0.18080  0.206941  0.249246  0.405284  0.492415  0.610984  0.689337
1  Dense ColBERT     0.062902  0.008783  0.014293  0.01824  0.020652  0.025027  0.041959  0.053847  0.073959  0.088344

Can you help me with this problem?

Thanks in advance, Andrea

cmacdonald commented 1 year ago

Hi Andrea,

ColBERT is limited to 180 wordpiece tokens per document. The robust04 documents are much longer than that. You need to apply some passaging.

Xiao describes this in: https://dl.acm.org/doi/10.1145/3572405
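
For illustration, a minimal passaging sketch (not code from this repo or the paper): split each document into overlapping word windows before indexing, giving each passage a derived docno. The window and stride sizes are arbitrary assumptions; since each word maps to one or more wordpieces, windows should stay comfortably below the 180-token cap.

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

def sliding_passages(corpus_iter, length=150, stride=75):
    # Split each document into overlapping word windows; passage ids follow
    # the docno%p<N> convention so that passage scores can be aggregated
    # back to documents at retrieval time.
    for doc in corpus_iter:
        words = doc["text"].split()
        if not words:
            continue  # skip empty documents
        for p, start in enumerate(range(0, len(words), stride)):
            yield {"docno": f"{doc['docno']}%p{p}",
                   "text": " ".join(words[start:start + length])}

dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer = ColBERTIndexer(checkpoint, "./index_robust04", "my_index_passages/", chunksize=3)
indexer.index(sliding_passages(dataset.get_corpus_iter()))

At retrieval time, PyTerrier's pt.text.max_passage() transformer can fold the passage ranking back into a per-document ranking, e.g. dense_e2e >> pt.text.max_passage().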

Craig

cmacdonald commented 1 year ago

> Then I renamed ivfpq.100.faiss to ivfpq.faiss; otherwise the codebase crashes.

Long-standing pain. PRs accepted graciously!

talk2much commented 6 months ago

Hello, the Robust04 documents are too long, which causes this problem. I changed max_token to 500 for encoding, but my computer has had problems recently and the program cannot run. If you run the code with max_token=500, could you please send me the final ivfpq.faiss?
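
For anyone attempting that, a hedged sketch: ColBERT's document length cap is its doc_maxlen parameter, so one might try raising it before indexing. Whether ColBERTIndexer exposes the ColBERT args object as indexer.args is an assumption; check indexing.py in this repo before relying on it.

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')

indexer = ColBERTIndexer(checkpoint, "./index_robust04", "my_index/", chunksize=3)
# ColBERT truncates each document at doc_maxlen wordpiece tokens (default 180).
# Assumption: the indexer exposes ColBERT's args object as indexer.args.
indexer.args.doc_maxlen = 500
indexer.index(dataset.get_corpus_iter())

Note that a larger doc_maxlen grows the index and encoding time roughly in proportion, and the checkpoint was trained on short passages, so passaging (as suggested above) may still be the safer route.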