Open andreabac3 opened 1 year ago
Hi Andrea,
ColBERT is limited to 180 wordpiece tokens per document. The robust04 documents are much longer than that. You need to apply some passaging.
Xiao describes this in: https://dl.acm.org/doi/10.1145/3572405
Craig
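For reference, "passaging" here just means splitting each long Robust04 document into shorter overlapping passages (each within the 180-wordpiece budget) before it reaches the indexer. A minimal sketch, assuming whitespace tokenisation; sliding_passages is a hypothetical helper and the 128/64 window/stride values are illustrative, chosen to stay safely under 180 wordpieces (PyTerrier also ships a pt.text.sliding transformer that does something similar for DataFrame-based pipelines):

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

def sliding_passages(corpus_iter, length=128, stride=64):
    # Hypothetical helper: emit overlapping word windows per document,
    # giving each passage a distinct docno suffix (e.g. FBIS3-1%p0, FBIS3-1%p1, ...).
    for doc in corpus_iter:
        words = doc["text"].split()
        for pid, start in enumerate(range(0, max(len(words), 1), stride)):
            passage = " ".join(words[start:start + length])
            if passage:
                yield {"docno": f"{doc['docno']}%p{pid}", "text": passage}

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer = ColBERTIndexer(checkpoint, "./index_robust04", "my_index/", chunksize=3)
indexer.index(sliding_passages(dataset.get_corpus_iter()))

At retrieval time the passage scores then need to be aggregated back to document level (e.g. max-passage) before evaluating against the document-level qrels.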
> Then I renamed ivfpq.100.faiss to ivfpq.faiss, otherwise the codebase crashes.
Long-standing pain. PRs accepted graciously!
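Until that is fixed, a small workaround sketch for anyone hitting the same crash (paths follow the indexing call reported below; the "100" suffix presumably reflects the number of faiss partitions used):

import os

# Workaround sketch: the indexer wrote ivfpq.100.faiss, while loading expects ivfpq.faiss.
index_dir = "./index_robust04/my_index"
src = os.path.join(index_dir, "ivfpq.100.faiss")
dst = os.path.join(index_dir, "ivfpq.faiss")
if os.path.exists(src) and not os.path.exists(dst):
    os.rename(src, dst)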
Hi @cmacdonald,
After the pull request, the code runs perfectly. However, I have a performance issue.
Requirements: python-terrier==0.9.1 faiss-gpu==1.6.5 pyterrier-colbert==0.0.1
To create the index I am running the following code:
import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
indexer = ColBERTIndexer(checkpoint, "./index_robust04", "my_index/", chunksize=3, skip_empty_docs=True)
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index(dataset.get_corpus_iter())
With the following output:
#> Sample has shape (4352839, 128)
[feb 10, 15:32:26] Preparing resources for 1 GPUs.
[feb 10, 15:32:26] #> Training with the vectors...
[feb 10, 15:32:26] #> Training now (using 1 GPUs)...
0.06014108657836914
11.042617559432983
0.0002636909484863281
[feb 10, 15:32:37] Done training!
[feb 10, 15:32:37] #> Indexing the vectors...
[feb 10, 15:32:37] #> Loading ('./index_robust04/my_index/0.pt', './index_robust04/my_index/1.pt', './index_robust04/my_index/2.pt') (from queue)...
[feb 10, 15:32:43] #> Processing a sub_collection with shape (36038509, 128)
[feb 10, 15:32:43] Add data with shape (36038509, 128) (offset = 0)..
IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
33488896/36038509 (25.997 s)   Flush indexes to CPU
35979264/36038509 (28.914 s)   Flush indexes to CPU
add(.) time: 29.045 s      -- index.ntotal = 36038509
[feb 10, 15:33:12] #> Loading ('./index_robust04/my_index/3.pt', './index_robust04/my_index/4.pt', './index_robust04/my_index/5.pt') (from queue)...
[feb 10, 15:33:13] #> Processing a sub_collection with shape (33680999, 128)
[feb 10, 15:33:13] Add data with shape (33680999, 128) (offset = 36038509)..
33488896/33680999 (25.242 s)   Flush indexes to CPU
33619968/33680999 (26.493 s)   Flush indexes to CPU
add(.) time: 26.553 s      -- index.ntotal = 69719508
[feb 10, 15:33:39] #> Loading ('./index_robust04/my_index/6.pt', './index_robust04/my_index/7.pt', None) (from queue)...
[feb 10, 15:33:40] #> Processing a sub_collection with shape (17337319, 128)
[feb 10, 15:33:40] Add data with shape (17337319, 128) (offset = 69719508)..
17301504/17337319 (12.993 s)   Flush indexes to CPU
add(.) time: 13.636 s      -- index.ntotal = 87056827
[feb 10, 15:33:54] Done indexing!
[feb 10, 15:33:54] Writing index to ./index_robust04/my_index/ivfpq.100.faiss ...
[feb 10, 15:33:55] Done!
All complete (for slice #1 of 1)!
#> Faiss encoding complete
#> Indexing complete, Time elapsed 1143.59 seconds
Then I renamed ivfpq.100.faiss to ivfpq.faiss, otherwise the codebase crashes.
I then tried to run some experiments with the following code:
from pyterrier_colbert.ranking import ColBERTFactory

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
pytcolbert = ColBERTFactory(checkpoint, "./robust", "my_index/")
dense_e2e = pytcolbert.end_to_end()
res = pt.Experiment(
    [BM25, dense_e2e],
    queries,
    qrels,
    eval_metrics=["ndcg_cut_10", "recall"],
    names=["BM25", "Dense ColBERT"],
)
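For context, BM25, queries and qrels are not defined in the snippet above; a sketch of how they might have been obtained (the Terrier index path "./terrier_robust04" and the 'title' topic variant are assumptions):

import pyterrier as pt

dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')

# Hypothetical Terrier index built separately over the same corpus, used only for the BM25 baseline
BM25 = pt.BatchRetrieve("./terrier_robust04", wmodel="BM25")

# Robust04 topics carry title/description/narrative fields; the title field is assumed here
queries = dataset.get_topics('title')
qrels = dataset.get_qrels()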
However, the Dense ColBERT results are as follows.
   name           ndcg_cut_10  R@5       R@10      R@15     R@20      R@30      R@100     R@200     R@500     R@1000
0  BM25           0.434104     0.086331  0.140303  0.18080  0.206941  0.249246  0.405284  0.492415  0.610984  0.689337
1  Dense ColBERT  0.062902     0.008783  0.014293  0.01824  0.020652  0.025027  0.041959  0.053847  0.073959  0.088344
Can you help me with this problem?
Thanks in advance,
Andrea
Hello, the robust04 documents are too long, which is what causes the problem. I changed max_token to 500 for encoding, but my computer has had problems recently and the program cannot run. If you run the code with max_token=500, could you please send me the final ivfpq.faiss?