terrierteam / pyterrier_doc2query

Issues with Fetching Queries/Scores from Store #9

Closed basnetsoyuj closed 1 year ago

basnetsoyuj commented 1 year ago

I am just trying to fetch the pre-computed queries and scores.

When I try to run the following:

import pyterrier as pt; pt.init()
from pyterrier_doc2query import Doc2QueryStore

store = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage')
print(store.lookup('100'))

I get the following error:

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/soyuj/improving-learned-index/src/doc2query--/utils.py", line 9
    print(store.lookup('100'))
          ^^^^^^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 60, in lookup
    queries, q_offsets, docnos_lookup = self.payload()
                                        ^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 28, in payload
    self._queries_offsets = np.memmap(self.path/'queries.offsets.u8', mode='r', dtype=np.uint64)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/numpy/core/memmap.py", line 240, in __new__
    raise ValueError("Size of available data is not a "
ValueError: Size of available data is not a multiple of the data-type size.

When I looked into the details, I found the cause: the repo being cloned contains a Git LFS file, and the pointer file is being downloaded instead of the actual data.
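For anyone hitting the same error, one way to confirm this diagnosis is to check whether the downloaded files are still LFS pointers. A pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, and its size is usually not a multiple of 8 bytes, which is why `np.memmap` with `dtype=np.uint64` rejects it. The sketch below is only an illustration: `is_lfs_pointer` and `find_lfs_pointers` are hypothetical helper names, not part of pyterrier_doc2query.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` starts with the Git LFS pointer header.

    Hypothetical helper for illustration; not part of pyterrier_doc2query.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(store_dir):
    """Return the files under `store_dir` that are still LFS pointers, sorted by path."""
    return sorted(
        p for p in Path(store_dir).rglob("*")
        if p.is_file() and is_lfs_pointer(p)
    )
```

Running `find_lfs_pointers` on the cloned store directory should return an empty list once Git LFS has fetched the real files.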

cmacdonald commented 1 year ago

Do you have git-lfs installed?

basnetsoyuj commented 1 year ago

Thank you for the response. Yes, after realizing the cause of the error, I installed Git LFS and it fixed the error:

import subprocess
subprocess.run(["git", "lfs", "install", "--skip-repo"])  # only needs to run once; included for completeness
queries_store = Doc2QueryStore.from_repo(queries_repo)

I raised this issue to point out that the error message does not indicate that Git LFS is required when it is not installed, which could be a source of confusion.

cmacdonald commented 1 year ago

Hi @basnetsoyuj, thanks for the report. We added an assertion to check for git-lfs support.
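A check of this kind could look roughly like the following. This is only a sketch under the assumption that probing `git lfs version` is an acceptable test; it is not the code that was added to pyterrier_doc2query, and `git_lfs_available` and `require_git_lfs` are hypothetical names.

```python
import shutil
import subprocess


def git_lfs_available():
    """Return True if the `git lfs` subcommand can be run on this machine."""
    if shutil.which("git") is None:
        return False
    try:
        # `git lfs version` exits non-zero when the lfs extension is missing.
        result = subprocess.run(
            ["git", "lfs", "version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False
    return result.returncode == 0


def require_git_lfs():
    """Fail early with an actionable message instead of a later np.memmap ValueError."""
    if not git_lfs_available():
        raise RuntimeError(
            "git-lfs is required to clone this repository; "
            "install it and run `git lfs install`."
        )
```

Calling `require_git_lfs()` before cloning would surface the missing dependency directly, rather than through the size error shown in the traceback.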

basnetsoyuj commented 1 year ago

Cool!