pisa-engine / pisa

PISA: Performant Indexes and Search for Academia
https://pisa-engine.github.io/pisa/book
Apache License 2.0

Document query format #245

Closed elshize closed 4 years ago

elshize commented 5 years ago

The input query format should be documented in the docs.

ansariyusuf commented 5 years ago

I want to convert TREC 2013 queries into the PISA query format. As suggested, I tried to use extract_topics to convert the TREC queries, but I am getting the following error: terminate called after throwing an instance of 'std::runtime_error' what(): Could not consume tag: Aborted (core dumped)

I have tried passing a text file containing TREC 2013 queries (format: :query). I also passed the TREC 2013 topics (XML file) as input to ./extract_topics, but I got the same error message. Can you help me with this issue?

elshize commented 5 years ago

a text file containing TREC 2013 queries (format: :query)

Can you show one line from this file?

Queries of this type should go directly to the queries or evaluate_queries programs. extract_topics is designed to be used with files of this type: https://trec.nist.gov/data/terabyte/04/04topics.701-750.txt

Can you attach your XML file (or a snippet)?

ansariyusuf commented 5 years ago

Here is a snippet of my text file (https://trec.nist.gov/data/web/2013/web2013.topics.txt): 201:raspberry pi
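
The sample line above has the layout id:query. As a minimal sketch of how such a line can be split, assuming the ID is everything before the first colon (`parse_query_line` is a hypothetical helper for illustration, not part of PISA):

```python
from typing import Tuple


def parse_query_line(line: str) -> Tuple[str, str]:
    """Split an 'id:query' line at the first colon into (id, query text)."""
    query_id, sep, text = line.strip().partition(":")
    if not sep:
        # No colon found, so the line does not follow the id:query layout.
        raise ValueError(f"missing ':' separator in line: {line!r}")
    return query_id, text.strip()


print(parse_query_line("201:raspberry pi"))  # ('201', 'raspberry pi')
```

Splitting at the first colon only means that any later colon stays in the query text.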

Here is a snippet of the XML file: https://trec.nist.gov/data/web/2013/trec2013-topics.xml

<webtrack2013>
<!-- Please note that topic and subtopic types (faceted/ambiguous,
     inf/nav are meant as a general indicator and should not be taken
     as definitive aspects of the query intent. -->

<!-- Note that the first subtopic is always identical to the description
     sentence.  This is to ensure that adhoc-task results are also relevant
     to the subtopic task. -->

<topic number="201" type="faceted">
  <query>raspberry pi</query>
  <description>
    What is a raspberry pi?
  </description>
  <subtopic number="1" type="inf">
    What is a raspberry pi?
  </subtopic>
  <subtopic number="2" type="inf">
    What software does a raspberry pi use?
  </subtopic>
  <subtopic number="3" type="inf">
    What are hardware options for a raspberry pi?
  </subtopic>
  <subtopic number="4" type="nav">
    How much does a basic raspberry pi cost?
  </subtopic>
  <subtopic number="5" type="inf">
    Find info about the raspberry pi foundation.
  </subtopic>
  <subtopic number="6" type="nav">
    Find a picture of a raspberry pi.
  </subtopic>
</topic>
</webtrack2013>
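
For reference, a topics file shaped like the XML above can be flattened into id:query lines with Python's standard library. This is a sketch under assumptions, not a PISA tool: `topics_to_query_lines` is a hypothetical helper, and it uses only the `number` attribute and the `<query>` element, ignoring descriptions and subtopics.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the topic shown above, used as sample input.
SAMPLE = """<webtrack2013>
<topic number="201" type="faceted">
  <query>raspberry pi</query>
  <description>
    What is a raspberry pi?
  </description>
</topic>
</webtrack2013>"""


def topics_to_query_lines(xml_text: str) -> list:
    """Return one 'number:query' string per <topic> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for topic in root.iter("topic"):
        number = topic.get("number")
        # findtext returns None when <query> is absent; fall back to "".
        query = (topic.findtext("query") or "").strip()
        lines.append(f"{number}:{query}")
    return lines


print("\n".join(topics_to_query_lines(SAMPLE)))  # 201:raspberry pi
```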

I tried passing the text file directly to the ./queries program, but I get a bunch of warnings: image

elshize commented 5 years ago

The parse_collection program should produce a *.termlex file. You need to pass this file to the --terms argument when calling queries.
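
To illustrate why a term lexicon matters for textual queries: a lexicon maps each term string to an integer ID, and a query such as "raspberry pi" has to be translated into those IDs before it can be matched against the index. The sketch below is a toy illustration with an in-memory dictionary. It does not read an actual *.termlex file (its on-disk format is not described in this thread), and `resolve_terms` and the IDs are hypothetical.

```python
def resolve_terms(query: str, lexicon: dict) -> list:
    """Map each whitespace-separated term to its ID, skipping unknown terms."""
    ids = []
    for term in query.lower().split():
        if term in lexicon:
            ids.append(lexicon[term])
    return ids


# Hypothetical lexicon: the term strings and IDs are made up for illustration.
toy_lexicon = {"pi": 0, "raspberry": 1}

print(resolve_terms("raspberry pi", toy_lexicon))  # [1, 0]
print(resolve_terms("raspberry pi", {}))           # []
```

With an empty lexicon no term can be resolved, which is the toy analogue of running textual queries without supplying a term mapping.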

ansariyusuf commented 5 years ago

I did as you asked. Now, the ./queries program is segfaulting: image

elshize commented 5 years ago

Ok, this doesn't seem to be related to parsing queries anymore. Can you compile it in Debug and post the stack trace from gdb?

ansariyusuf commented 5 years ago

I compiled with the -g flag and then ran gdb. From gdb, I ran ./queries and got the following output: image image image

elshize commented 5 years ago

There are two more tests we can run here that might help us understand what's happening:

  1. Run create_freq_index again, and make sure that you run it with the same codec --- maybe the file is corrupted or you created a different type of index by accident; that would explain what happened.
  2. If (1) doesn't help, then compress the index with another codec, say, block_simdbp, and run the query to see if it fails or not. The problem seems to originate from the Elias-Fano-specific code.

If you could do these two things, that would definitely help out with finding the problem.

ansariyusuf commented 5 years ago

As requested, I ran create_freq_index again, and this time I set the index type to block_optpfor. After creating the index, I ran the queries program. It segfaulted; I have attached the backtrace from gdb: image

I have repeated the entire process with index type pefopt, and when I run ./queries I get a segfault. Please advise me on what I should do to resolve this issue. Thank you!

elshize commented 5 years ago

Ok, one more question before I investigate further: are you on master right now? And if not, then which commit are you on?

ansariyusuf commented 5 years ago

I think I am on master: image

elshize commented 5 years ago

Can you run git status?

ansariyusuf commented 5 years ago

Following is the output of git status: image

elshize commented 5 years ago

I'm running some tests to see if I can reproduce it. I'll get back to you when I get some results.

ansariyusuf commented 5 years ago

If you need the collection that I am trying to parse and run queries on, do let me know. I will share it

JMMackenzie commented 4 years ago

@elshize @ansariyusuf Did either of you end up figuring this one out?

elshize commented 4 years ago

Nope, I never got to the bottom of this. @ansariyusuf, did you ever fix it? What happened with this? I'm sorry I didn't get back to you; I have been busy and this one just slipped my mind. If the problem still persists, let's tackle it!

ansariyusuf commented 4 years ago

I did a fresh installation of PISA on a different machine, followed the steps @elshize described, and it worked.

elshize commented 4 years ago

In that case, let's close this one. Feel free to reopen it if you experience similar problems in the future.