nnistelrooij / Information-Retrieval

Repository for research project of the 2019 Information Retrieval course at Radboud University in Nijmegen.

Tokenization and Normalization of Queries #12

Closed nnistelrooij closed 4 years ago

nnistelrooij commented 4 years ago

We want to get the appropriate MAP/NDCG scores for the Robust04 data set. One part of this is tokenizing and normalizing the queries; think of lowercasing the terms, stripping punctuation, and removing stopwords.

The queries can be found here and look like:

| queryid | query |
| --- | --- |
| 301 | International Organized Crime |
| 302 | Poliomyelitis and Post-Polio |
| ... | ... |
| 700 | gasoline tax U.S. |

I would prefer the output to be something like the table below, where len is the number of terms in the normalized query (a sketch of one possible normalization follows the table):

| queryid | term | len |
| --- | --- | --- |
| 301 | international | 3 |
| 301 | organized | 3 |
| 301 | crime | 3 |
| 302 | poliomyelitis | 3 |
| 302 | post | 3 |
| 302 | polio | 3 |
| 303 | ... | ... |
| ... | ... | ... |
| 700 | us | ... |
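For concreteness, here is a minimal sketch of a normalization that would produce the table above; the exact tokenization rules and the stopword list are assumptions inferred from the examples (e.g. "U.S." becomes "us" and "and" is dropped):

```python
import re

STOPWORDS = {"and", "the", "of", "in"}  # assumed stopword list

def tokenize_queries(queries):
    """Turn (queryid, query) pairs into (queryid, term, len) rows,
    where len is the number of terms in the normalized query."""
    rows = []
    for queryid, query in queries:
        normalized = query.lower().replace(".", "")   # "U.S." -> "us"
        terms = [t for t in re.split(r"[^a-z0-9]+", normalized)
                 if t and t not in STOPWORDS]
        rows.extend((queryid, term, len(terms)) for term in terms)
    return rows

tokenize_queries([(302, "Poliomyelitis and Post-Polio")])
# [(302, 'poliomyelitis', 3), (302, 'post', 3), (302, 'polio', 3)]
```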
rusane commented 4 years ago

Why do we want to tokenize each query like that? Is there any reason/advantage to storing them that way compared to this (normalized but not split into one term per row):

| queryid | term |
| --- | --- |
| 301 | international organized crime |
| 302 | poliomyelitis post polio |
| ... | ... |
nnistelrooij commented 4 years ago

Right now, the retriever first initializes the search query as a database table called query, which looks like this:

| term |
| --- |
| new |
| york |

Then it retrieves a document ranking with the terms in this table.
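To illustrate the flow, a sketch with sqlite3; the query table name comes from the comment above, but the postings schema and the tf-sum scoring are assumptions, not the actual retrieval model:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The search query as a single-column table called `query`.
con.execute('CREATE TABLE "query" (term TEXT)')
con.executemany('INSERT INTO "query" (term) VALUES (?)',
                [("new",), ("york",)])

# Hypothetical postings table; the real index schema may differ.
con.execute("CREATE TABLE postings (term TEXT, docid INTEGER, tf REAL)")

# Retrieve a document ranking with the terms in the query table.
ranking = con.execute("""
    SELECT p.docid, SUM(p.tf) AS score
    FROM "query" q JOIN postings p ON p.term = q.term
    GROUP BY p.docid
    ORDER BY score DESC
""").fetchall()
```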

Ultimately, we want one big table that holds a document ranking for each of the Robust04 search queries, which means running the retriever for all of them with one huge SQL query. Such a table will look something like the one below, where rel is the relevance of the document given the search query.

| queryid | docid | score | rel |
| --- | --- | --- | --- |
| 301 | 8383 | 8.413 | 1 |
| 301 | 1254 | 8.21 | 0 |
| ... | ... | ... | ... |
| 302 | 5432 | 5.31 | 0 |
| 302 | 9183 | 5.209 | 1 |
| ... | ... | ... | ... |

Because the output gains a queryid dimension, the input needs that extra queryid dimension as well. More pragmatically, it helps if the input used to build this table has the same format as before, so that I do not have to rewrite the Retriever.retrieve() function to work with the new table format.
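Continuing the sqlite3 sketch above, the batched version would give the input table a queryid column too, so that one SQL statement produces rankings for every query at once (again with an assumed schema and scoring):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE postings (term TEXT, docid INTEGER, tf REAL)")

# Batched input: the tokenized Robust04 queries with their queryid.
con.execute("CREATE TABLE queries (queryid INTEGER, term TEXT)")
con.executemany("INSERT INTO queries (queryid, term) VALUES (?, ?)",
                [(301, "international"), (301, "organized"), (301, "crime")])

# One SQL query that ranks documents per queryid.
rankings = con.execute("""
    SELECT q.queryid, p.docid, SUM(p.tf) AS score
    FROM queries q JOIN postings p ON p.term = q.term
    GROUP BY q.queryid, p.docid
    ORDER BY q.queryid, score DESC
""").fetchall()
```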

rusane commented 4 years ago

Ah, I see. Do you think it is easier to add the relevance judgments during query execution (or maybe I understood it incorrectly)? We could also add them after the ranked list is retrieved (see my comment). In a real search system, the relevances and query IDs are probably not present in the database/index, so I thought it would be more realistic to add them outside the system (if possible) for the evaluation.
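For comparison, adding the judgments outside the system would amount to joining the ranked list against the qrels on (queryid, docid); a sketch with pandas, where the qrels column names are assumed:

```python
import pandas as pd

# Ranked list as retrieved (without relevances).
ranking = pd.DataFrame({"queryid": [301, 301],
                        "docid":   [8383, 1254],
                        "score":   [8.413, 8.21]})

# Robust04 relevance judgments; these column names are assumptions.
qrels = pd.DataFrame({"queryid": [301, 301],
                      "docid":   [8383, 1254],
                      "rel":     [1, 0]})

# Left join keeps every retrieved document; unjudged ones get rel = 0.
ranking = ranking.merge(qrels, on=["queryid", "docid"], how="left")
ranking["rel"] = ranking["rel"].fillna(0).astype(int)
```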

nnistelrooij commented 4 years ago

Ad hoc retrieval will be done with the Retriever.retrieve() function and Robust04 retrieval will be done with the Retriever.retrieve_all() function, which I am still working on. So yes, the relevance judgments will be added in the Retriever.retrieve_all() function to get the document rankings for all queries alongside their relevances.

Retriever.retrieve() will be used to evaluate ad hoc retrieval, whereas Retriever.retrieve_all() will be used to evaluate the performance of the retrieval model(s) on the Robust04 data set.

nnistelrooij commented 4 years ago

I've implemented the Retriever.retrieve_all() function; see Pull Request #14. Hopefully you can now implement the evaluation metrics based on the Pandas DataFrame output of that function. The private methods are even more of a mess, but that complexity was necessary to support all the combinations of the options, so just stick to Retriever.retrieve() and Retriever.retrieve_all().

rusane commented 4 years ago

That's great, I'll give it a try tomorrow or early next week. I already implemented the evaluation metrics in Python (#13); I'll just have to extract the relevant columns and rows from the Pandas DataFrame as input for those metrics (see the sketch below).
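As a rough sketch of that glue code (not the implementation from #13), the metrics could consume the retrieve_all() DataFrame like this, assuming the (queryid, docid, score, rel) columns shown earlier:

```python
import numpy as np
import pandas as pd

def average_precision(rel):
    """AP of one ranked list, given binary relevances in rank order."""
    rel = np.asarray(rel)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def ndcg(rel):
    """NDCG of one ranked list, given relevances in rank order."""
    rel = np.asarray(rel, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    idcg = (np.sort(rel)[::-1] * discounts).sum()
    return float((rel * discounts).sum() / idcg) if idcg > 0 else 0.0

def evaluate(df):
    """MAP and mean NDCG over a (queryid, docid, score, rel) DataFrame."""
    df = df.sort_values(["queryid", "score"], ascending=[True, False])
    per_query = df.groupby("queryid")["rel"]
    return (per_query.apply(average_precision).mean(),
            per_query.apply(ndcg).mean())
```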