nnistelrooij / Information-Retrieval

Repository for research project of the 2019 Information Retrieval course at Radboud University in Nijmegen.

Tokenization and Normalization of Queries #12

Closed nnistelrooij closed 4 years ago

nnistelrooij commented 4 years ago

We want to get the appropriate MAP/NDCG scores for the Robust04 data set. One part of this is tokenizing and normalizing the queries; think of lowercasing the terms, stripping punctuation, and removing stopwords.

The queries can be found here and look like:

| queryid | query |
| --- | --- |
| 301 | International Organized Crime |
| 302 | Poliomyelitis and Post-Polio |
| ... | ... |
| 700 | gasoline tax U.S. |

I would prefer the output to be something like the table below, where len is the number of terms in the normalized query (a sketch of one possible normalization follows the table):

| queryid | term | len |
| --- | --- | --- |
| 301 | international | 3 |
| 301 | organized | 3 |
| 301 | crime | 3 |
| 302 | poliomyelitis | 3 |
| 302 | post | 3 |
| 302 | polio | 3 |
| 303 | ... | ... |
| ... | ... | ... |
| 700 | us | ... |
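For concreteness, here is a minimal sketch of a normalization that would produce the table above; the exact tokenization rules and the stopword list are assumptions inferred from the examples (e.g. "U.S." becomes "us" and "and" is dropped):

```python
import re

STOPWORDS = {"and", "the", "of", "in"}  # assumed stopword list

def tokenize_queries(queries):
    """Turn (queryid, query) pairs into (queryid, term, len) rows,
    where len is the number of terms in the normalized query."""
    rows = []
    for queryid, query in queries:
        normalized = query.lower().replace(".", "")   # "U.S." -> "us"
        terms = [t for t in re.split(r"[^a-z0-9]+", normalized)
                 if t and t not in STOPWORDS]
        rows.extend((queryid, term, len(terms)) for term in terms)
    return rows

tokenize_queries([(302, "Poliomyelitis and Post-Polio")])
# [(302, 'poliomyelitis', 3), (302, 'post', 3), (302, 'polio', 3)]
```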
rusane commented 4 years ago

Why do we want to tokenize each query like that? Is there any reason/advantage to storing them that way compared to this (normalized but not split into one term per row):

| queryid | term |
| --- | --- |
| 301 | international organized crime |
| 302 | poliomyelitis post polio |
| ... | ... |
nnistelrooij commented 4 years ago

Right now, the retriever first initializes the search query as a database table called query, which looks like this:

| term |
| --- |
| new |
| york |

Then it retrieves a document ranking with the terms in this table.
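To illustrate the flow, a sketch with sqlite3; the query table name comes from the comment above, but the postings schema and the tf-sum scoring are assumptions, not the actual retrieval model:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The search query as a single-column table called `query`.
con.execute('CREATE TABLE "query" (term TEXT)')
con.executemany('INSERT INTO "query" (term) VALUES (?)',
                [("new",), ("york",)])

# Hypothetical postings table; the real index schema may differ.
con.execute("CREATE TABLE postings (term TEXT, docid INTEGER, tf REAL)")

# Retrieve a document ranking with the terms in the query table.
ranking = con.execute("""
    SELECT p.docid, SUM(p.tf) AS score
    FROM "query" q JOIN postings p ON p.term = q.term
    GROUP BY p.docid
    ORDER BY score DESC
""").fetchall()
```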

Ultimately, we want one big table that holds a document ranking for each of the Robust04 search queries, which means running the retriever for all of them with one huge SQL query. Such a table will look something like the one below, where rel is the relevance of the document given the search query.

| queryid | docid | score | rel |
| --- | --- | --- | --- |
| 301 | 8383 | 8.413 | 1 |
| 301 | 1254 | 8.21 | 0 |
| ... | ... | ... | ... |
| 302 | 5432 | 5.31 | 0 |
| 302 | 9183 | 5.209 | 1 |
| ... | ... | ... | ... |

Because the output gains a queryid dimension, the input needs that extra queryid dimension as well. More pragmatically, it helps if the input used to build this table has the same format as before, so that I do not have to rewrite the Retriever.retrieve() function to work with the new table format.
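Continuing the sqlite3 sketch above, the batched version would give the input table a queryid column too, so that one SQL statement produces rankings for every query at once (again with an assumed schema and scoring):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE postings (term TEXT, docid INTEGER, tf REAL)")

# Batched input: the tokenized Robust04 queries with their queryid.
con.execute("CREATE TABLE queries (queryid INTEGER, term TEXT)")
con.executemany("INSERT INTO queries (queryid, term) VALUES (?, ?)",
                [(301, "international"), (301, "organized"), (301, "crime")])

# One SQL query that ranks documents per queryid.
rankings = con.execute("""
    SELECT q.queryid, p.docid, SUM(p.tf) AS score
    FROM queries q JOIN postings p ON p.term = q.term
    GROUP BY q.queryid, p.docid
    ORDER BY q.queryid, score DESC
""").fetchall()
```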

rusane commented 4 years ago

Ah, I see. Do you think it is easier to add the relevance judgments during query execution (or maybe I understood it incorrectly)? We could also add them after the ranked list is retrieved (see my comment). In a real search system, the relevances and query IDs are probably not present in the database/index, so I thought it would be more realistic to add them outside the system (if possible) for the evaluation.
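For comparison, adding the judgments outside the system would amount to joining the ranked list against the qrels on (queryid, docid); a sketch with pandas, where the qrels column names are assumed:

```python
import pandas as pd

# Ranked list as retrieved (without relevances).
ranking = pd.DataFrame({"queryid": [301, 301],
                        "docid":   [8383, 1254],
                        "score":   [8.413, 8.21]})

# Robust04 relevance judgments; these column names are assumptions.
qrels = pd.DataFrame({"queryid": [301, 301],
                      "docid":   [8383, 1254],
                      "rel":     [1, 0]})

# Left join keeps every retrieved document; unjudged ones get rel = 0.
ranking = ranking.merge(qrels, on=["queryid", "docid"], how="left")
ranking["rel"] = ranking["rel"].fillna(0).astype(int)
```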

nnistelrooij commented 4 years ago

Ad hoc retrieval will be done with the Retriever.retrieve() function and Robust04 retrieval will be done with the Retriever.retrieve_all() function, which I am still working on. So yes, the relevance judgments will be added in the Retriever.retrieve_all() function to get the document rankings for all queries alongside their relevances.

Retriever.retrieve() will be used to evaluate ad hoc retrieval, whereas Retriever.retrieve_all() will be used to evaluate the performance of the retrieval model(s) on the Robust04 data set.

nnistelrooij commented 4 years ago

I've implemented the Retriever.retrieve_all() function; see Pull Request #14. Hopefully you can now implement the evaluation metrics based on the Pandas DataFrame output of that function. The private methods are even more of a mess, but that complexity was necessary to support all the combinations of the options, so just stick to Retriever.retrieve() and Retriever.retrieve_all().

rusane commented 4 years ago

That's great, I'll give it a try tomorrow or early next week. I already implemented the evaluation metrics in Python (#13); I'll just have to extract the relevant columns and rows from the Pandas DataFrame as input for those metrics (see the sketch below).
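As a rough sketch of that glue code (not the implementation from #13), the metrics could consume the retrieve_all() DataFrame like this, assuming the (queryid, docid, score, rel) columns shown earlier:

```python
import numpy as np
import pandas as pd

def average_precision(rel):
    """AP of one ranked list, given binary relevances in rank order."""
    rel = np.asarray(rel)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def ndcg(rel):
    """NDCG of one ranked list, given relevances in rank order."""
    rel = np.asarray(rel, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    idcg = (np.sort(rel)[::-1] * discounts).sum()
    return float((rel * discounts).sum() / idcg) if idcg > 0 else 0.0

def evaluate(df):
    """MAP and mean NDCG over a (queryid, docid, score, rel) DataFrame."""
    df = df.sort_values(["queryid", "score"], ascending=[True, False])
    per_query = df.groupby("queryid")["rel"]
    return (per_query.apply(average_precision).mean(),
            per_query.apply(ndcg).mean())
```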