Some queries in sessions don't appear in MS-Marco train

microsoft / MSMARCO-Conversational-Search

Truly Conversational Search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuous desire to study how people currently converse with search engines. Traditionally, efforts to produce such a comprehensive dataset have been limited because those who hold this data (Search Engines) have a responsibility to their users to maintain their privacy and cannot share the data publicly in a way that upholds the trust users have in the Search Engines. Given these two powerful forces, we believe we have a dataset and paradigm that meets both sets of needs: an artificial public dataset that approximates the true data, and an ability to evaluate model performance on real user behavior. What this means is that we released a public dataset generated by creating artificial sessions using embedding similarity, and we will test on the original data. To say this again: we are not releasing any private user data but are releasing what we believe to be a good representation of true user interactions.

MIT License


Hi, I want to retrieve the query ID for each query in each session. I tried an exact-matching algorithm, and it fails to retrieve 9M queries, which correspond to 16k unique queries and 8M sessions.

I looked into the queries that did not match. Some were truncated and are therefore fairly easy to recover; the rest of those I checked were not present at all. I searched for the missing queries case-insensitively, so differences in capitalization are not the cause.

Should this be classified as a bug or a feature?

Here is code similar to what I used:

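from collections import defaultdict

sessions_path = ""  # TSV file: session id, then one query per column
queries_path = ""   # TSV file: qid, then query text

# Map each query (as a tuple of words) to its qid; unknown queries map to -1.
q_id = defaultdict(lambda: -1)
with open(queries_path, "r") as f:
    for line in f:
        qid, query = line.strip().split("\t")
        words = tuple(query.split(" "))
        q_id[words] = qid

sess_ids = {}  # session id -> list of qids (-1 where no match was found)
# nb_fails counts unmatched queries; sess_fails counts sessions with at least one.
nb_fails, sess_fails = 0, 0
with open(sessions_path, "r") as f, open("output", "w") as wri:
    for line in f:
        sess_id, *queries = line.strip().split("\t")
        qids = []
        sess_fail = False
        for query in queries:
            q = tuple(query.split(" "))
            qid = q_id[q]
            qids.append(qid)
            if qid == -1:
                # Log the unmatched query text for later inspection.
                wri.write(f"{query}\n")
                nb_fails += 1
                sess_fail = True
        sess_ids[sess_id] = qids
        # Count the session only if at least one of its queries failed to match.
        if sess_fail:
            sess_fails += 1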
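
For the truncated queries, a more lenient lookup might help. The following is only a minimal sketch, assuming that a truncated session query is a prefix of exactly one full query in the queries file, which I have not verified on the whole dataset. The helper names `normalize`, `build_index`, and `find_qid` and the toy data are hypothetical and not part of the dataset or its tooling.

```python
from bisect import bisect_left


def normalize(query):
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(query.lower().split())


def build_index(qid_by_query):
    """Return (dict of normalized query -> qid, sorted list of normalized queries)."""
    exact = {normalize(q): qid for q, qid in qid_by_query.items()}
    return exact, sorted(exact)


def find_qid(query, exact, sorted_queries):
    """Try an exact normalized match, then a unique-prefix match for truncated queries."""
    key = normalize(query)
    if key in exact:
        return exact[key]
    # All queries that start with `key` are contiguous in the sorted list,
    # beginning at the insertion point of `key`, so two candidates are
    # enough to tell whether the prefix match is unique.
    start = bisect_left(sorted_queries, key)
    matches = [c for c in sorted_queries[start:start + 2] if c.startswith(key)]
    # Accept only an unambiguous prefix match; otherwise report a miss.
    return exact[matches[0]] if len(matches) == 1 else -1


# Toy data for illustration only, not taken from the dataset.
toy_queries = {"what is a neural network": "101", "How tall is Mount Everest": "102"}
exact, sorted_queries = build_index(toy_queries)
```

With this index, `find_qid("what is a neural net", exact, sorted_queries)` resolves to the qid of the single query it is a prefix of, while a query that is absent or matches several candidates still yields -1, so queries that are truly missing remain visible.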

Some queries in sessions don't appear in MS-Marco train #3