microsoft / MSMARCO-Conversational-Search

Truly Conversational Search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuous desire to study how people currently converse with search engines. Traditionally, producing such a comprehensive dataset has been limited because those who have this data (Search Engines) have a responsibility to their users to maintain their privacy and cannot share the data publicly in a way that upholds the trust users have in the Search Engines. Given these two powerful forces, we believe we have a dataset and paradigm that meets both sets of needs: an artificial public dataset that approximates the true data, and an ability to evaluate model performance on the real user behavior. What this means is that we released a public dataset generated by creating artificial sessions using embedding similarity, and we will test on the original data. To say this again: we are not releasing any private user data but are releasing what we believe to be a good representation of true user interactions.
https://microsoft.github.io/MSMARCO-Conversational-Search/
MIT License
107 stars 21 forks

Some queries in sessions don't appear in MS-Marco train #3

Closed Ricocotam closed 5 years ago

Ricocotam commented 5 years ago

Hi, I would like to retrieve the query IDs for each query in each session. I tried an exact-matching algorithm, and it fails to retrieve 9M queries, which corresponds to 16k unique queries and 8M sessions.

I looked at some of the queries that didn't match. Some were truncated and are thus quite easy to retrieve; the rest of those I tested were simply not present at all. When I searched for the missing queries I ignored case, so lowercase versus capital letters is not the problem.
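The truncated case can be recovered with a prefix lookup over the known queries. The following is a minimal sketch under stated assumptions: the helper `find_by_prefix` and the sample queries are hypothetical and only illustrate the idea, and lowercasing is the only normalization applied.

```python
from bisect import bisect_left


def find_by_prefix(sorted_queries, truncated):
    """Return every known query that starts with the truncated text.

    sorted_queries must be a sorted list of lowercased queries, so that all
    strings sharing a prefix sit next to each other.
    """
    prefix = truncated.lower()
    start = bisect_left(sorted_queries, prefix)
    matches = []
    for candidate in sorted_queries[start:]:
        if not candidate.startswith(prefix):
            break  # past the contiguous block of matches
        matches.append(candidate)
    return matches


# Hypothetical sample data, not taken from the dataset.
known = sorted(q.lower() for q in ["What is a dog", "what is a doge coin", "weather today"])
print(find_by_prefix(known, "What is a do"))
```

A truncated query that matches several full queries stays ambiguous, so the sketch returns all candidates rather than picking one.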

Should this be classified as a bug or a feature?

Here is code similar to what I used:

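from collections import defaultdict

sessions_path = ""  # path to the sessions file (tab-separated)
queries_path = ""   # path to the queries file (tab-separated: qid, query)

# Map each query (as a tuple of words) to its qid; unknown queries map to -1.
q_id = defaultdict(lambda: -1)
with open(queries_path, "r") as f:
    for line in f:
        qid, query = line.strip().split("\t")
        words = tuple(query.split(" "))
        q_id[words] = qid

sess_ids = {}  # session id -> list of qids (-1 where no match was found)
nb_fails, nb_sess_fail = 0, 0
with open(sessions_path, "r") as f, open("output", "w") as wri:
    for line in f:
        sess_id, *queries = line.strip().split("\t")
        qids = []
        sess_fail = False
        for query in queries:
            q = tuple(query.split(" "))
            qid = q_id[q]
            qids.append(qid)
            if qid == -1:
                # Log the unmatched query and count the failure.
                wri.write(f"{query}\n")
                nb_fails += 1
                sess_fail = True
        sess_ids[sess_id] = qids
        # Count a session once if at least one of its queries failed to match.
        if sess_fail:
            nb_sess_fail += 1
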
spacemanidol commented 5 years ago

If you are looking for query IDs, I think it's because you are trying to use some aspects of the Passage Ranking dataset in this dataset. For ease of use, please treat them as separate; each may have some normalization that makes joining them difficult.
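To illustrate how normalization can break an exact join, here is a minimal sketch. The specific steps (Unicode NFKC, lowercasing, whitespace collapsing) and the `normalize` helper are assumptions made for illustration; they are not the normalization actually applied to either dataset.

```python
import unicodedata


def normalize(query):
    """Apply one assumed normalization: NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", query)
    return " ".join(text.lower().split())


# Hypothetical (qid, query) pairs standing in for the queries file.
pairs = [("1", "What is  a dog"), ("2", "Weather today")]
query_to_qid = {normalize(query): qid for qid, query in pairs}

# Both sides of the join go through the same function before the lookup.
print(query_to_qid.get(normalize("what is a DOG"), -1))
```

If the two datasets were normalized differently, a lookup like this only helps to the extent that `normalize` undoes those differences.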

Ricocotam commented 5 years ago

The paper says they are not separate, so we should be able to retrieve the IDs. I actually need the query IDs for my work. I have enough of them, but it is strange not to be able to find them all, since the Conversational Search dataset should only use the Passage Ranking dataset's queries (from what I understood).