microsoft / MSMARCO-Conversational-Search

Truly Conversational Search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuous desire to study how people currently converse with search engines. Traditionally, the desire to produce such a comprehensive dataset has been limited because those who have this data (search engines) have a responsibility to their users to maintain their privacy and cannot share the data publicly in a way that upholds the trust users have in them. Given these two powerful forces, we believe we have a dataset and paradigm that meets both sets of needs: an artificial public dataset that approximates the true data, and an ability to evaluate model performance on real user behavior. What this means is that we released a public dataset generated by creating artificial sessions using embedding similarity, and we will test on the original data. To say this again: we are not releasing any private user data, but we are releasing what we believe to be a good representation of true user interactions.
https://microsoft.github.io/MSMARCO-Conversational-Search/
MIT License
107 stars 21 forks

Scripts provided don't work #1

Closed Ricocotam closed 5 years ago

Ricocotam commented 5 years ago

Hi, I tried to use the provided scripts to download and generate the data, but the README doesn't match the Python files.

README :

python generateQuerySets.py <msmarco train queries> <msmarco dev queries> <msmarco eval queries> <quoraQueries> <NQFolder>

Python File :

python generateQuerysets.py <msmarco train queries> <msmarco dev queries> <msmarco eval queries> <quoraQueries> <NQFolder> <unused msmarco> <sessions>

I assumed I had to use the "generateArtificialSessions.py" file, but it needs embeddings, which are only available once the first script has been run. This looks like a circular dependency.

spacemanidol commented 5 years ago

We released the code mainly so people could see how we generated the sessions. For privacy reasons we will not be able to release the sessions data or the unused sessions. GenerateQuerySets.py just concatenates all the query sets mentioned, to give a larger semantic space onto which the real user sessions are mapped.
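
For readers who only want to see what that concatenation step amounts to, here is a minimal sketch. It is not the repository's script: the function names `read_queries` and `concat_query_sets` are hypothetical, and it assumes each query set is a tab-separated text file with the query text in the last column, which may not match the actual file layouts.

```python
import csv


def read_queries(path):
    """Yield query strings from a TSV file.

    Assumption: the query text is the last tab-separated column on each line.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip empty lines and rows whose query column is blank.
            if row and row[-1].strip():
                yield row[-1].strip()


def concat_query_sets(paths):
    """Concatenate the queries from several files into one list, in file order."""
    merged = []
    for path in paths:
        merged.extend(read_queries(path))
    return merged
```

Calling `concat_query_sets` with a list of query file paths (for example, the train, dev, and eval query files) would return a single list of query strings that could then be embedded as one pool.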

Ricocotam commented 5 years ago

Alright thanks, it wasn't clear