Truly Conversational Search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuous desire to study how people currently converse with search engines. Traditionally, producing such a comprehensive dataset has been difficult because those who hold this data (search engines) have a responsibility to maintain their users' privacy and cannot share the data publicly in a way that upholds the trust users place in them. Given these two powerful forces, we believe we have a dataset and paradigm that meets both sets of needs: an artificial public dataset that approximates the true data, and the ability to evaluate model performance on real user behavior. In practice, we have released a public dataset of artificial sessions generated using embedding similarity, and models will be tested on the original data. To say this again: we are not releasing any private user data, but we are releasing what we believe to be a good representation of true user interactions.
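As a rough illustration of what "artificial sessions using embedding similarity" could look like, here is a minimal sketch. It is not the code from this repository: the function names, the toy three-dimensional vectors, and the choice of cosine similarity with a nearest-neighbour lookup are all assumptions made for the example. Each query in a private session is replaced by the closest query from a public set, so only public queries appear in the output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_public_query(private_vec, public_embeddings):
    """Return the public query whose embedding is most similar to private_vec."""
    return max(public_embeddings,
               key=lambda q: cosine(private_vec, public_embeddings[q]))

def make_artificial_session(private_session_vecs, public_embeddings):
    """Map each private query embedding to its nearest public query."""
    return [nearest_public_query(v, public_embeddings)
            for v in private_session_vecs]

# Toy data (hypothetical): real embeddings would have hundreds of dimensions.
public_embeddings = {
    "weather seattle": [0.9, 0.1, 0.0],
    "seattle forecast weekend": [0.8, 0.3, 0.1],
    "python list sort": [0.0, 0.2, 0.9],
}
# Embeddings of two queries from one private session; the text is never used.
private_session = [[1.0, 0.0, 0.0], [0.7, 0.4, 0.1]]

artificial_session = make_artificial_session(private_session, public_embeddings)
print(artificial_session)
```

A brute-force scan like this is only practical for small public sets; a real pipeline would most likely need an approximate nearest-neighbour index.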
We released the code mainly so people could see how we generated the sessions. For privacy reasons we will not be able to release the sessions data and the unused sessions. GenerateQuerySets.py just concatenates all the query sets mentioned, to give a larger semantic space onto which the real user sessions are mapped.
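For readers who want a feel for the concatenation step, a minimal sketch might look like the following. This is not the contents of GenerateQuerySets.py: the function name is hypothetical, and the lowercasing, whitespace normalization, and de-duplication are assumptions added for the example (the reply above only says the sets are concatenated).

```python
def merge_query_sets(*query_sets):
    """Concatenate several query collections into one list.

    Assumption for this sketch: queries are lowercased, whitespace is
    collapsed, and duplicates and empty strings are dropped. Order of
    first appearance is preserved.
    """
    seen = set()
    merged = []
    for query_set in query_sets:
        for query in query_set:
            normalized = " ".join(query.lower().split())
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(normalized)
    return merged

# Toy inputs (hypothetical); the real query sets are much larger.
set_a = ["Weather Seattle", "python list sort"]
set_b = ["weather  seattle", "best pizza near me", ""]

merged = merge_query_sets(set_a, set_b)
print(merged)
```

The merged list would then be embedded once, giving the public query space against which sessions are matched.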
Hi, I tried to use the provided scripts to download and generate data, but the README doesn't match the Python files.
README :
Python File :
I assumed I had to use the "generateArtificialSessions.py" file, but it needs embeddings that are only available once the first file has been executed. There seems to be a circular dependency.