@joeybodoia Currently, we know that we can create word vectors quickly. There are two more problems for us to tackle: how fast we can run queries using these word vectors, and how accurate the query results are.

In order to measure these quantities, we are going to use two standard datasets described in the following papers:

There are a lot of other datasets as well, so while you're working you should try to keep the dataset code relatively generic so that we can easily plug in more datasets later if we want.
Each dataset contains three pieces of information:

- documents
- queries
- labels indicating which documents should be ranked higher for which queries
For our first runtime task, you will only need the documents/queries information. Then we will add in the label information once we're done with the runtime benchmarks.
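To keep the dataset code generic, one option is a small container type that every dataset gets parsed into. The sketch below is only a suggestion; the `Dataset` class, the `load_tsv` helper, and the file names it expects are hypothetical and will need to be adapted to however each dataset actually stores its documents, queries, and labels.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Dataset:
    """Generic container for one benchmark dataset."""
    name: str
    documents: dict                                # doc_id -> document text
    queries: dict                                  # query_id -> query text
    labels: dict = field(default_factory=dict)     # labels[query_id][doc_id] = relevance

def load_tsv(path: Path) -> dict:
    """Read a two-column (id, text) tab-separated file into a dict.

    The TSV layout is an assumption; each dataset's real format may differ.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter="\t") if len(row) >= 2}

def load_dataset(name: str, root: Path = Path("bench/data")) -> Dataset:
    """Load the documents and queries for bench/data/$DATASETNAME."""
    path = root / name
    return Dataset(
        name=name,
        documents=load_tsv(path / "documents.tsv"),   # placeholder file name
        queries=load_tsv(path / "queries.tsv"),       # placeholder file name
    )
```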
Our goal is to create Dockerfile and docker-compose.yml files that, when run, will:

- start up a postgresql server
- load all of the documents into the server
- run a python program that runs all of the queries (with various different settings for the word vectors); at first we will only measure runtime, but eventually we will also check for accuracy
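As a very rough sketch, the docker-compose.yml could look something like the following. The service names, images, paths, and command are all placeholders rather than a worked-out setup; the main point is the shape: one container for postgres (built from the bench Dockerfile) and one for the python benchmark script.

```yaml
# Hypothetical sketch -- service names, images, and paths are placeholders.
version: "3"
services:
  db:
    build: .                    # the bench Dockerfile (postgres + chajda)
    environment:
      POSTGRES_PASSWORD: pass
    volumes:
      - ./data:/data            # dataset files, so the COPY step can reach them
  bench:
    image: python:3.9           # placeholder; a custom image would install psycopg2 etc.
    depends_on:
      - db
    volumes:
      - .:/bench
    command: python3 /bench/run_queries.py   # hypothetical benchmark script
```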
Concrete steps to take to make this happen:
1. Create a new folder called bench and put all the needed docker/python files in this folder.
2. Download the datafiles for the two papers and make sure you know how to extract each of the three types of information from each dataset. Put the data in the folder bench/data/$DATASETNAME.
3. Create a file bench/schema.sql that creates a simple table and an index on that table; it should look something like:
CREATE EXTENSION plpython3u;
CREATE EXTENSION chajda;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL
);

-- full-text index over the document content
CREATE INDEX ON documents USING gin(chajda_tsquery('en', content));
You won't have to do anything too fancy with sql, but we will be using it quite a bit for this part of the project. I can definitely provide lots of guidance here.
4. Modify the Dockerfile so that the schema.sql file is loaded by postgres. You can use the existing Dockerfile in the project root as a guide. The psql command is how you send SQL files to the postgres server to be executed.
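If the bench Dockerfile is based on the official postgres image, the simplest route is usually to copy schema.sql into the image's init directory, which the postgres entrypoint runs through psql the first time the database is initialized. Whether that matches how the existing project Dockerfile is set up is an assumption worth checking:

```dockerfile
# Hypothetical addition to bench/Dockerfile, assuming a postgres base image:
# the official postgres entrypoint feeds every *.sql file in this directory
# to psql when the database cluster is first created.
COPY schema.sql /docker-entrypoint-initdb.d/schema.sql
```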
5. Use postgresql's COPY command to load the documents into the table (again, this will be done with the psql command).
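For example, assuming the documents have first been exported to a CSV file (the path and column layout below are placeholders), the load from psql might look roughly like:

```sql
-- Hypothetical load step; the file path and CSV layout are placeholders.
-- \copy is psql's client-side form of COPY, so the path is resolved on the
-- machine running psql; a plain COPY would need a path visible to the server.
\copy documents (title, content) FROM 'data/DATASETNAME/documents.csv' WITH (FORMAT csv)
```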
6. Create a python file that connects to postgres and issues SELECT queries to get the results. (I think it'll be a while before you get to this step, so I'll provide more details once you get here.)
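When you do get to this step, the runner could be structured roughly like the sketch below. It uses psycopg2 plus Python's perf_counter to record a per-query runtime; the connection settings and the SQL text are placeholders, since the exact way chajda_tsquery gets used in the WHERE clause (and the different word-vector settings) still needs to be worked out.

```python
import time
import psycopg2

# Connection settings are placeholders; they should match docker-compose.yml.
conn = psycopg2.connect(host="db", user="postgres", password="pass", dbname="postgres")

# Stand-in query so the timing loop has something to run; the real SQL will use
# chajda_tsquery with whatever word-vector settings we're benchmarking.
QUERY_SQL = "SELECT id, title FROM documents LIMIT 10"

def run_queries(queries):
    """Run every query and return a list of (query_id, seconds) timings."""
    timings = []
    with conn.cursor() as cur:
        for query_id, query_text in queries.items():
            start = time.perf_counter()
            cur.execute(QUERY_SQL)   # eventually parameterized by query_text
            cur.fetchall()           # force all result rows to be transferred
            timings.append((query_id, time.perf_counter() - start))
    return timings
```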