mikeizbicki / chajda


Next tasks involving docker #12

Open mikeizbicki opened 3 years ago

mikeizbicki commented 3 years ago

@joeybodoia We now know that we can create word vectors quickly. There are two more problems for us to tackle:

  1. how does changing the queries affect the overall runtime performance of the database?
  2. how does changing the queries affect the quality of the results (precision/recall)?

In order to measure these quantities, we are going to use 2 standard datasets described in the following papers:

  1. Sogou-QCL: https://dl.acm.org/doi/abs/10.1145/3209978.3210092
  2. LETOR 4.0: https://arxiv.org/abs/1306.2597

There are a lot of other datasets as well, so while you're working you should try to make the dataset code relatively generic so that we can easily plug in more datasets later if we want.

Each dataset contains three pieces of information:

  1. documents
  2. queries
  3. labels indicating which documents should be ranked higher for which queries

For our first runtime task, you will only need the documents/queries information. Then we will add in the label information once we're done with the runtime benchmarks.
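One way to keep the dataset code generic is to give every dataset the same small interface. The class and method names below are only a sketch of the idea, not anything that already exists in the repo:

    # rough sketch of a generic dataset interface; all names are placeholders
    class Dataset:
        '''Base class that each benchmark dataset (Sogou-QCL, LETOR 4.0, ...) would subclass.'''

        def documents(self):
            '''Yield (doc_id, title, content) tuples.'''
            raise NotImplementedError

        def queries(self):
            '''Yield (query_id, query_text) tuples.'''
            raise NotImplementedError

        def labels(self):
            '''Yield (query_id, doc_id, relevance) tuples; only needed once we add the accuracy benchmarks.'''
            raise NotImplementedError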

Our goal is to create a Dockerfile and a docker-compose.yml file that, when run, will:

  1. start up a postgresql server
  2. load all of the documents into the server
  3. run a python program that runs all of the queries (with various different settings for the word vectors)
    1. at first we will only measure runtime;
    2. but eventually we will also check for accuracy

Concrete steps to take to make this happen:

  1. You should create a new folder called bench and put all the needed docker/python files in this folder.
  2. Download the data files for the two papers and make sure you know how to extract each type of information from each dataset. Put the data in the folder bench/data/$DATASETNAME.
  3. Create a file bench/schema.sql that will create a simple table and an index on that table; it should look something like
    -- extensions needed for chajda's text search functions
    CREATE EXTENSION plpython3u;
    CREATE EXTENSION chajda;
    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        content TEXT NOT NULL
    );
    -- index the document contents so that search queries can use the index
    CREATE INDEX ON documents USING gin(chajda_tsquery('en',content));

    You won't have to do anything too fancy with sql, but we will be using it quite a bit for this part of the project. I can definitely provide lots of guidance here.

  4. Modify the Dockerfile so that the schema.sql file is loaded by postgres. You can use the existing Dockerfile in the project root as a guide. The psql command is how you send files to the postgres server to get executed.
  5. Use PostgreSQL's COPY command to load the documents into the table (again, this will be done with the psql command)
  6. Create a python file that connects to postgres and issues SELECT queries to get the results. (I think it'll be a while before you get to this step, so I'll provide more details once you get here; a very rough sketch of the idea is included after this list.)
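As a very rough sketch of step 6 (not the final design), the benchmark program might look something like the following. It assumes psycopg2 as the driver, the documents table from bench/schema.sql above, and placeholder connection settings; the WHERE clause uses plain to_tsvector/plainto_tsquery only as a stand-in until we wire in chajda's query functions.

    # rough sketch of the query-runtime benchmark; the connection settings and the
    # query expression are placeholders that will need to match the real setup
    import time
    import psycopg2

    conn = psycopg2.connect(host='localhost', dbname='postgres', user='postgres', password='postgres')

    def run_query(query_text):
        '''Run a single search query and return its wall-clock runtime in seconds.'''
        start = time.perf_counter()
        with conn.cursor() as cur:
            # stand-in query; the real benchmark will use chajda's query functions here
            cur.execute(
                "SELECT id, title FROM documents "
                "WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s) "
                "LIMIT 10;",
                (query_text,),
            )
            cur.fetchall()
        return time.perf_counter() - start

    # the queries would come from the benchmark dataset in bench/data/$DATASETNAME
    for q in ['example query one', 'example query two']:
        print(q, run_query(q))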
mikeizbicki commented 3 years ago

To check the number of rows in the table:

SELECT count(*) FROM documents;

Convert from XML to CSV using the lxml library in Python; then use the COPY command to load the CSV.
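As a rough sketch of the conversion, something like the following would work, assuming each document is stored in a <doc> element with <title> and <content> children (the actual tag names depend on the dataset and will need to be adjusted):

    # rough sketch of the xml -> csv conversion; the tag names are assumptions
    # and will need to be adjusted to the actual dataset format
    import csv
    from lxml import etree

    def xml_to_csv(xml_path, csv_path):
        '''Extract (title, content) pairs from an XML dump and write them out as CSV.'''
        tree = etree.parse(xml_path)
        with open(csv_path, 'w', newline='') as f:
            writer = csv.writer(f)
            for doc in tree.getroot().iter('doc'):
                title = doc.findtext('title', default='')
                content = doc.findtext('content', default='')
                writer.writerow([title, content])

    xml_to_csv('bench/data/DATASETNAME/documents.xml', 'bench/data/DATASETNAME/documents.csv')

The resulting CSV can then be loaded through psql with a COPY documents(title, content) FROM ... WITH (FORMAT csv) statement.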