fgregg opened 5 months ago:

The csv download for this query is timing out.

Here's the explain query plan.

Let's add some indexes with sqlite-utils to improve the query performance.

`--setting sql_time_limit_ms 100000`
i'll pick this one up!
mission log pt. 1:
i've built the db and have datasette up and running.
while the db was building locally, i just downloaded the live db from bunkum.us.
needed to install json-to-multicsv.pl:

```bash
brew install cpanm
git clone https://github.com/jsnell/json-to-multicsv.git && cd json-to-multicsv
perl -MCPAN -e'install Text::CSV'
# add perl variables to .zprofile
cpanm . -v -v  # can probably do this w/ perl -MCPAN but i just copied how it's installed for testing on github
```
also needed to install datasette manually.
the query joins several tables against filing.rptId. sqlite automatically creates indexes for primary keys, and rptId is the primary key for filing, so i don't need to index the filing table.
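(a quick way to sanity-check that from the sqlite shell, as a sketch: if rptId is a text primary key, sqlite's automatic index shows up here; if it's an INTEGER PRIMARY KEY it's the rowid itself, so no separate index is listed and none is needed)

```sql
-- list the indexes sqlite already maintains on filing;
-- a non-integer primary key appears as sqlite_autoindex_filing_1
PRAGMA index_list(filing);
```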
the biggest tables joined against filing are organization, expenditure, reportable_activity, and activity (record counts are in the table further down).
i added indexes to the join fields (rptId for organization, activity, and reportable_activity; activity_id for expenditure) and confirmed that the query plan updated to use them.
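(for the record, one way to confirm that is EXPLAIN QUERY PLAN in the sqlite shell; a sketch against a few of the joins, where after indexing you want to see SEARCH ... USING INDEX rather than SCAN for each joined table:)

```sql
-- with the new indexes, each join should report
-- SEARCH ... USING INDEX instead of SCAN
explain query plan
select *
from filing
left join organization using (rptId)
left join activity using (rptId)
left join expenditure on expenditure.activity_id = activity.id;
```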
to benchmark, i used the sqlite shell:
```bash
sqlite3 lm10.db
```

then, within the sqlite shell:

```sql
.timer ON
.output /dev/null
-- run query/queries to measure time
```
interestingly, the timings were pretty much identical before and after adding indexes. i want to double check that i didn't make a goofy mistake, and potentially try indexing period_begin and period_through, since they're used in the where clause.
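(one wrinkle to keep in mind while trying that: sqlite only uses an index on an expression when the query repeats that exact expression, so an index on strftime('%Y', period_begin) can serve the strftime predicate, but a plain index on period_begin only helps if the filter is rewritten as a range. a sketch of the rewrite, not a measured fix:)

```sql
-- hypothetical plain index on the raw column
create index lm10_period_begin on lm10 (period_begin);

-- a range predicate over the same year can use that index,
-- where strftime('%Y', period_begin) = '2023' cannot
explain query plan
select rptId
from lm10
where period_begin >= '2023-01-01'
  and period_begin <= '2023-12-31';
```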
after that, i should touch base with forest.
ok, i'm a little puzzled.
tl;dr - i've tried indexing the where expressions and the join fields of the largest tables, but neither does anything to execution time. also, the query and download run essentially instantaneously in my local datasette, even after turning off http caching:

```bash
datasette lm10.db --setting sql_time_limit_ms 100000 --setting default_cache_ttl 0 -o
```
am i missing something, @fgregg?
here's the script i'm using to benchmark:
```python
import sqlite3
import sys
import time


def create_connection(db_file):
    """Create a database connection to the SQLite database specified by db_file"""
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except sqlite3.Error as e:
        print(e)
    return conn


def run_query(conn, query):
    """Execute the provided SQL query and measure execution time"""
    start_time = time.time()
    cursor = conn.cursor()
    cursor.execute(query)
    rows = cursor.fetchall()
    end_time = time.time()
    execution_time = end_time - start_time
    return rows, execution_time


def main():
    _, database, query = sys.argv

    # Create a database connection
    conn = create_connection(database)
    if conn is None:
        print("Error: Unable to connect to the database.")
        return

    # Run the query and measure execution time
    _, execution_time = run_query(conn, query)

    # Display query results and execution time
    print("Execution time: {:.6f} seconds".format(execution_time))

    # Close the database connection
    conn.close()


if __name__ == "__main__":
    main()
```
and the command i'm using to run the script:
```bash
python benchmark.py lm10.db "select
  *
from
  filing
  inner join filer using (srNum)
  left join lm10 using (rptId)
  left join other_address using (rptId)
  left join principal_officer using (rptID)
  left join reportable_activity using (rptId)
  left join reporting_employer using (rptID)
  left join signature using (rptId)
  left join organization using (rptId)
  left join activity using (rptID)
  left join counterparty_contact on counterparty_contact.activity_id = activity.id
  left join counterparty_organization on counterparty_organization.activity_id = activity.id
  left join expenditure on expenditure.activity_id = activity.id
where
  (
    strftime('%Y', period_begin) = '2023'
    OR strftime('%Y', period_through) = '2023'
  )"
```
quick note on how records are distributed between tables:
| table | num records |
|---|---|
| organization | 28744 |
| expenditure | 27248 |
| reportable_activity | 10926 |
| activity | 9842 |
| counterparty_organization | 9842 |
| counterparty_contact | 5212 |
| signature | 3642 |
| lm10 | 1821 |
| reporting_employer | 1821 |
| principal_officer | 1389 |
| other_address | 353 |
baseline, no added indexes:

```
Execution time: 33.637214 seconds
```

after indexing the where expressions:

```bash
sqlite-utils query lm10.db "create index lm10_period_begin_year on lm10(strftime('%Y', period_begin))"
sqlite-utils query lm10.db "create index lm10_period_through_year on lm10(strftime('%Y', period_through))"
```

```
Execution time: 33.644461 seconds
```

after also indexing the rptId join fields:

```bash
sqlite-utils create-index lm10.db organization rptId
sqlite-utils create-index lm10.db activity rptId
sqlite-utils create-index lm10.db reportable_activity rptId
```

```
Execution time: 33.807820 seconds
```

after also indexing the activity_id join fields:

```bash
sqlite-utils create-index lm10.db expenditure activity_id
sqlite-utils create-index lm10.db counterparty_contact activity_id
sqlite-utils create-index lm10.db counterparty_organization activity_id
```

```
Execution time: 33.870928 seconds
```
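(the record counts above hint at what's actually going on: with ~1821 lm10 filings, 28744 organization rows is roughly 16 per filing and 9842 activities roughly 5, so left-joining several one-to-many tables at once multiplies rows per filing. a sketch of a query to eyeball that fan-out, using the tables above:)

```sql
-- count how many joined rows each filing explodes into
select rptId, count(*) as joined_rows
from filing
left join organization using (rptId)
left join activity using (rptId)
left join expenditure on expenditure.activity_id = activity.id
group by rptId
order by joined_rows desc
limit 10;
```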
indexes didn't help, but handling the cartesian explosions did. i went with dumping multiple relations to json, since activities contain nested expenditures.
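the pattern in miniature, using signature as the example: collapse each child table to one row per rptId, with the children packed into a json array, so the outer left join is one-to-one again.

```sql
-- one output row per rptId; signatures becomes a json array string
select
  rptId,
  json_group_array(
    json_object('on_date', on_date, 'signed', signed, 'title', title)
  ) as signatures
from signature
group by rptId;
```

scaled up to every child table, with expenditures nested inside activities, the full query: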
```sql
select
  *
from
  lm10
  inner join filing using (rptId)
  inner join filer using (srNum)
  left join other_address using (rptId)
  left join principal_officer using (rptID)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'answer', answer,
          'code', code,
          'n_responses', n_responses,
          'question', question
        )
      ) as reportable_activities
    from
      reportable_activity
    group by
      rptId
  ) rep_act using (rptId)
  left join reporting_employer using (rptID)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'on_date', on_date,
          'signed', signed,
          'telephone_number', telephone_number,
          'title', title
        )
      ) as signatures
    from
      signature
    group by
      rptId
  ) sig using (rptId)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'promiseDate', promiseDate,
          'oID', oID,
          'empLabOrg', empLabOrg,
          'city', city,
          'state', state
        )
      ) as organizations
    from
      organization
    group by
      rptId
  ) org using (rptId)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'id', id,
          '12b_exists', "12b_exists",
          'activity_code', activity_code,
          'activity_type', activity_type,
          'agencies', agencies,
          'federal_work', federal_work,
          'form_agreement', form_agreement,
          'no_uei_checkbox', no_uei_checkbox,
          'uei', uei,
          'unlisted_agencies', unlisted_agencies,
          'date_of_agreement', date_of_agreement,
          'counterparty_organization.city', counterparty_organization.city,
          'counterparty_organization.organization', counterparty_organization.organization,
          'counterparty_organization.po_box,_bldg,_room_no,_if_any', counterparty_organization."po_box,_bldg,_room_no,_if_any",
          'counterparty_organization.state', counterparty_organization.state,
          'counterparty_organization.street', counterparty_organization.street,
          'counterparty_organization.zip_code_+_4', counterparty_organization."zip_code_+_4",
          'counterparty_contact.city', counterparty_contact.city,
          'counterparty_contact.name', counterparty_contact.name,
          'counterparty_contact.po_box,_bldg,_room_no,_if_any', counterparty_contact."po_box,_bldg,_room_no,_if_any",
          'counterparty_contact.state', counterparty_contact.state,
          'counterparty_contact.street', counterparty_contact.street,
          'counterparty_contact.zip_code_+_4', counterparty_contact."zip_code_+_4",
          'expenditures', json(expenditures)
        )
      ) as activities
    from
      activity
      join (
        select
          activity_id,
          json_group_array(
            json_object(
              'date', date,
              'amount', amount,
              'kind', kind
            )
          ) as expenditures
        from
          expenditure
        group by
          activity_id
      ) exp on activity.id = exp.activity_id
      left join counterparty_contact on activity.id = counterparty_contact.activity_id
      left join counterparty_organization on activity.id = counterparty_organization.activity_id
    group by
      rptId
  ) act using (rptID)
where
  (
    (
      period_begin >= '2023-01-01'
      AND period_begin <= '2023-12-31'
    )
    OR (
      period_through >= '2023-01-01'
      AND period_through <= '2023-12-31'
    )
  )
order by
  lm10.rptId
```
local benchmark: Execution time: 0.157672 seconds
confirmed the csv download works.
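(one consequence of this shape: the aggregated columns come back as json strings. if anyone downstream needs them as rows again, sqlite's json_each can unpack them; a sketch, where "results" is a hypothetical table or view holding the query output above:)

```sql
-- hypothetical: "results" holds the query output above;
-- json_each turns the signatures json array back into one row per signature
select
  rptId,
  json_extract(j.value, '$.title') as title,
  json_extract(j.value, '$.on_date') as on_date
from results, json_each(results.signatures) as j;
```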