fgregg opened 5 months ago:

The csv download for this query is timing out.

Here's the explain query plan.

Let's add some indexes with sqlite-utils to improve the query performance.

`--setting sql_time_limit_ms 100000`
i'll pick this one up!
mission log pt. 1:
i've built the db and have datasette up and running.
while the db was building locally, i just downloaded the live db from bunkum.us.
needed to install json-to-multicsv.pl:

```bash
brew install cpanm
git clone https://github.com/jsnell/json-to-multicsv.git && cd json-to-multicsv
perl -MCPAN -e'install Text::CSV'
# add perl variables to .zprofile
cpanm . -v -v  # can probably do this w/ perl -MCPAN but i just copied how it's installed for testing on github
```
also needed to install datasette manually.
the query joins several tables against filing.rptId. sqlite automatically creates indexes for primary keys, and rptId is the primary key for filing, so i don't need to index the filing table.
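(a quick way to sanity-check that from the sqlite shell, as a sketch: if rptId is a text primary key, sqlite's automatic index shows up here; if it's an INTEGER PRIMARY KEY it's the rowid itself, so no separate index is listed and none is needed)

```sql
-- list the indexes sqlite already maintains on filing;
-- a non-integer primary key appears as sqlite_autoindex_filing_1
PRAGMA index_list(filing);
```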
the biggest tables joined against filing are organization, expenditure, reportable_activity, and activity (record counts are in the table further down).
i added indexes to the join fields (rptId for organization, activity, and reportable_activity; activity_id for expenditure) and confirmed that the query plan updated to use them.
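(for the record, one way to confirm that is EXPLAIN QUERY PLAN in the sqlite shell; a sketch against a few of the joins, where after indexing you want to see SEARCH ... USING INDEX rather than SCAN for each joined table:)

```sql
-- with the new indexes, each join should report
-- SEARCH ... USING INDEX instead of SCAN
explain query plan
select *
from filing
left join organization using (rptId)
left join activity using (rptId)
left join expenditure on expenditure.activity_id = activity.id;
```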
to benchmark, i used the sqlite shell:
```bash
sqlite3 lm10.db
```

then, within the sqlite shell:

```sql
.timer ON
.output /dev/null
-- run query/queries to measure time
```
interestingly, the timings were pretty much identical before and after adding indexes. i want to double check that i didn't make a goofy mistake, and potentially try indexing period_begin and period_through, since they're used in the where clause.
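(one wrinkle to keep in mind while trying that: sqlite only uses an index on an expression when the query repeats that exact expression, so an index on strftime('%Y', period_begin) can serve the strftime predicate, but a plain index on period_begin only helps if the filter is rewritten as a range. a sketch of the rewrite, not a measured fix:)

```sql
-- hypothetical plain index on the raw column
create index lm10_period_begin on lm10 (period_begin);

-- a range predicate over the same year can use that index,
-- where strftime('%Y', period_begin) = '2023' cannot
explain query plan
select rptId
from lm10
where period_begin >= '2023-01-01'
  and period_begin <= '2023-12-31';
```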
after that, i should touch base with forest.
ok, i'm a little puzzled.
tl;dr - i've tried indexing the where expressions and the join fields of the largest tables, but neither does anything to execution time. also, the query and download run essentially instantaneously in my local datasette, even after turning off http caching:

```bash
datasette lm10.db --setting sql_time_limit_ms 100000 --setting default_cache_ttl 0 -o
```
am i missing something, @fgregg?
here's the script i'm using to benchmark:
```python
import sqlite3
import sys
import time


def create_connection(db_file):
    """Create a database connection to the SQLite database specified by db_file"""
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except sqlite3.Error as e:
        print(e)
    return conn


def run_query(conn, query):
    """Execute the provided SQL query and measure execution time"""
    start_time = time.time()
    cursor = conn.cursor()
    cursor.execute(query)
    rows = cursor.fetchall()
    end_time = time.time()
    execution_time = end_time - start_time
    return rows, execution_time


def main():
    _, database, query = sys.argv

    # Create a database connection
    conn = create_connection(database)
    if conn is None:
        print("Error: Unable to connect to the database.")
        return

    # Run the query and measure execution time
    _, execution_time = run_query(conn, query)

    # Display query results and execution time
    print("Execution time: {:.6f} seconds".format(execution_time))

    # Close the database connection
    conn.close()


if __name__ == "__main__":
    main()
```
and the command i'm using to run the script:
```bash
python benchmark.py lm10.db "select
  *
from
  filing
  inner join filer using (srNum)
  left join lm10 using (rptId)
  left join other_address using (rptId)
  left join principal_officer using (rptID)
  left join reportable_activity using (rptId)
  left join reporting_employer using (rptID)
  left join signature using (rptId)
  left join organization using (rptId)
  left join activity using (rptID)
  left join counterparty_contact on counterparty_contact.activity_id = activity.id
  left join counterparty_organization on counterparty_organization.activity_id = activity.id
  left join expenditure on expenditure.activity_id = activity.id
where
  (
    strftime('%Y', period_begin) = '2023'
    OR strftime('%Y', period_through) = '2023'
  )"
```
quick note on how records are distributed between tables:
| table | num records |
|---|---|
| organization | 28744 |
| expenditure | 27248 |
| reportable_activity | 10926 |
| activity | 9842 |
| counterparty_organization | 9842 |
| counterparty_contact | 5212 |
| signature | 3642 |
| lm10 | 1821 |
| reporting_employer | 1821 |
| principal_officer | 1389 |
| other_address | 353 |
baseline, no added indexes:

```
Execution time: 33.637214 seconds
```

after indexing the where expressions:

```bash
sqlite-utils query lm10.db "create index lm10_period_begin_year on lm10(strftime('%Y', period_begin))"
sqlite-utils query lm10.db "create index lm10_period_through_year on lm10(strftime('%Y', period_through))"
```

```
Execution time: 33.644461 seconds
```

after also indexing the rptId join fields:

```bash
sqlite-utils create-index lm10.db organization rptId
sqlite-utils create-index lm10.db activity rptId
sqlite-utils create-index lm10.db reportable_activity rptId
```

```
Execution time: 33.807820 seconds
```

after also indexing the activity_id join fields:

```bash
sqlite-utils create-index lm10.db expenditure activity_id
sqlite-utils create-index lm10.db counterparty_contact activity_id
sqlite-utils create-index lm10.db counterparty_organization activity_id
```

```
Execution time: 33.870928 seconds
```
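(the record counts above hint at what's actually going on: with ~1821 lm10 filings, 28744 organization rows is roughly 16 per filing and 9842 activities roughly 5, so left-joining several one-to-many tables at once multiplies rows per filing. a sketch of a query to eyeball that fan-out, using the tables above:)

```sql
-- count how many joined rows each filing explodes into
select rptId, count(*) as joined_rows
from filing
left join organization using (rptId)
left join activity using (rptId)
left join expenditure on expenditure.activity_id = activity.id
group by rptId
order by joined_rows desc
limit 10;
```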
indexes didn't help, but handling the cartesian explosions did. i went with dumping multiple relations to json, since activities contain nested expenditures.
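the pattern in miniature, using signature as the example: collapse each child table to one row per rptId, with the children packed into a json array, so the outer left join is one-to-one again.

```sql
-- one output row per rptId; signatures becomes a json array string
select
  rptId,
  json_group_array(
    json_object('on_date', on_date, 'signed', signed, 'title', title)
  ) as signatures
from signature
group by rptId;
```

scaled up to every child table, with expenditures nested inside activities, the full query: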
```sql
select
  *
from
  lm10
  inner join filing using (rptId)
  inner join filer using (srNum)
  left join other_address using (rptId)
  left join principal_officer using (rptID)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'answer', answer,
          'code', code,
          'n_responses', n_responses,
          'question', question
        )
      ) as reportable_activities
    from
      reportable_activity
    group by
      rptId
  ) rep_act using (rptId)
  left join reporting_employer using (rptID)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'on_date', on_date,
          'signed', signed,
          'telephone_number', telephone_number,
          'title', title
        )
      ) as signatures
    from
      signature
    group by
      rptId
  ) sig using (rptId)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'promiseDate', promiseDate,
          'oID', oID,
          'empLabOrg', empLabOrg,
          'city', city,
          'state', state
        )
      ) as organizations
    from
      organization
    group by
      rptId
  ) org using (rptId)
  left join (
    select
      rptId,
      json_group_array(
        json_object(
          'id', id,
          '12b_exists', "12b_exists",
          'activity_code', activity_code,
          'activity_type', activity_type,
          'agencies', agencies,
          'federal_work', federal_work,
          'form_agreement', form_agreement,
          'no_uei_checkbox', no_uei_checkbox,
          'uei', uei,
          'unlisted_agencies', unlisted_agencies,
          'date_of_agreement', date_of_agreement,
          'counterparty_organization.city', counterparty_organization.city,
          'counterparty_organization.organization', counterparty_organization.organization,
          'counterparty_organization.po_box,_bldg,_room_no,_if_any', counterparty_organization."po_box,_bldg,_room_no,_if_any",
          'counterparty_organization.state', counterparty_organization.state,
          'counterparty_organization.street', counterparty_organization.street,
          'counterparty_organization.zip_code_+_4', counterparty_organization."zip_code_+_4",
          'counterparty_contact.city', counterparty_contact.city,
          'counterparty_contact.name', counterparty_contact.name,
          'counterparty_contact.po_box,_bldg,_room_no,_if_any', counterparty_contact."po_box,_bldg,_room_no,_if_any",
          'counterparty_contact.state', counterparty_contact.state,
          'counterparty_contact.street', counterparty_contact.street,
          'counterparty_contact.zip_code_+_4', counterparty_contact."zip_code_+_4",
          'expenditures', json(expenditures)
        )
      ) as activities
    from
      activity
      join (
        select
          activity_id,
          json_group_array(
            json_object(
              'date', date,
              'amount', amount,
              'kind', kind
            )
          ) as expenditures
        from
          expenditure
        group by
          activity_id
      ) exp on activity.id = exp.activity_id
      left join counterparty_contact on activity.id = counterparty_contact.activity_id
      left join counterparty_organization on activity.id = counterparty_organization.activity_id
    group by
      rptId
  ) act using (rptID)
where
  (
    (
      period_begin >= '2023-01-01'
      AND period_begin <= '2023-12-31'
    )
    OR (
      period_through >= '2023-01-01'
      AND period_through <= '2023-12-31'
    )
  )
order by
  lm10.rptId
```
local benchmark: Execution time: 0.157672 seconds
confirmed the csv download works.
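(one consequence of this shape: the aggregated columns come back as json strings. if anyone downstream needs them as rows again, sqlite's json_each can unpack them; a sketch, where "results" is a hypothetical table or view holding the query output above:)

```sql
-- hypothetical: "results" holds the query output above;
-- json_each turns the signatures json array back into one row per signature
select
  rptId,
  json_extract(j.value, '$.title') as title,
  json_extract(j.value, '$.on_date') as on_date
from results, json_each(results.signatures) as j;
```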