dmos62 opened 1 year ago
Set up profiling in this commit in the above PR.
To profile, import a CSV file with a lot of columns through the UI, notice that a profile file ending in a UTC timestamp is created in the project's root, then open it with something like snakeviz.
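To illustrate the shape of such a hook, here's a minimal sketch assuming Python's `cProfile`; the decorator name and file-naming scheme are illustrative, not the actual commit:

```python
import cProfile
import functools
from datetime import datetime, timezone


def profiled(func):
    """Dump a cProfile .prof file, suffixed with a UTC timestamp, per call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(func, *args, **kwargs)
        finally:
            # Timestamped file in the current working directory (project root).
            timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
            profiler.dump_stats(f"profile-{timestamp}.prof")
    return wrapper
```

Opening the dumped file with `snakeviz profile-<timestamp>.prof` gives an interactive view of where time was spent.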
Attaching a zipped profile file for a ~90-second inference: profile.zip
Was able to reduce inference time by more than half (from 95s to 37s) by disabling the defaults-related code in `db/columns/operations/alter.py::alter_column_type`.
The code seems to be unnecessary in at least some cases (hopefully including when importing a table), because a method further up the call stack already filters columns down to non-default columns: see `db/tables/operations/infer_types.py::update_table_column_types`.
If that's true, we can refactor to only trigger the defaults-related logic when necessary.
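For concreteness, here is a hedged sketch of what that refactor could look like, assuming SQLAlchemy reflection and PostgreSQL; the function and helper names are illustrative, not the actual `alter.py` code:

```python
from sqlalchemy import inspect, text


def _server_default(engine, table_name, column_name):
    """Return the reflected server default expression for a column, or None."""
    for col in inspect(engine).get_columns(table_name):
        if col["name"] == column_name:
            return col.get("default")
    return None


def alter_column_type(engine, table_name, column_name, new_type):
    default = _server_default(engine, table_name, column_name)
    with engine.begin() as conn:
        if default is not None:
            # Only pay for the defaults round-trip when a default exists.
            conn.execute(text(
                f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" DROP DEFAULT'
            ))
        conn.execute(text(
            f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" '
            f'TYPE {new_type} USING "{column_name}"::{new_type}'
        ))
        if default is not None:
            # Re-attach the old default, cast to the new type.
            conn.execute(text(
                f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" '
                f'SET DEFAULT CAST({default} AS {new_type})'
            ))
```

In the common import path, where no user column has a default, both `if` branches are skipped and the expensive get/set cycle is avoided entirely.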
Most of these algorithms don't have explanations, so I might have completely misunderstood the intent. @mathemancer, could you sanity-check and summarize the intent of the defaults-related logic in `alter_column_type` (for my immediate purposes, but I'll also write docstrings)?
N.B. There are two combinations of "default" and "column" in our codebase:

1. Default columns: columns that a table gets by default, e.g., the `id` primary key column.
2. Columns with default values: a column which will provide a value for a given record if one isn't submitted. Unfortunately, this includes the `id` column, as it provides a next sequence value for each record.
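To make the two senses concrete, here's an illustrative SQLAlchemy table definition (the table and the `status` column are made up for the example):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()
example = Table(
    "example",
    metadata,
    # Sense (1): a "default column", created by default. It is also a
    # sense-(2) column, since its value comes from a sequence.
    Column("id", Integer, primary_key=True, autoincrement=True),
    # Sense (2) only: a user column whose value is supplied by the
    # database when a record is submitted without one.
    Column("status", String, server_default="active"),
)
```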
In case it wasn't clear, the `default` in the `alter_column_type` logic refers to (2), whereas the `is_default` in `update_column_types` refers to (1).
From a user perspective, I think your instinct is correct that we don't need to consider column default values when doing type inference at this time. The reason is that the front end only calls the type suggestion endpoint in the context of an initial import, and the only column involved with a default(2) value is the default(1) column, `id`. This is the column that's filtered out from inference earlier in the stack, since we (of course) never need to change the type of the default `id` column.
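A minimal sketch of that filtering step, assuming a column object exposing an `is_default` flag for sense (1):

```python
def columns_to_infer(columns):
    """Drop default(1) columns (e.g. id) before running type inference."""
    return [col for col in columns if not col.is_default]
```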
The problem from an API design standpoint, however, is that the `suggest_types` endpoint can't really guarantee any of that. I.e., we'd just crash into errors whenever trying to change the type of a column that has a somehow-incompatible default value or default value generator.
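One way the endpoint could fail gracefully instead is to probe whether the existing default is castable before running the `ALTER`. A hedged sketch, assuming PostgreSQL and SQLAlchemy; the exception and function names are made up:

```python
from sqlalchemy import text


class IncompatibleDefault(Exception):
    """Raised when a column default can't be cast to the requested type."""


def check_default_castable(conn, default_expr, new_type):
    try:
        # Run the cast inside a savepoint so a failure doesn't poison
        # the enclosing transaction.
        with conn.begin_nested():
            conn.execute(text(f"SELECT CAST({default_expr} AS {new_type})"))
    except Exception as exc:
        raise IncompatibleDefault(
            f"default {default_expr!r} cannot be cast to {new_type}"
        ) from exc
```

With a check like this, `suggest_types` could report a clear error instead of crashing mid-alteration.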
Big picture: this is (yet another) issue borne of SQLAlchemy. Specifically, SQLAlchemy is not great (terrible, actually) at handling default values and default value generators for table columns, necessitating slow and cumbersome logic to work around it.
Split this into subtasks following a sync with Brent.
Root issue: #2346
Our column type inference seems to be slower than it could be. Look for ways to improve its performance.
Profiling results
Profiled with a 381 kB, 539-line CSV file with a large number of columns.
Type inference took 95 seconds.
We spent 50% of the time in column defaults-related logic (`get_column_defaults`, `set_column_defaults`), 75% in reflection (`reflect_table`), and 25% executing casting calls (`execute_statement`). This adds up to more than 100% because reflection overlaps with a lot of the other work.

Ideas