mathesar-foundation / mathesar

Web application providing an intuitive user experience to databases.
https://mathesar.org/
GNU General Public License v3.0

Make column type inference faster #2347

Open dmos62 opened 1 year ago

dmos62 commented 1 year ago

Root issue: #2346

Our column type inference seems to be slower than it could be. Look for ways to improve its performance.

Profiling results

[Profiling screenshot: 2023-01-26_16-12]

Profiled with a 381 kB, 539-line CSV file whose first two lines look like this:

Direction,Year,Date,Weekday,Country,Commodity,Transport_Mode,Measure,Value,Cumulative,Direction2,Year2,Date2,Weekday2,Country2,Commodity2,Transport_Mode2,Measure2,Value2,Cumulative2,Direction2_2,Year2_2,Date2_2,Weekday2_2,Country2_2,Commodity2_2,Transport_Mode2_2,Measure2_2,Value2_2,Cumulative2_2,Direction22,Year22,Date22,Weekday22,Country22,Commodity22,Transport_Mode22,Measure22,Value22,Cumulative22,Direction2_3,Year2_3,Date2_3,Weekday2_3,Country2_3,Commodity2_3,Transport_Mode2_3,Measure2_3,Value2_3,Cumulative2_3,Direction22_2,Year22_2,Date22_2,Weekday22_2,Country22_2,Commodity22_2,Transport_Mode22_2,Measure22_2,Value22_2,Cumulative22_2,Direction2_4,Year2_4,Date2_4,Weekday2_4,Country2_4,Commodity2_4,Transport_Mode2_4,Measure2_4,Value2_4,Cumulative2_4,Direction22_3,Year22_3,Date22_3,Weekday22_3,Country22_3,Commodity22_3,Transport_Mode22_3,Measure22_3,Value22_3,Cumulative22_3,Direction2_5,Year2_5,Date2_5,Weekday2_5,Country2_5,Commodity2_5,Transport_Mode2_5,Measure2_5,Value2_5,Cumulative2_5,Direction22_4,Year22_4,Date22_4,Weekday22_4,Country22_4,Commodity22_4,Transport_Mode22_4,Measure22_4,Value22_4,Cumulative22_4
Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000,Exports,2015,01/01/2015,2004-01-01,All,All,All,$,104000000,104000000

Type inference took 95 seconds.

We spent about 50% of the time in column-defaults-related logic (get_column_defaults, set_column_defaults), 75% in reflection (reflect_table), and 25% executing casting calls (execute_statement). These figures add up to more than 100% because reflection overlaps with much of the other work.

Ideas

dmos62 commented 1 year ago

Set up profiling in this commit in the above PR.

To profile, import a CSV file with a lot of columns through the UI; a profile file whose name ends in a UTC timestamp will appear in the project's root. Open it with something like snakeviz.
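
For reference, the profiling hook can be as simple as the sketch below. This is not the code from the linked commit, just a minimal illustration of running a callable under cProfile and dumping a UTC-timestamped profile file that snakeviz can open; the wrapper name and file-name pattern are made up.

```python
# Minimal sketch: run a callable under cProfile and write a UTC-timestamped
# .prof file that snakeviz can open. Not the actual code from the commit.
import cProfile
import pstats
from datetime import datetime, timezone


def profile_to_file(func, *args, **kwargs):
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M")
    out_path = f"profile_{timestamp}.prof"  # hypothetical file-name pattern
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    profiler.dump_stats(out_path)
    # Quick text summary, sorted by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
    return result
```

Open the resulting file with `snakeviz profile_<timestamp>.prof`.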

Attaching a zipped profile file for a ~90-second inference run: profile.zip

dmos62 commented 1 year ago

I was able to reduce inference time by more than half (from 95 s to 37 s) by disabling the defaults-related code in db/columns/operations/alter.py::alter_column_type.

The code seems to be unnecessary at least in some cases (hopefully including when importing a table), because a method further up the call stack already filters columns down to non-default columns: see db/tables/operations/infer_types.py::update_table_column_types.

If that's true, we can refactor to only trigger the defaults-related logic when necessary.
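
As a rough illustration of that refactor (assuming a PostgreSQL backend and a SQLAlchemy connection; the function signature, the handle_defaults flag, and the SQL here are mine, not Mathesar's actual alter.py code), the defaults handling could be made opt-in so that import-time inference skips it:

```python
# Hypothetical sketch: only touch column defaults when the caller asks for it.
from sqlalchemy import text


def alter_column_type(conn, table_name, column_name, target_type, handle_defaults=True):
    default_expr = None
    if handle_defaults:
        # Look up the column's default expression, if any.
        default_expr = conn.execute(
            text(
                "SELECT column_default FROM information_schema.columns "
                "WHERE table_name = :t AND column_name = :c"
            ),
            {"t": table_name, "c": column_name},
        ).scalar()
        if default_expr is not None:
            conn.execute(text(
                f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" DROP DEFAULT'
            ))
    # The actual type change, casting existing values.
    conn.execute(text(
        f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" '
        f'TYPE {target_type} USING "{column_name}"::{target_type}'
    ))
    if handle_defaults and default_expr is not None:
        # Re-create an analogous default for the new type.
        conn.execute(text(
            f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" '
            f'SET DEFAULT {default_expr}'
        ))
```

Import-time inference, which has already filtered out columns with defaults, could then pass handle_defaults=False and skip the catalog round-trips entirely.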

Most of these algorithms don't have explanations, so I might have completely misunderstood the intent. @mathemancer, could you sanity-check and summarize the intent of the defaults-related logic in alter_column_type (for my immediate purposes, but I'll also add docstrings)?

mathemancer commented 1 year ago

N.B. There are two distinct senses in which "default" and "column" combine in our codebase:

  1. Default Columns: This is a column that we'll always add to any Mathesar-created table. Currently, that's only the id primary key column.
  2. Columns with default values: This is a column that provides a value for a given record if one isn't submitted. Unfortunately, this includes the id column, since it provides the next value in its sequence for each record.

In case it wasn't clear, the default in the alter_column_type logic refers to (2), whereas the is_default in update_column_types refers to (1). A toy example follows below.
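
A toy SQLAlchemy table definition (illustrative only, not Mathesar code) makes the distinction concrete:

```python
# Sense (1): the "default column" Mathesar always adds, i.e. the id primary key.
# Sense (2): any column with a default value; id also falls here because its
# value comes from a sequence, while "status" has a static default.
from sqlalchemy import Column, Integer, MetaData, String, Table, text

metadata = MetaData()
example = Table(
    "example",
    metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("status", String, server_default=text("'new'")),
)
```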

From a user perspective, I think your instinct is correct that we don't need to consider column default values when doing type inference at this time. The reason is that the front end only calls the type suggestion endpoint in the context of an initial import, and the only column involved with a default(2) value is the default(1) column, id. This is the column that's filtered out from inference earlier in the stack, since we (of course) don't need to change the type of the default id column (ever).

The problem from an API design perspective, however, is that the suggest_types endpoint can't really guarantee any of that. That is, we'd just crash into errors whenever we tried to change the type of a column whose default value or default value generator is somehow incompatible with the new type.

Big picture: This is (yet another) issue born of SQLAlchemy. Specifically, SQLAlchemy is not great (terrible, actually) at handling default values and default value generators for table columns, which necessitates slow and cumbersome logic to:

  1. Test whether a column has a default value.
  2. Test whether the default value is dynamic (i.e., generated by a sequence) or static.
  3. Get the default value.
  4. Set an analogous new default value, appropriate to the column's new type, when altering (a rough illustration follows below).
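
For illustration only (this is not the actual Mathesar implementation), here is roughly how steps 1-3 look with SQLAlchemy's inspector, plus a crude check for a sequence-backed default; step 4 is the part that has to be reconstructed by hand after the type change. The DSN and table name are placeholders.

```python
# Illustrative only: reflect column defaults with SQLAlchemy's inspector.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder DSN
inspector = inspect(engine)

for col in inspector.get_columns("my_table"):   # "my_table" is a placeholder
    default = col.get("default")   # steps 1 and 3: the reflected default expression, or None
    if default is None:
        continue
    dynamic = "nextval(" in default             # step 2: crude sequence-backed check
    # Step 4: after ALTER ... TYPE, an analogous default (e.g. a static value
    # cast to the new type) has to be set again explicitly.
    print(col["name"], default, "dynamic" if dynamic else "static")
```
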
dmos62 commented 1 year ago

Split this into subtasks following a sync with Brent.

github-actions[bot] commented 1 year ago

This issue has not been updated in 90 days and is being marked as stale.