jhamrick / dbtools

Tools for interfacing with SQLite databases
Other
34 stars 4 forks source link

To Scrape or Not to Scrape TABLE/VIEW Metadata? #16

Open JimCallahanOrlando opened 10 years ago

JimCallahanOrlando commented 10 years ago

CONCLUSION So, in good conscience, I cannot recommend methods (API functions or PRAGMAS) that are not subject to the rigorous SQLite3 testing process and not even guaranteed to be this release, let alone, future releases.

QUESTIONS/CONCERNS To Scrape or Not to Scrape TABLE/VIEW Metadata? Re-parsing SQLite SQL considered harmful. Is there a reliable API alternative?

ANALYSIS Currently dbtools obtains SQL TABLE metadata (column names, column types, primary key and whether the primary key auto-increments) by querying and scraping the internal sqlite_Master table for the "sql" string that created the table.

# SQL "cmd" string to obtain "sql" creation string from the sqlite_master table
    cmd = ("SELECT sql FROM sqlite_master "
           "WHERE tbl_name='%s' and type='table'" % name)

# Execute the SQL "cmd" string
    info = sql_execute(self.db, cmd, fetchall=True, verbose=verbose)

# parse the response -- it will look like 
# 'CREATE TABLE name(col1 TYPE, col2 TYPE, ...)'
    args = re.match("([^\(]*)\((.*)\)", info[0][0]).groups()[1]    

If dbtools is modified to accept VIEWs in place of TABLEs, the "sql" string for creating a VIEW can be much more convoluted, in fact the "sql" creation string for a VIEW is likely to be a CREATE VIEW AS SELECT ... statement that spells out a lot of ugly stuff specifically so it can be hidden from that point forward.

The four table meta data items are:

  1. column names
  2. column types
  3. primary key
  4. whether the primary key auto-increments

Scraping for #1, column names and #2, column types from the parsed version of the "sql" creation string (args =) above:

# get the column names [and types]
    cols = [a.strip() for a in args.split(",")]
    self.columns = tuple([x.split(" ")[0] for x in cols])

Scraping for #3, the primary key:

# parse primary key, if any
    pk = [bool(re.search(r"PRIMARY KEY", x)) for x in cols]
    primary_key = np.nonzero(pk)[0]
    if len(primary_key) > 1:
        raise ValueError("more than one primary key: %s" % primary_key)
    elif len(primary_key) == 1:
        self.primary_key = self.columns[primary_key[0]]
    else:
        self.primary_key = None

Scraping for #4 does the primary key auto-increment?

parse autoincrement, if applicable

ai = [bool(re.search(r"AUTOINCREMENT", x)) for x in cols]
autoincrement = np.nonzero(ai)[0]
if len(autoincrement) > 1:
    raise ValueError("more than one autoincrementing "
                "column: %s" % autoincrement)
elif self.primary_key is not None and len(autoincrement) == 1:
    if self.primary_key != self.columns[autoincrement[0]]:
        raise ValueError("autoincrement is different from primary key")
    self.autoincrement = True
else:
    self.autoincrement = False

The Python packages sqlite, apsw and sqlalchemy expose a portion of the SQLite3 C API functions. If the metadata is not exposed in the C API then the wrapper functions must be doing their own scraping and blindly trusting wrapper scraping is not an improvement over our own scraping. On the other hand the SQLite3 C API functions should go through the rigorous SQLite3 testing process. http://www.sqlite.org/testing.html

The SQLite3 C API Interface has functions for #1 column names and #2 column types.

Column Names In A Result Set

    const char *sqlite3_column_name(sqlite3_stmt*, int N);
    const void *sqlite3_column_name16(sqlite3_stmt*, int N);

These routines return the name assigned to a particular column 
in the result set of a SELECT statement.
...The name of a result column is the value of the "AS" clause for that column, 
if there is an AS clause. If there is no AS clause then the name of the column 
is unspecified and may change from one release of SQLite to the next.
http://www.sqlite.org/c3ref/column_name.html

Result Values From A Query
    ...
    int sqlite3_column_type(sqlite3_stmt*, int iCol);
    ...
The sqlite3_column_type() routine returns the datatype code for the initial 
data type of the result column. The returned value is one of SQLITE_INTEGER, 
SQLITE_FLOAT, SQLITE_TEXT, SQLITE_BLOB, or SQLITE_NULL. The value returned 
by sqlite3_column_type() is only meaningful if no type conversions have 
occurred as described below. After a type conversion, the value returned 
by sqlite3_column_type() is undefined. Future versions of SQLite may change 
the behavior of sqlite3_column_type() following a type conversion.
http://www.sqlite.org/c3ref/column_blob.html

This assumes we want the final aliased "AS" names in a result set; there is also in the SQLite3 C API: "a means to determine the database, table, and table column that is the origin of a particular result column in SELECT statement." http://www.sqlite.org/c3ref/column_database_name.html

The primary key and whether it is autoincremented is a bit more of a challenge. The SQLite3 C API has a function sqlite3_table_column_metadata() which for a specific database, table and column name combination returns data about that column:

Metadata is returned by writing to the memory locations passed 
as the 5th and subsequent parameters to this function. Any of these 
arguments may be NULL, in which case the corresponding element of 
metadata is omitted.

Parameter   Output
            Type            Description
---         -----------     -----------
5th         const char*     Data type
6th         const char*     Name of default collation sequence
7th         int             True if column has a NOT NULL constraint
8th         int             True if column is part of the PRIMARY KEY
9th         int             True if column is AUTOINCREMENT

So, this function would be terrific, if it was available, but it is not even guaranteed to be in the SQLite3 C API library, let alone in the Python apsw!

"This API is only available if the [SQLite3 C API] library was compiled 
with the SQLITE_ENABLE_COLUMN_METADATA C-preprocessor symbol defined."

Keith Medcalf suggested (in the SQLite3 mailing) using PRAGMAS.

There is an ideal PRAGMA for our purpose:

PRAGMA table_info(table-name);

This pragma returns one row for each column in the named table. 
Columns in the result set include the 
column name, 
data type, 
whether or not the column can be NULL, 
and the default value for the column. 
The "pk" column in the result set is zero 
for columns that are not part of the primary key, 
and is the index of the column in the primary key 
for columns that are part of the primary key.
http://www.sqlite.org/pragma.html#pragma_table_info

This PRAGMA has everything we need except whether the primary key is autoincrementing.

Unfortunately, "Specific pragma statements may be removed and others added in future releases of SQLite. There is no guarantee of backwards compatibility." http://www.sqlite.org/pragma.html

SQLite3 is a great system, but there is no standard method of obtaining column metadata in the base SQLite3 C API.

So, in good conscience, I cannot recommend methods (API functions or PRAGMAS) that are not subject to the rigorous SQLite3 testing process and not even guaranteed to be this release, let alone, future releases.