open-contracting / ocdskit

A suite of command-line tools for working with OCDS data
https://ocdskit.readthedocs.io
BSD 3-Clause "New" or "Revised" License
17 stars 6 forks source link

tabulate: remove empty tables and columns #174

Closed duncandewhurst closed 3 years ago

duncandewhurst commented 3 years ago

tabulate creates tables and columns for all fields in the schema, which makes it hard for users to understand scope and coverage.

I'm not sure how much of a priority tabulate is given that the registry will be available soon, but I wanted to make a note of the following Python script, which removes empty tables and columns from the database created by tabulate.

This script is for a SQLite database and uses the %sql line magic from ipython-sql:

# Get the table names from the database and put them in a Pandas dataframe
tables = %sql SELECT name FROM sqlite_master WHERE type = 'table';

for table in tables['name']:

  # Check if the table contains any rows
  table_check = %sql SELECT * FROM {table} limit 1;

  # Drop empty tables
  if len(table_check) == 0:

    %sql DROP TABLE {table};

  else:

    # Get the column names from the table and put them in a Pandas dataframe
    columns = %sql pragma table_info({table});

    non_empty_columns = []

    for column in columns['name']:

      # Check if the column contains any data
      column_check = %sql SELECT max({column}) as max from {table};

      if column_check['max'][0] != None:

        non_empty_columns.append(column)

        cols = ', '.join(non_empty_columns)

    # Recreate tables without empty columns (SQLite doesn't support DROP COLUMN)
    %sql CREATE TABLE {table}_new ({cols});
    %sql INSERT INTO {table}_new SELECT {cols} FROM {table};
    %sql DROP TABLE {table};
    %sql ALTER TABLE {table}_new RENAME TO {table};

Edit: This script could be optimized, either by using create view as select {{non_empty_cols}} from table, or by using one select statement per table to work out the empty fields, at the expense of more complex code to generate the select statement. However, for the purpose of a tutorial database with ~6,000 releases, this code was fast enough (3s).

jpmckinney commented 3 years ago

This command will be removed per #75.