moj-analytical-services / etl_manager

A python package to create a database on the platform using our moj data warehousing framework
21 stars 9 forks source link

Need way of updating db rather than creating from scratch #89

Closed RobinL closed 4 years ago

RobinL commented 5 years ago

If we have two repos that contribute to the same database, then it's hard to include tables from both repos.

For instance, if we have two repositories that both create tables in the open_data database (e.g. one repo that ETLs ONS data to the platform, and another than ETLs travel time data), then you can add either the ONS data or the travel time data to the glue catalogue, but you can't easily have both.

RobinL commented 4 years ago

How do we feel about the following API?

from etl_manager.meta import get_existing_database_from_glue_catalogue

# Note I am not going to attempt to read current tables from Glue and create table objects
db = get_existing_database_from_glue_catalogue('my_database')

t = TableMeta(name="table1", location="somewhere")
t.add_column(name= "employee_id2", type= "character", description= "a new description")

db.add_table(t)

# Will not replace existing tables unless overwrite is set to true
db.append_tables_to_glue_database(overwrite=False)
isichei commented 4 years ago

Yeah fine by me. On top of that this should be used to fix #117. I'd imagine that you could have something like:

    def create_glue_database(self, delete_if_exists=False):
        """
        Creates a database in Glue based on the database object calling the method function.
        By default, will error out if database exists - unless delete_if_exists is set to True (default is False).
        """

        if delete_if_exists:
            self.delete_glue_database()

         db = get_existing_database_from_glue_catalogue(self.name)
         if db:
             existing_tables = db._tables
         else:
             db = {"DatabaseInput": {"Description": self.description, "Name": self.name}}
             _glue_client.create_database(**db)
            existing_tables = []

        for tab in [t for t in self._tables if t not in existing_tables]:
            glue_table_def = tab.glue_table_definition(self.s3_database_path)
            _glue_client.create_table(DatabaseName=self.name, TableInput=glue_table_def)

There are some issues with the above (indenting probably for one). But we would probably want to parameterise the function to only update new tables, set a list of tables to update or do all of them. Anyway thought I'd add this as it will define what is returned from get_existing_database_from_glue_catalogue