Delegate details of connection handling

This connection management is ugly, and forces the calling code to know about connections. This should be delegated and encapsulated in some other object which can handle connection setup and re-use, and can handle connection types other than psycopg2. This is slightly complicated because the Source knows where it wants to get the data from, but this has to be pre-configured for connection sharing. Perhaps this would work best with a ConnectionManager class which collected the sources and introspected them? Or cached connections for re-use?

My first thought is to add a level of abstraction. We can grow a source that only provides this ability, namely to manage the connection information, and just pass that down to the actual source implementations. For example:

from raw_tiles.source.table_reader import TableReader

class DatabaseSource(object):

    def __init__(self, conn_ctx, table_sources):
        self.conn_ctx = conn_ctx
        self.table_sources = table_sources

    def __call__(self, tile):
        all_source_locations = []
        all_timing = {}

        # grab connection
        with self.conn_ctx() as conn:
            # commit transaction
            with conn as conn:
                # cleanup cursor resources
                with conn.cursor() as cur:
                    for source in self.table_sources:
                        table_reader = TableReader(cur)
                        source_locations, timing = source(table_reader, tile)
                        all_source_locations.extend(source_locations)
                        all_timing.update(timing)

        return all_source_locations, all_timing

This gives us a layer where we can toy with threads or working with the connection pool for how we want to dole out the connections to each source, eg whether they all run in the same transaction, share a connection, or each get separate ones, and it prevents upstream from having to know about table_readers and having that concern bubble up the call stack.

In our case this can replace the multisource implementation, because all of our implementations are sql based and talk to the same database, and this can serve the same kind of purpose for grouping sources together. I'd probably just update the factory function that creates this to be aware of that just to keep the configuration part simple, and that still gives us a hook to easily change our minds if things change in the future.

What bothers me about joining up the MultiSource and database connection handling is that they feel like they should be separate things. I feel like the data sources should have uniform interfaces, so that it doesn't matter if something is a MultiSource or OsmSource or FutureHttpApiBasedSource, etc...

What do you think about keeping MultiSource agnostic about the connection type and introducing a ConnectionManager object (with-able) which manages the connections (at the moment, singular connection), which is passed into every Source.__call__? For example:

with ConnectionManager(all_connections_config) as connmgr:
  # enumerate tiles, etc...
  source(connmgr, tile)

Then, within the *Source itself, it could choose what type of connection it wants to use:

def __call__(connmgr, tile):
  with connmgr.database_cursor() as cur:
    # do stuff with cur

How does that sound to you?

What bothers me about joining up the MultiSource and database connection handling is that they feel like they should be separate things. I feel like the data sources should have uniform interfaces, so that it doesn't matter if something is a MultiSource or OsmSource or FutureHttpApiBasedSource, etc...

I agree here, and was only suggesting that we effectively get rid of the multisource, and just have a single databasesource, with each of the others being table sources, only reason being that we wouldn't need it any more. If we find ourselves having multiple other sources we can always introduce it and have both, a multisource and a separate databasesource. Or we could just keep it around and not use it.

I'm not bothered by these having separate signatures, because nobody is going to be interfacing with the table sources directly from a higher level. And if that's ever required, it just needs to be wrapped in a databasesource, which isn't too cumbersome, and makes sense conceptually as well, ie to talk to a table you need to go through a database first. The existing sources are effectively mapped to tables anyway.

Considering the future, if other sources are added, it means that they'll need to get a connmgr, even if they don't need one, which just feels like the details of one implementation have leaked into the api. Continuing with that reasoning, it could also imply that the connmgr might need to grow additional concerns for new implementations, and then we'd need to be aware of all the other implementation's usage of it any time changes are made to it.

What do you think about keeping MultiSource agnostic about the connection type and introducing a ConnectionManager object (with-able) which manages the connections (at the moment, singular connection), which is passed into every Source.call? For example:

This will certainly work in our case for managing connections, but I don't think it's as optimal as moving the concern in a layer above. In effect, it's pulling what it needs, instead of having it pushed to it, which means it knows about more than it might have to otherwise. Additionally, if we find that we want to run each source in a separate thread, we would still need something to coordinate that at a higher level anyway, and then the context manager as a whole needs to be thread safe because calls are being made into it from the sources.

I want to point out that I'm playing devil's advocate and this is largely academic, because at this point we only have a single type of implementation that's single threaded. But it's fun to bike shed! So, if the conn mgr still makes more sense to you, go for it! :)

tilezen / raw_tiles

Delegate details of connection handling #18