Open zerebubuth opened 6 years ago
My first thought is to add a level of abstraction. We can grow a source that only provides this ability, namely to manage the connection information, and just pass that down to the actual source implementations. For example:
from raw_tiles.source.table_reader import TableReader
class DatabaseSource(object):
def __init__(self, conn_ctx, table_sources):
self.conn_ctx = conn_ctx
self.table_sources = table_sources
def __call__(self, tile):
all_source_locations = []
all_timing = {}
# grab connection
with self.conn_ctx() as conn:
# commit transaction
with conn as conn:
# cleanup cursor resources
with conn.cursor() as cur:
for source in self.table_sources:
table_reader = TableReader(cur)
source_locations, timing = source(table_reader, tile)
all_source_locations.extend(source_locations)
all_timing.update(timing)
return all_source_locations, all_timing
This gives us a layer where we can toy with threads or working with the connection pool for how we want to dole out the connections to each source, eg whether they all run in the same transaction, share a connection, or each get separate ones, and it prevents upstream from having to know about table_reader
s and having that concern bubble up the call stack.
In our case this can replace the multisource
implementation, because all of our implementations are sql based and talk to the same database, and this can serve the same kind of purpose for grouping sources together. I'd probably just update the factory function that creates this to be aware of that just to keep the configuration part simple, and that still gives us a hook to easily change our minds if things change in the future.
What bothers me about joining up the MultiSource
and database connection handling is that they feel like they should be separate things. I feel like the data sources should have uniform interfaces, so that it doesn't matter if something is a MultiSource
or OsmSource
or FutureHttpApiBasedSource
, etc...
What do you think about keeping MultiSource
agnostic about the connection type and introducing a ConnectionManager
object (with
-able) which manages the connections (at the moment, singular connection), which is passed into every Source.__call__
? For example:
with ConnectionManager(all_connections_config) as connmgr:
# enumerate tiles, etc...
source(connmgr, tile)
Then, within the *Source
itself, it could choose what type of connection it wants to use:
def __call__(connmgr, tile):
with connmgr.database_cursor() as cur:
# do stuff with cur
How does that sound to you?
What bothers me about joining up the MultiSource and database connection handling is that they feel like they should be separate things. I feel like the data sources should have uniform interfaces, so that it doesn't matter if something is a MultiSource or OsmSource or FutureHttpApiBasedSource, etc...
I agree here, and was only suggesting that we effectively get rid of the multisource, and just have a single databasesource, with each of the others being table sources, only reason being that we wouldn't need it any more. If we find ourselves having multiple other sources we can always introduce it and have both, a multisource and a separate databasesource. Or we could just keep it around and not use it.
I'm not bothered by these having separate signatures, because nobody is going to be interfacing with the table sources directly from a higher level. And if that's ever required, it just needs to be wrapped in a databasesource, which isn't too cumbersome, and makes sense conceptually as well, ie to talk to a table you need to go through a database first. The existing sources are effectively mapped to tables anyway.
Considering the future, if other sources are added, it means that they'll need to get a connmgr, even if they don't need one, which just feels like the details of one implementation have leaked into the api. Continuing with that reasoning, it could also imply that the connmgr might need to grow additional concerns for new implementations, and then we'd need to be aware of all the other implementation's usage of it any time changes are made to it.
What do you think about keeping MultiSource agnostic about the connection type and introducing a ConnectionManager object (with-able) which manages the connections (at the moment, singular connection), which is passed into every Source.call? For example:
This will certainly work in our case for managing connections, but I don't think it's as optimal as moving the concern in a layer above. In effect, it's pulling what it needs, instead of having it pushed to it, which means it knows about more than it might have to otherwise. Additionally, if we find that we want to run each source in a separate thread, we would still need something to coordinate that at a higher level anyway, and then the context manager as a whole needs to be thread safe because calls are being made into it from the sources.
I want to point out that I'm playing devil's advocate and this is largely academic, because at this point we only have a single type of implementation that's single threaded. But it's fun to bike shed! So, if the conn mgr still makes more sense to you, go for it! :)
This connection management is ugly, and forces the calling code to know about connections. This should be delegated and encapsulated in some other object which can handle connection setup and re-use, and can handle connection types other than
psycopg2
. This is slightly complicated because theSource
knows where it wants to get the data from, but this has to be pre-configured for connection sharing. Perhaps this would work best with aConnectionManager
class which collected the sources and introspected them? Or cached connections for re-use?