pudo / dataset

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
https://dataset.readthedocs.org/
MIT License

Why don't we use QueuePool for DB connections? #253

Closed kivo360 closed 6 years ago

kivo360 commented 6 years ago

I tried inserting a lot of records in a way that ignores duplicates on a certain key. To speed up the process I added everything to a multiprocessing pool, hoping I could send the records in faster (`insert_many` doesn't have an ignore option). However, I ran into problems: the database couldn't handle the multiple connections.
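
Roughly what I was trying, as a minimal sketch (the connection string, table name, and key column are placeholders):

```python
import dataset
from multiprocessing import Pool

DB_URL = "postgresql:///example"  # placeholder connection string

def insert_chunk(rows):
    # Each worker opens its own connection so no engine is shared
    # across process boundaries.
    db = dataset.connect(DB_URL)
    table = db["records"]  # placeholder table name
    for row in rows:
        # insert_ignore skips rows whose "key" already exists;
        # insert_many has no equivalent flag, hence row-by-row.
        table.insert_ignore(row, ["key"])

if __name__ == "__main__":
    # Placeholder data: ten chunks of 100 rows each.
    chunks = [[{"key": i, "value": i * 2} for i in range(j, j + 100)]
              for j in range(0, 1000, 100)]
    with Pool(4) as pool:
        pool.map(insert_chunk, chunks)
```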

I looked at the code and saw that connections were handled with a StaticPool. I was wondering why it doesn't queue connections with a hard limit instead of limiting them to a single connection.
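
For comparison, a minimal sketch of a bounded pool in plain SQLAlchemy (the URL and limits are placeholders):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# QueuePool with a hard limit, instead of StaticPool's single
# shared connection.
engine = create_engine(
    "postgresql:///example",
    poolclass=QueuePool,
    pool_size=5,      # persistent connections kept in the pool
    max_overflow=10,  # extra short-lived connections allowed on top
)
```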

pudo commented 6 years ago

Damn, dataset has now been around long enough for the answer to be "we've always done it this way" - i.e. I don't recall why the static pool was originally instated. I assume it has something to do with schema modifications.

Are you up for cutting it out and doing some testing? If the standard pooling works, we should definitely use it.

kivo360 commented 6 years ago

Not right now, I have a deadline coming up, but I will. I want to be able to use multiprocessing for inserts. I do some pre-processing before inserting, and the way dataset is configured, it's not letting me use multiple processes. I assume that's because there's a limit on the number of processes that can use the session.

It should be possible to put bounds checks in to make that happen, but I'm not sure.

kivo360 commented 6 years ago

After quickly digging into the code, it turns out it does. My problem actually wasn't the QueuePool; it functions as it should. The issue is that I'm not able to dispose of connections after they're used. We just keep adding connections over time and wait for the garbage collector to clean them up.
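
A sketch of what I mean, in plain SQLAlchemy (the URL is a placeholder):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///example")  # placeholder URL

# The context manager checks the connection back into the pool as
# soon as the block exits, instead of waiting for the GC.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

# dispose() closes all pooled connections, e.g. at worker shutdown
# or right after a fork.
engine.dispose()
```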

I noticed that dataset doesn't use sessions, and opts to use the engine everywhere to make transactions. Initially, I was going to try using a scoped session to handle the large number of connections for my thread pool futures, but I don't see a clean way to do it.

Here's the scoped session pattern: https://stackoverflow.com/questions/6297404/multi-threaded-use-of-sqlalchemy. Do you know of any way we could implement this?
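
From that answer, a minimal sketch of the pattern (the URL and the work inside the session are placeholders):

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("postgresql:///example")  # placeholder URL

# scoped_session hands each thread its own Session, keyed on the
# current thread by default.
Session = scoped_session(sessionmaker(bind=engine))

def worker():
    session = Session()  # thread-local session
    try:
        session.execute(text("SELECT 1"))  # stand-in for real work
        session.commit()
    finally:
        # Closes this thread's session and returns its connection
        # to the pool.
        Session.remove()
```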

kivo360 commented 6 years ago

@pudo I looked into the cause, and it could be the fact that dataset doesn't use sessions. Instead it creates a connection per thread, stored thread-locally. Ultimately, sharing the same engine across threads creates the problem.
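
A rough sketch of the pattern I mean, not dataset's actual code (the URL is a placeholder):

```python
import threading

from sqlalchemy import create_engine

engine = create_engine("postgresql:///example")  # placeholder URL
_local = threading.local()

def get_connection():
    # One connection cached per thread. Nothing ever closes these,
    # so connections accumulate as threads come and go, even though
    # the engine itself is shared.
    if not hasattr(_local, "conn"):
        _local.conn = engine.connect()
    return _local.conn
```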

I have a few looming deadlines and can't do this right now, but at some point in the next few weeks I'm going to figure out how to fix this in the cleanest way possible. A decent part of our codebase runs on dataset, so fixing it would probably be a good idea.