Damn, dataset has now been around long enough for the answer to be "we've always done it this way" - i.e. I don't recall why the static pool was originally instated. I assume it has something to do with schema modifications.
Are you up for cutting it out and doing some testing? If the standard pool classes work, we should definitely use them.
Not right now, deadline coming up, but I will. I want to be able to use multiprocessing for inserts. I do some pre-processing before inserting, and the way dataset is configured, it's not letting me use multiple processes. I assume it's because there's a limit to the number of processes that are able to use the session.
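Roughly what I'm trying to do, as a sketch: give each worker process its own dataset connection instead of sharing one engine. The URL, table name, key column, and preprocessing step below are placeholders for my actual setup:

```python
import dataset
from multiprocessing import Pool

DB_URL = "sqlite:///example.db"  # placeholder connection string

_db = None  # one connection (and engine) per worker process

def init_worker():
    global _db
    _db = dataset.connect(DB_URL)

def preprocess(record):
    # stand-in for the real pre-processing step
    return record

def insert_record(record):
    # insert_ignore skips rows whose "key" already exists
    _db["records"].insert_ignore(preprocess(record), ["key"])

if __name__ == "__main__":
    records = [{"key": i, "value": i * 2} for i in range(1000)]
    with Pool(processes=4, initializer=init_worker) as pool:
        pool.map(insert_record, records)
```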
It should be possible to put bounds checks in place to make that happen, but I'm not sure.
After quickly digging into the code, it turns out it does. My problem actually wasn't the pool; it functions like it needs to. The issue is that I'm not able to dispose of connections after they're used. We just keep adding connections over time and wait for the garbage collector to clean them up.
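For comparison, plain SQLAlchemy lets you release connections deterministically. A minimal sketch with a placeholder URL, showing the behavior I'd want rather than what dataset does today:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.db")  # placeholder URL

# The context manager returns the connection to the pool as soon as the
# block exits, instead of waiting on the garbage collector.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

# dispose() closes every pooled connection outright, e.g. after a batch run.
engine.dispose()
```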
I noticed that dataset doesn't use sessions, and opts to use the engine everywhere to make transactions. Initially, I was going to try using a scoped session to handle the large number of connections for my threadpool futures, but I don't see a clean way to do it.
Here's the scoped-session approach I mean. Do you know of any way we could implement it? https://stackoverflow.com/questions/6297404/multi-threaded-use-of-sqlalchemy
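Something along these lines, sketched from that answer (the URL and the worker body are placeholders):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///example.db")  # placeholder URL

# scoped_session hands each thread its own Session while sharing one engine,
# which is the pattern the Stack Overflow answer describes.
Session = scoped_session(sessionmaker(bind=engine))

def worker():
    session = Session()   # thread-local session
    try:
        # ... insert/query through session here ...
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        Session.remove()  # close the session, return its connection to the pool
```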
@pudo I looked into the cause, and I think it could be the fact that dataset doesn't use sessions. Instead it stores a connection on a thread-local. Ultimately, sharing the same engine across workers creates the problem.
I have a few looming deadlines, and I can't do this right now, but at some point in the next few weeks I'm going to figure out how to fix this in the cleanest way possible. A decent part of our codebase runs on dataset so that would probably be a good idea.
I tried inserting a lot of records while ignoring duplicates on a certain key. To speed up the process I added everything to a multiprocessing pool, with the hope that I could send the records in faster (insert_many doesn't have an ignore option). However, I ran into problems with this: the database couldn't handle multiple connections.
I looked at the code and saw that connections were handled using a StaticPool. I was wondering: why doesn't it queue connections with a hard limit, instead of limiting the connections to one?
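For example, something like this is what I had in mind (a sketch with a placeholder URL, not how dataset is wired up today):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# QueuePool keeps up to pool_size persistent connections plus max_overflow
# temporary ones, and blocks further checkouts until a connection is
# returned -- i.e. it queues connections under a hard limit.
engine = create_engine(
    "postgresql://localhost/mydb",  # placeholder URL
    poolclass=QueuePool,
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,  # seconds to wait for a free connection
)
```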