piskvorky / sqlitedict

Persistent dict, backed by sqlite3 and pickle, multithread-safe.
Apache License 2.0
1.17k stars 131 forks source link

ENH: make multithreading support optional #82

Closed abast closed 4 years ago

abast commented 6 years ago

In the special case that I just open the database, read some item and close it again, the multi-threading support causes significant overhead. I would therefore suggest to make it optional.

How could this be achieved in a way that requires only small code changes?

piskvorky commented 6 years ago

Can you be more specific? What is "significant overhead"? In what scenario is this an issue — what's your use-case for using sqlitedict? Cheers.

iapyeh commented 6 years ago

+Me (Or make the single thread optional). The Sqlitedict is excellent, I use it intensively in many projects, one of them is in a single thread scenario and I am sure there is no concurrent read-write. Because the Queue is used for transferring between threads (the Queue natively pickles and un-pickles data), maybe a single thread sqlitedict would benefit from removing Queue's overhead.

piskvorky commented 6 years ago

Can you provide some hard performance numbers? What kind of throughput are you seeing, what would be your expectation / goal, how much are you looking to shave off?

We'll have to think twice before changing the library structure to accommodate this use-case, because it's not aligned with the sqlitedict objectives (concurrent thread access). If you have just a single thread, why not simply use the built-in sqlite?

abast commented 6 years ago

I got sqlitedict for the "Persistent dict, backed by sqlite3 and pickle" feature, not for the "multithread-safe" feature. For the scenario that I have described (open connection, read small amount of data, close connection), the overhead due to multithreading seems to be ~40%.

I am currently using the same sqlitedict with several processes. Since I had data corruption issues, I serialize the access to the sqlitedict using a file based lock. In this scenario, I need to close the connection as fast as possible to allow other processes to access the database and the described scenario becomes the usual access pattern.

However, I agree that if it can not be done with a small code change and is anyway against the sqlite objectives, it is probably not worth the effort and I should look into other solutions.

iapyeh commented 6 years ago

I use the sqlitedict when I have a huge over-memory dictionary (say 30GB in a 16GB desktop) and it should be persistent. The sqlitedict is an excellent solution.

menshikh-iv commented 6 years ago

@abast it will be nice if you provide some benchmark code

piskvorky commented 6 years ago

@menshikh-iv @abast any estimates on how hard would support for this be, in terms of needed refactoring? I'd expect the main overhead to be the pickling (pickled dict keys, values), not threading. And pickling is needed both for serial and multithreading, right?

@abast have you tried the built-in sqlite3 or shelve modules?

menshikh-iv commented 6 years ago

@piskvorky seems than need to do several things:

BTW we anyway do encode/decode for sqlite (unrelated to multithreading), I think add single-thread connector can speedup, but not 40%.

Mradr commented 4 years ago

So while not 40% faster - out of 1000 runs from a few quick test I was able to pull around 20 seconds with the multithread and about 16 seconds for single threaded. That is at least a 20% performance boost and you don't have to worry about the extra overhead of the thread or some applications not working because "its thread lock".

piskvorky commented 4 years ago

Thanks for the benchmark @Mradr. IMO a ~20% (or even 40%) speed up is not worth the added complexity in code paths, so I'm closing this.

@Mradr @iapyeh @abast If such boost makes sense to you, please subclass SqliteDict or implement your own extension of Sqlite.

rayluo commented 3 years ago

I got sqlitedict for the "Persistent dict, backed by sqlite3 and pickle" feature, ...

I am currently using the same sqlitedict with several processes. Since I had data corruption issues, I serialize the access to the sqlitedict using a file based lock. In this scenario, I need to close the connection as fast as possible to allow other processes to access the database and the described scenario becomes the usual access pattern.

However, I agree that if it can not be done with a small code change and is anyway against the sqlite objectives, it is probably not worth the effort and I should look into other solutions.

@abast , I encounter a situation similar to what you described above, i.e. concurrency among multi-processes. What kind of solutions that you ended up with?