phiresky / sqlite-zstd

Transparent dictionary-based row-level compression for SQLite
GNU Lesser General Public License v3.0

.load sqlite_zstd.dll results in Error: The specified procedure could not be found. #5

Closed · voltagex closed 2 years ago

voltagex commented 2 years ago
C:\git\test-sqlite-data> sqlite3
SQLite version 3.37.1 2021-12-30 15:30:28
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load sqlite_zstd.dll
Error: The specified procedure could not be found.

Built sqlite_zstd.dll via cargo build --release, using rustc 1.60.0 (7737e0b5c 2022-04-04)
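
For reference, the build itself is just the standard cargo flow (paths below are cargo defaults, nothing specific to this project):

cargo build --release
# the loadable library lands in target/release/
# (sqlite_zstd.dll on Windows, libsqlite_zstd.so on Linux)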

sgmihai commented 2 years ago

I built it with the command you suggested (there is no info about compiling in the readme), and I get this error:

.load /home/user//libsqlite_zstd.so
Error: /home/user/libsqlite_zstd.so: undefined symbol: sqlite3_sqlitezstd_init
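
One way to rule out sqlite3's filename-based entry-point guess is to pass the entry point explicitly as the second argument to .load (the symbol name below is the one sqlite3 derived in the error above):

.load /home/user/libsqlite_zstd.so sqlite3_sqlitezstd_init

If that fails with the same undefined-symbol error, the build genuinely doesn't export the symbol.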

It would be nice to get a version bump, as important fixes have been made since the last release.

phiresky commented 2 years ago

I didn't even realize there was a released version people might try to use, sorry. I've deleted "0.1.1". There's no stable release of this library (yet).

phiresky commented 2 years ago

I've published a new version (0.3.0) that's current and shouldn't have this issue.

sgmihai commented 2 years ago

Thank you. I could now run this without issues:

SELECT
  zstd_enable_transparent('{"table": "title_basics", "column": "titleType", "compression_level": 19, "dict_chooser": "''a''"}'),
  zstd_enable_transparent('{"table": "title_basics", "column": "primaryTitle", "compression_level": 19, "dict_chooser": "''a''"}'),
  zstd_enable_transparent('{"table": "title_basics", "column": "originalTitle", "compression_level": 19, "dict_chooser": "''a''"}'),
  zstd_enable_transparent('{"table": "title_basics", "column": "genres", "compression_level": 19, "dict_chooser": "''a''"}')

After that I ran:

SELECT dbsql.exec(sql("""SELECT zstd_incremental_maintenance(3600, 1)"""))

and got this output:

[2022-07-17T14:58:49Z INFO sqlite_zstd::transparent] title_basics.titleType: Total 9074805 rows (74.25MB) to potentially compress.
[2022-07-17T15:01:52Z INFO sqlite_zstd::transparent] Compressed 9074805 rows with dict_choice=a (dict_id=1). Total size of entries before: 74.25MB, afterwards: 119.45MB, (average: before=8B, after=13B)
[2022-07-17T15:01:52Z INFO sqlite_zstd::transparent] Handled 9074805 / 9074805 rows (74.25MB / 74.25MB)
[2022-07-17T15:01:55Z INFO sqlite_zstd::transparent] title_basics.primaryTitle: Total 9074805 rows (178.08MB) to potentially compress.
[2022-07-17T15:04:32Z INFO sqlite_zstd::transparent] Handled 9074805 / 9074805 rows (178.08MB / 178.08MB)
[2022-07-17T15:04:36Z INFO sqlite_zstd::transparent] title_basics.originalTitle: Total 9074805 rows (178.08MB) to potentially compress.
[2022-07-17T15:07:13Z INFO sqlite_zstd::transparent] Handled 9074805 / 9074805 rows (178.08MB / 178.08MB)
[2022-07-17T15:07:16Z INFO sqlite_zstd::transparent] title_basics.genres: Total 9074805 rows (98.20MB) to potentially compress.
[2022-07-17T15:09:37Z INFO sqlite_zstd::transparent] Handled 9074805 / 9074805 rows (98.20MB / 98.20MB)
[2022-07-17T15:09:37Z INFO sqlite_zstd::transparent] All maintenance work completed!

Database size is now 1.4GB vs. 830MB before. What am I missing? I am trying to optimize this dataset for size: https://datasets.imdbws.com/title.basics.tsv.gz, along with others present there.

phiresky commented 2 years ago

Firstly, you need to run VACUUM after doing the maintenance, otherwise SQLite will not actually free any disk space. Secondly, this library probably adds at least 5 bytes of overhead per row per compressed column, so it will only bring an advantage for columns where the average value is somewhat long (e.g. longer than 20 bytes).
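
For reference, the full sequence (using only calls that already appear in this thread) is:

SELECT zstd_incremental_maintenance(3600, 1);
VACUUM; -- without this, SQLite keeps the freed pages inside the file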

In your example, "(average: before=8B, after=13B)", one of the columns averages only 8 bytes per value, which won't get any smaller by compressing it with the current version of zstd.

Some of that overhead comes from zstd not being optimal for extremely short inputs (<20 bytes); some of it is just the way I handle robustness in how the compression information is stored per row. So if you have a ton of rows but each row is pretty small, this library (in its current state) won't gain you anything, sorry. Try only enabling the compression on the largest columns.
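
A quick way to check which columns clear that bar, in plain SQLite (column names as used earlier in this thread):

SELECT AVG(LENGTH(primaryTitle)), AVG(LENGTH(originalTitle)),
       AVG(LENGTH(genres)), AVG(LENGTH(titleType))
FROM title_basics;
-- only columns averaging well above ~20 bytes are worthwhile candidates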

sgmihai commented 2 years ago

That makes sense, thanks for clearing that up.

I tried the new lib.so you released, and only applied the compression to the 3 columns that are worth compressing. Problem is, it takes many hours for just one column (178MB); I see it does about 6MB of data per hour (8-core Ryzen). This is kind of ridiculous, since I am planning on having this database embedded in the program I will ship (hopefully, one day). After one column was finished and I ran VACUUM, I got a 1GB file (vs. the 830MB original). Maybe it will be better once all 3 columns are compressed?

In DBeaver it looks like this: https://i.imgur.com/b83TKUc.png https://i.imgur.com/Qa8LNL8.png

I'm not sure how that value is stored internally, but the column being varchar, if it is stored as a string and not as a numerical value, then it's obvious why the file is bigger afterwards.
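
As an aside, independent of any regression: compression_level 19 is near the top of zstd's range and is inherently slow. If trading some ratio for speed is acceptable, a lower level is worth a try (a sketch reusing the exact call from this thread, with only the level changed):

SELECT zstd_enable_transparent('{"table": "title_basics", "column": "primaryTitle", "compression_level": 3, "dict_chooser": "''a''"}');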

phiresky commented 2 years ago

I noticed that as well (it takes much longer than it should). Did the same happen with the older version, or is this a regression? It should not be like this.

sgmihai commented 2 years ago

No, the older version was very zippy; it was done in a few minutes at most.

phiresky commented 2 years ago

Hey, I've released 0.3.1, which should fix the perf issue.

I also tried the table you linked, with the following results: all in all, only the primaryTitle column is worth compressing with the current version of sqlite-zstd. After compressing that, the database size for me shrinks from 816MiB to 770MiB, which I'd classify as "not worth it".

It should in theory be possible to make the whole thing more efficient (e.g. the genres column is very redundant and should only take maybe 2 bytes to store), but I probably won't work on that kind of optimization in the near future.
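
A manual alternative along those lines, outside this library, would be interning the redundant strings yourself; a sketch in plain SQLite, where genres_dict and genres_id are hypothetical names:

CREATE TABLE genres_dict (id INTEGER PRIMARY KEY, genres TEXT UNIQUE);
INSERT INTO genres_dict (genres) SELECT DISTINCT genres FROM title_basics;
ALTER TABLE title_basics ADD COLUMN genres_id INTEGER;
-- store a small integer key in place of the repeated string
UPDATE title_basics SET genres_id =
  (SELECT id FROM genres_dict WHERE genres_dict.genres = title_basics.genres);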

sgmihai commented 2 years ago

OK, that fixed the speed, but I am still getting a 1.1GB file vs. 830MB. The only modification I made to the db was converting the "tt#number" ids to INT. Any ideas what I am still missing?

SELECT
  zstd_enable_transparent('{"table": "title_basics", "column": "primaryTitle", "compression_level": 19, "dict_chooser": "''a''"}'),
  zstd_enable_transparent('{"table": "title_basics", "column": "originalTitle", "compression_level": 19, "dict_chooser": "''a''"}'),
  zstd_enable_transparent('{"table": "title_basics", "column": "genres", "compression_level": 19, "dict_chooser": "''a''"}');
sqlite> SELECT zstd_incremental_maintenance(3600, 1);

[2022-07-19T20:03:27Z INFO sqlite_zstd::transparent] title_basics.primaryTitle: Total 9079718 rows (178.18MB) to potentially compress (split in 1 groups).
[2022-07-19T20:06:57Z INFO sqlite_zstd::transparent] Compressed 9079718 rows with dict_choice=a (dict_id=1). Total size of entries before: 178.18MB, afterwards: 166.84MB, (average: before=19B, after=18B)
[2022-07-19T20:06:57Z INFO sqlite_zstd::transparent] Handled 9079718 / 9079718 rows (178.18MB / 178.18MB)
[2022-07-19T20:07:00Z INFO sqlite_zstd::transparent] title_basics.originalTitle: Total 9079718 rows (178.18MB) to potentially compress (split in 1 groups).
[2022-07-19T20:09:21Z INFO sqlite_zstd::transparent] Handled 9079718 / 9079718 rows (178.18MB / 178.18MB)
[2022-07-19T20:09:24Z INFO sqlite_zstd::transparent] title_basics.genres: Total 9079718 rows (98.25MB) to potentially compress (split in 1 groups).
[2022-07-19T20:10:37Z INFO sqlite_zstd::transparent] Handled 9079718 / 9079718 rows (98.25MB / 98.25MB)
[2022-07-19T20:10:37Z INFO sqlite_zstd::transparent] All maintenance work completed!

sqlite> VACUUM;

phiresky commented 2 years ago

Compressing originalTitle and genres will increase the db size due to the added overhead; only compressing primaryTitle will decrease it.
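
In other words, the only configuration that pays off on this dataset is the minimal one (all calls as used earlier in this thread):

SELECT zstd_enable_transparent('{"table": "title_basics", "column": "primaryTitle", "compression_level": 19, "dict_chooser": "''a''"}');
SELECT zstd_incremental_maintenance(3600, 1);
VACUUM;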