h2database / h2database

H2 is an embeddable RDBMS written in Java.
https://h2database.com
Other
4.23k stars 1.2k forks source link

Trying to understand housekeeping mechanism - poc provided #4096

Open fmarot opened 4 months ago

fmarot commented 4 months ago

Hello, I used to use a very old H2 version where LOBs where stored alongside the DB on the filesystem. This was good because when removing the data, the files were deleted at once and disk space was freed immediately. Now that I use latest H2 version LOBs are stored inside the main file. And I never ever see this DB file being shrunk: even after deleting many data from it. The DB just keep growing.

I developed a sample Proof Of Concept in order to showcase the problem. Maybe I do something wrong in my code and hopefully somebody will point it to me :) The POC is located here: https://github.com/fmarot/h2-housekeeping-poc

You can run it with: mvn clean package && java -cp "target/h2-housekeeping-poc-0.1-SNAPSHOT.jar;target/lib/*" com.teamtter.h2.poc.Main ./target/DB

It show that creating randomly large binaries in the DB and randomly deleting them makes the DB file grow. Even after pausing any reader or writer for 15 minutes when the DB file is twice larger that the payload, no shrinking of the DB file is observed.

Any explanation or hint to try is welcome.

manticore-projects commented 4 months ago

Greetings!

Thank you for this isolated test case, it's definitely worth to keep and run it for every version of H2. I can also see that you pull latest release already, which is good. Maybe consider testing against latest GIT sources too (I myself understand H2 very much as a rolling release).

And maybe add this parameter to your connection URL AUTO_COMPACT_FILL_RATE=90. It works nicely for me on 10+ GByte databases (although not LOB heavy.)

manticore-projects commented 4 months ago

One more thing: Maybe define "clean Payload" as DB Size after DEFRAG SHUTDOWN and I would expected the running DB size after housecleaning is a bit above that "clean Payload" (maybe factor 1.5x).

Real life example with lots of ETL and query/updates and indices (although no lobs):

(For me and my use case this work only reliable after the 2.2.224 release.)

fmarot commented 4 months ago

Hello thanks for the feedback, it was more or less what I understood from reading the issues on the subject. For me I prefer to use "clean payload" to refer to the raw content of the files/LOBs in the DB. As I put mostly LOBs, the clean content is roughly the size of the LOBs in the DB (those not deleted).

To be more precise on my interrogations: I really do not care if H2 takes 1.5x space to store the LOBs. I do not care if the size of the multiple files put in the DB amounts to X and H2 uses 2*X space. What bothers me is that the space occupied by now-deleted LOBs seems never to be reclaimed so the file grows indefinitely.

COMPACT SHUTDOWN is the only thing that works but it requires app downtime that I try to minimize. regards