senaite / senaite.archive

Records archiving for SENAITE LIMS
GNU General Public License v2.0

Archiving an entry increases the database file size instead of decreasing it #1

Open · eyahlin opened this issue 2 years ago

eyahlin commented 2 years ago

Steps to reproduce

  1. Check the database file size (ls -l of Data.fs in var/filestorage; see the sketch after these steps)
  2. Archive an entry
  3. Check the database file size again. It has increased.
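For reference, a minimal Python sketch of the size check in step 1, assuming the default buildout layout with Data.fs under var/filestorage (adjust the path if yours differs):

```python
# Minimal sketch: report the Data.fs size in MB.
# The path is an assumption based on a standard buildout layout.
import os

DATA_FS = "var/filestorage/Data.fs"
size_mb = os.path.getsize(DATA_FS) / (1024.0 * 1024.0)
print("Data.fs size: %.2f MB" % size_mb)
```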

Current behavior

Archiving increases the database file size.

Expected behavior

Archiving should decrease the database file size, as stated in the "About" section of the README:

[screenshot of the README "About" section]

Screenshot (optional)

[screenshot attached]

ramonski commented 2 years ago

Please make sure that you have packed your database before and after your test to ensure old transactions are removed.
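For reference, one way to pack interactively is from a Zope debug shell. This is only a sketch assuming a standard buildout with bin/instance; the ZMI (Control Panel > Database > Pack) or bin/zeopack for ZEO setups achieve the same:

```python
# Run inside "bin/instance debug": `app` is the Zope application root
# provided by the shell. Packing with days=0 removes all old revisions.
db = app._p_jar.db()
db.pack(days=0)
```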

eyahlin commented 2 years ago

Hello. The database in that screenshot had just been packed; archiving was actually the first write transaction I performed after the pack and the add-on installation.

eyahlin commented 2 years ago

Any ideas why the file size increased despite having packed the database?

xispa commented 2 years ago

How many objects (samples, worksheets, whatever) are we talking about? Maybe there are not enough objects stored/archived to see a significant difference.

eyahlin commented 2 years ago

I am planning to archive at least 300 samples, contained in at least 50 batches and at least 15 worksheets. This is just for the initial test data. When our retention period of 3 years kicks in, it will be significantly more (at least 10 times more). This is why I'm curious about the file size increase: if archiving ends up increasing the size, wouldn't it do more harm than good to performance?

eyahlin commented 2 years ago

Hello. What do you think?

xispa commented 2 years ago

> Hello. What do you think?

I think that with that few records you won't see a significant difference.

eyahlin commented 2 years ago

I can create a copy of my production environment (the Data.fs is 7 GB) right now to archive with more records. I could then provide you with a screenshot of the file sizes before and after archiving. How many records are enough to see a "significant difference"?

Also, is it not an issue that the Data.fs file size has increased after archiving? I believe the behavior directly contradicts the description of senaite.archive.

xispa commented 2 years ago

> I can create a copy of my production environment (the Data.fs is 7 GB) right now to archive with more records. I could then provide you with a screenshot of the file sizes before and after archiving. How many records are enough to see a "significant difference"?

Probably yes.

> Also, is it not an issue that the Data.fs file size has increased after archiving? I believe the behavior directly contradicts the description of senaite.archive.

Some background first: SENAITE uses an object-oriented database (the ZODB) that stores serialized objects. Direct searches against such a database are not performant, because the system would need to deserialize and wake up every single stored object and then check whether any of the values of its searchable fields match the search term. To overcome this, we make use of what is called a "catalog", which stores data extracted from the objects much like an SQL database would. We can then perform searches against the catalogs and, if we want to, wake up only the objects that match afterwards.
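As an illustrative sketch of the catalog idea only (the catalog and index names below are generic Plone examples, not SENAITE's actual sample catalogs):

```python
# Illustrative only: catalogs return lightweight "brains" built from indexed
# metadata; the full object is deserialized only when explicitly asked for.
from plone import api

catalog = api.portal.get_tool("portal_catalog")   # SENAITE uses its own catalogs
brains = catalog(portal_type="AnalysisRequest")   # cheap: metadata only

for brain in brains:
    print(brain.getId, brain.Title)   # read straight from catalog metadata
    obj = brain.getObject()           # expensive: wakes up the real object
```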

Archive creates a small object for each sample/worksheet/etc. before the original object is permanently removed from the database. Archive also creates a catalog where the metadata of these "small" objects is stored. This allows you to search for basic information from historic data. Besides, objects are removed only when they no longer reference other objects. For instance, a worksheet will only be deleted after all its analyses have been deleted.

As you can imagine, for a database with few objects, the overhead that comes with the archive machinery may cause the database to grow rather than shrink. The number of objects required to see a "difference" depends on the size of each stored object (a sample with the Remarks field filled weighs more than a sample without remarks set) and on the number of objects that are left in place because they still hold references to other objects.

For further info, the archiving and removal of old objects takes place here: https://github.com/senaite/senaite.archive/blob/1.x/src/senaite/archive/utils.py#L169
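A purely illustrative, self-contained model of that flow (not the actual senaite.archive code; all names below are hypothetical):

```python
# Hypothetical model of the archive flow described above: keep a small
# metadata record per archived object, and only allow removal of the
# original once it no longer references other objects.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ArchiveRecord:
    """Stand-in for the lightweight object kept per archived sample/worksheet."""
    uid: str
    portal_type: str
    title: str
    date_created: str

@dataclass
class FullObject:
    """Stand-in for a full SENAITE object; `references` mimics e.g. a
    worksheet still holding its analyses."""
    uid: str
    portal_type: str
    title: str
    date_created: str
    references: List[str] = field(default_factory=list)

def archive(obj: FullObject, archive_index: Dict[str, ArchiveRecord]) -> bool:
    """Store the searchable metadata and report whether the original can
    already be removed (only when it no longer references other objects)."""
    archive_index[obj.uid] = ArchiveRecord(
        uid=obj.uid,
        portal_type=obj.portal_type,
        title=obj.title,
        date_created=obj.date_created,
    )
    return not obj.references
```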

Hope it helps

eyahlin commented 2 years ago

Thanks for the explanation. I understand better now how archive works.

From what I understand, in order for the archive to "work", the database must be sufficiently large, with a lot of objects inside. My database in the screenshot is actually 7 GB and now contains more than 31,000 samples. Is this size still too small for archiving to be worth it?

I ask this question because I have just deployed senaite.archive in my production environment, and I want to know whether I should enable it or not.