internetarchive / Zeno

State-of-the-art web crawler 🔱
GNU Affero General Public License v3.0

Reuse free space from popped items #111

Open equals215 opened 3 months ago

equals215 commented 3 months ago

The goal of this PR is to drastically slow down the growth of the queue by reusing disk space from popped items. The freeSpace index will use a lock-free size-specific slot array (LSSA) for common item sizes (to be determined by analyzing existing indexes) and a stratified list for uncommon free-space sizes. I'm also thinking of a defragmentation algorithm ~~and a way to store freeSpace index operations into the WAL~~ <- the freeSpace index is derived from index adds and pops...
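To make the two-tier idea concrete, here is a rough sketch, not the PR's actual code. Every name in it (`FreeSpaceIndex`, `freeBlock`, `Release`, `Acquire`) is hypothetical. For simplicity it uses a mutex instead of a lock-free structure, and a size-sorted slice with best-fit lookup as a stand-in for the stratified list.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// freeBlock describes a reusable region of the queue file (hypothetical type).
type freeBlock struct {
	Offset uint64
	Size   uint64
}

// FreeSpaceIndex is a hypothetical in-memory index of reusable regions.
// Common sizes get an exact-size slot stack; everything else goes into
// a slice kept sorted by size. A mutex replaces the lock-free design here.
type FreeSpaceIndex struct {
	mu       sync.Mutex
	common   map[uint64][]uint64 // exact size -> stack of free offsets
	uncommon []freeBlock         // sorted by Size, ascending
}

// NewFreeSpaceIndex registers the sizes that get a dedicated slot stack.
func NewFreeSpaceIndex(commonSizes []uint64) *FreeSpaceIndex {
	idx := &FreeSpaceIndex{common: make(map[uint64][]uint64)}
	for _, s := range commonSizes {
		idx.common[s] = nil
	}
	return idx
}

// Release records the space left behind by a popped item.
func (f *FreeSpaceIndex) Release(offset, size uint64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if slots, ok := f.common[size]; ok {
		f.common[size] = append(slots, offset)
		return
	}
	// Insert into the sorted slice at the first position with Size >= size.
	i := sort.Search(len(f.uncommon), func(i int) bool { return f.uncommon[i].Size >= size })
	f.uncommon = append(f.uncommon, freeBlock{})
	copy(f.uncommon[i+1:], f.uncommon[i:])
	f.uncommon[i] = freeBlock{Offset: offset, Size: size}
}

// Acquire returns the offset of a free region of at least size bytes.
// It returns false when nothing fits, in which case the caller would
// append to the end of the file as before.
func (f *FreeSpaceIndex) Acquire(size uint64) (uint64, bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if slots := f.common[size]; len(slots) > 0 {
		off := slots[len(slots)-1]
		f.common[size] = slots[:len(slots)-1]
		return off, true
	}
	// Best fit: the smallest uncommon block that is large enough.
	i := sort.Search(len(f.uncommon), func(i int) bool { return f.uncommon[i].Size >= size })
	if i == len(f.uncommon) {
		return 0, false
	}
	off := f.uncommon[i].Offset
	f.uncommon = append(f.uncommon[:i], f.uncommon[i+1:]...)
	return off, true
}

func main() {
	idx := NewFreeSpaceIndex([]uint64{64, 128})
	idx.Release(4096, 128) // a popped 128-byte item
	idx.Release(8192, 300) // a popped item of an uncommon size

	off, ok := idx.Acquire(128)
	fmt.Println(off, ok) // 4096 true
	off, ok = idx.Acquire(200)
	fmt.Println(off, ok) // 8192 true
}
```

In this sketch, when a larger uncommon block serves a smaller request, the leftover bytes are simply dropped. That is the kind of waste a defragmentation pass would have to deal with.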

equals215 commented 3 months ago

@CorentinB throw out everything that comes to mind related to that matter. Any ideas, features, must-have/must-do, warnings. Everything.

CorentinB commented 3 months ago

Do we want to have an option to disable this? (in order to save some disk I/O when we know the crawl will be short and we don't care about saving some disk space while it runs)

equals215 commented 3 months ago

> Do we want to have an option to disable this? (in order to save some disk I/O when we know the crawl will be short and we don't care about saving some disk space while it runs)

It's fully in memory, gets dumped at the same time as the rest of the index, and is tracked through the queue index WAL: you can take the queue index add/pop operations and turn them into freeSpace index operations.

So yeah, we can make that optional, but it's in-memory, so there are no disk I/O-related performance issues.
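A minimal sketch of what deriving the freeSpace index from the WAL could look like. It assumes each WAL entry records an operation kind, an offset, and a size. The `walEntry` type and `replayFreeSpace` function are hypothetical, not Zeno's actual WAL format, and the sketch assumes an add at a freed offset consumes the whole region.

```go
package main

import "fmt"

type opKind int

const (
	opAdd opKind = iota
	opPop
)

// walEntry is a hypothetical queue index WAL record.
type walEntry struct {
	Kind   opKind
	Offset uint64
	Size   uint64
}

// replayFreeSpace rebuilds the set of free regions (offset -> size)
// purely from queue index operations: a pop frees a region, and an add
// written at a previously freed offset takes it back.
func replayFreeSpace(entries []walEntry) map[uint64]uint64 {
	free := make(map[uint64]uint64)
	for _, e := range entries {
		switch e.Kind {
		case opPop:
			free[e.Offset] = e.Size
		case opAdd:
			// Assumption: the add consumes the whole free region at this offset.
			delete(free, e.Offset)
		}
	}
	return free
}

func main() {
	wal := []walEntry{
		{opAdd, 0, 128},
		{opAdd, 128, 64},
		{opPop, 0, 128}, // frees [0, 128)
		{opAdd, 0, 128}, // reuses the freed region
		{opPop, 128, 64},
	}
	fmt.Println(replayFreeSpace(wal)) // map[128:64]
}
```

Since the replay only reads operations the WAL already records, nothing extra would need to be written to disk for the freeSpace index in this model.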