lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
GNU Affero General Public License v3.0
790 stars 56 forks source link

Add warnings when the index size exceeds work_mem #200

Closed ezra-varady closed 1 year ago

ezra-varady commented 1 year ago

This adds checks for the work_mem and maintenance_work_mem GUC variables. Checks for both are fairly uncommon in the upstream extensions. It seems like maintenance_work_mem is checked more often than work_mem this may be because it's greater size makes it less prohibitive to respect. Neither is actually enforced by postgres. I'll list some examples below

maintenance_work_mem is referenced more infrequently but its use is pretty straightforward when it is in src/backend/access/gin/gininsert.c in the build callback function maintenance_work_mem is checked

if (buildstate->accum.allocatedMemory >= (Size) maintenance_work_mem * 1024L)

similarly nbtree checks it in its parallel build functions src/backend/access/nbtree/nbtsort.c

 sortmem = maintenance_work_mem / btshared->scantuplesortstates;
 _bt_parallel_scan_and_sort(btspool, btspool2, btshared, sharedsort,
                                               sharedsort2, sortmem, false);

It is also checked in src/backend/commands/vacuumparallel.c. It is never checked in contrib. I think the bloom in the earlier listing refers to an internal bloomfilter, not the extension. Notably though pgvector does have a check

work_mem is also checked very infrequently although within the optimizer/executor there are a number of checks. in src/backend/access/gin/ginfast.c it gets checked

workMemory = work_mem;
...
/*
* Is it time to flush memory to disk?  Flush if we are at the end of
* the pending list, or if we have a full row and memory is getting
* full.
*/
if (GinPageGetOpaque(page)->rightlink == InvalidBlockNumber ||
    (GinPageHasFullRow(page) &&
    (accum.allocatedMemory >= workMemory * 1024L)))

it gets checked in src/backend/access/nbtree/nbtpage.c as well, albeit in a function that is only called during vacuums

maxbufsize = (work_mem * 1024L) / sizeof(BTPendingFSM);
maxbufsize = Min(maxbufsize, INT_MAX);
maxbufsize = Min(maxbufsize, MaxAllocSize / sizeof(BTPendingFSM));
/* Stay sane with small work_mem */
maxbufsize = Max(maxbufsize, vstate->bufsize);
vstate->maxbufsize = maxbufsize;

I have however found the following calling pattern in several places including contrib/tablefunc/tablefunc.c, contrib/dblink/dblink.c, contrib/adminpack/adminpack.c and also pgvector (albeit only in ivfscans)

tuplestore_begin_heap(random_access, false, work_mem);

this seems to be a data structure that holds tuples to be returned by a scan. It doesn't account for memory allocated elsewhere though. Overall the lack of enforcement seems to make checking these values somewhat uncommon. I think it makes sense to enforce maintenance_work_mem because building an index is relatively infrequent, but maybe enforcing runtime checks for work_mem is overkill

ezra-varady commented 1 year ago

Rebased onto main. Broke out code to check memory use into a helper function in utils.c. I added code to include memory allocated in the current context and it's children in these calculations for versions of postgres greater than 12. Before pg13 there's not a clear way to check how much memory has been allocated by a single context. The call site in the node retriever will get called a lot. It may be worth trying to optimize a bit around this

Ngalstyan4 commented 1 year ago

Looks good! moved to #205. Will change one test, check code coverage and merge from there