Revamp parsing for more relevant results

Right now the vector search has an issue where it returns very small chunks for queries even when those chunks aren't relevant to the query.

This is a function of the current parsing algorithm, which breaks each page into chunks of 1000 tokens, with any remaining tokens forming the final chunk. So if a page is 1010 tokens, the first chunk would be 1000 tokens and the second 10 tokens. At scale, this means that there are a lot of chunks with trivial amounts of data.
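To make the problem concrete, here is a minimal sketch of the fixed-size behavior described above. It only computes chunk lengths and is not the project's actual parser; the function name `fixedSizeChunkLengths` and its `chunkSize` parameter are made up for illustration.

```typescript
// Hypothetical helper (not from the codebase): token count of each chunk
// when a page is split into fixed-size chunks and the remainder becomes
// the last chunk.
function fixedSizeChunkLengths(numTokens: number, chunkSize = 1000): number[] {
  const lengths: number[] = [];
  for (let remaining = numTokens; remaining > 0; remaining -= chunkSize) {
    // The final chunk gets whatever is left, however small.
    lengths.push(Math.min(chunkSize, remaining));
  }
  return lengths;
}

// A 1010-token page yields chunks of 1000 and 10 tokens.
console.log(fixedSizeChunkLengths(1010));
```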

As a potential solution to this, we could revamp the page chunking so that it doesn't produce any small chunks. A basic algorithm for this could look like:

let N = num tokens on a page

N < 100 -> don't index (too small)
100 <= N < 1000 -> N tokens per chunk
1000 <= N < 2000 -> N/2 tokens per chunk
2000 <= N < 3000 -> N/3 tokens per chunk
3000 <= N < 4000 -> N/4 tokens per chunk
...

With this chunking strategy, each chunk has at least 100 tokens. And for content from longer pages where N >= 1000, each chunk has N / (Math.floor(N/1000) + 1) tokens, which is always at least 500 and less than 1000.
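A rough sketch of the proposed rule, to make it easier to experiment with. The names `balancedChunkLengths`, `MIN_PAGE_TOKENS`, and `TOKENS_PER_BUCKET` are hypothetical, and handing the leftover tokens to the first few chunks when N doesn't divide evenly is my assumption, since the rule above doesn't say how to round.

```typescript
// Hypothetical constants mirroring the thresholds in the proposal.
const MIN_PAGE_TOKENS = 100;
const TOKENS_PER_BUCKET = 1000;

// Returns the token count of each chunk for a page with numTokens tokens.
// An empty array means the page is too small to index.
function balancedChunkLengths(numTokens: number): number[] {
  if (numTokens < MIN_PAGE_TOKENS) {
    return [];
  }
  // 100..999 -> 1 chunk, 1000..1999 -> 2 chunks, 2000..2999 -> 3 chunks, ...
  const numChunks = Math.floor(numTokens / TOKENS_PER_BUCKET) + 1;
  const base = Math.floor(numTokens / numChunks);
  const extra = numTokens % numChunks;
  // Assumption: the first `extra` chunks take one extra token each so the
  // lengths sum to numTokens.
  return Array.from({ length: numChunks }, (_, i) => base + (i < extra ? 1 : 0));
}

console.log(balancedChunkLengths(1010)); // two chunks of 505 tokens
console.log(balancedChunkLengths(2500)); // chunks of 834, 833, and 833 tokens
```

With this, the 1010-token page from the earlier example becomes two 505-token chunks instead of a 1000-token chunk and a 10-token chunk.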

Obviously we'd need to validate some behavior here, but I think it's a hypothesis worth testing.

mongodben / mongodb-oracle

Revamp parsing for more relevant results #46