document IDs and efficiency

vassudanagunta commented 1 year ago

I am indexing pages on a static website with about a thousand pages and 25k terms. Search results obviously need to include the page path. Which is likely to be more efficient?

1. use pagePath as the doc ID
2. use a numeric sequence as the doc ID, and set pagePath as a stored field

I would normally expect #2 to result in a more efficient internal index structure, but I noticed that you end up mapping an internal zero-based ID to the user-specified ID, which can be of any type. That makes me think #1 will be more efficient.

I already said this in another issue, but I want to repeat it:

btw, MiniSearch looks amazing, and I'm impressed by your replies to users, and the care I see in your documentation. These last two tipped the scales when I had to decide which search solution to try first :)

lucaong commented 1 year ago

Hi @vassudanagunta, thanks for the kind words :)

In principle, if the pagePath uniquely identifies a document, you could follow approach 1. That said, approach 2 is also good, and the difference in performance between the two approaches should be negligible.

As you noticed, IDs are mapped to integers internally: this mapping is necessary to provide some features, like the discard method, and also to enable some optimizations. Therefore, ultimately it does not really matter what you use as an ID, as long as it uniquely identifies a document.
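
To illustrate why the choice of ID type matters so little, here is a minimal toy sketch of the idea. It is not MiniSearch's actual source: `ToyIndex`, its methods, and the sample paths are all hypothetical names made up for this example. The point is only that the term postings hold small internal integers in both cases, so a path string and a number cost the same once a document is added.

```typescript
// Toy sketch only, NOT MiniSearch's implementation. All names are hypothetical.
type ExternalId = string | number;

class ToyIndex {
  private nextShortId = 0;
  private readonly externalIds = new Map<number, ExternalId>(); // internal ID -> user ID
  private readonly shortIds = new Map<ExternalId, number>();    // user ID -> internal ID
  private readonly postings = new Map<string, Set<number>>();   // term -> internal IDs

  add(id: ExternalId, text: string): void {
    if (this.shortIds.has(id)) throw new Error(`duplicate ID: ${String(id)}`);
    const shortId = this.nextShortId++;
    this.shortIds.set(id, shortId);
    this.externalIds.set(shortId, id);
    for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      let docs = this.postings.get(term);
      if (!docs) {
        docs = new Set<number>();
        this.postings.set(term, docs);
      }
      docs.add(shortId); // the postings only ever see the integer
    }
  }

  // In this toy, discarding just drops the ID mapping;
  // stale postings are skipped at search time.
  discard(id: ExternalId): void {
    const shortId = this.shortIds.get(id);
    if (shortId === undefined) return;
    this.shortIds.delete(id);
    this.externalIds.delete(shortId);
  }

  search(term: string): ExternalId[] {
    const results: ExternalId[] = [];
    const docs = this.postings.get(term.toLowerCase());
    if (docs) {
      docs.forEach((shortId) => {
        const id = this.externalIds.get(shortId);
        if (id !== undefined) results.push(id); // skip discarded documents
      });
    }
    return results;
  }
}

// Approach 1: the page path is the ID.
const byPath = new ToyIndex();
byPath.add("/docs/intro/", "getting started with search");
byPath.add("/docs/api/", "search api reference");

// Approach 2: a numeric ID (the path would be kept as a stored field).
const byNumber = new ToyIndex();
byNumber.add(1, "getting started with search");
byNumber.add(2, "search api reference");

console.log(byPath.search("api"));   // one hit, identified by its path
console.log(byNumber.search("api")); // one hit, identified by its number

byPath.discard("/docs/api/");
console.log(byPath.search("api"));   // no hits after the discard
```

In both toy indexes the postings are identical sets of integers; the user-facing ID is only looked up when results are returned.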

lucaong commented 1 year ago

@vassudanagunta I will go ahead and close the issue, but feel free to comment further if necessary.

lucaong / minisearch

document IDs and efficiency #215