nuejs / nue

The Design Engineering Framework for the Web
https://nuejs.org
MIT License

`search` for documentation area #323

Open nobkd opened 3 months ago

nobkd commented 3 months ago

Add a feature to search through the pages of the created website. Maybe even just for Nue's documentation area (for now).

tipiirai commented 3 months ago

This is at the back of my mind, similar to the RSS feed. We should generate a search-index.json with h1, h1 + p, h2, h3, h4 information in it every time the site is pushed to production. The search widget would use this data to find what the user is looking for.
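
Roughly this kind of shape, perhaps (just a sketch; field names are not final):

[
  {
    "url": "/docs/how-it-works.html",
    "h1": "Understanding Nue",
    "p": "First paragraph under the h1 ...",
    "h2": ["Section heading ...", "Another section ..."],
    "h3": [],
    "h4": []
  }
]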

nobkd commented 3 months ago

I've seen Minisearch used somewhere: https://lucaong.github.io/minisearch/ Not sure if we would use it, but maybe it can serve as inspiration. I'd have to check it out myself; I just remembered it, probably because it is pretty small and has no dependencies.
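
For reference, basic MiniSearch usage is pretty compact. Roughly like this, per their docs (assuming we had the pages as an array of { id, title, text, url } objects):

import MiniSearch from 'minisearch';

// Index title and body text; keep title + url around for rendering results
const miniSearch = new MiniSearch({
    fields: ['title', 'text'],     // fields to index for full-text search
    storeFields: ['title', 'url']  // fields returned with each search result
});

miniSearch.addAll(pages);  // pages: [{ id, title, text, url }, ...]
const results = miniSearch.search('progressive', { prefix: true });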

JensRoland commented 2 days ago

Keeping in mind that Nue is pushing back against the "just fetch a massive bundle of data to the browser up front whether the user needs it or not" way of building web sites, I would argue that an in-memory client-side search is a poor fit.

The 'Nue Way(tm)' is more like: load only what the user actually needs, when they need it.

So maybe the way to add search is to build an index, split it into chunks, and load the chunks on demand. I am using a very similar approach on a live web site and it's incredibly performant.

I'll add an example implementation below, but the real question is -- how could Nue support something like this? It requires a way to extract the Markdown content as text, a way to specify a script to be run on build time, and a way for this script to write its resulting files to the dist folder. Nue doesn't currently have the hooks for this, right?
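
For illustration only, the kind of hook I mean would be something like this (completely hypothetical -- the name, signature and arguments are made up, nothing like this exists in Nue today):

// Hypothetical build hook, not an existing Nue API
export async function onSiteBuild({ pages, writeToDist }) {
    // `pages` would expose each page's title, slug and plain-text Markdown content
    // `writeToDist` would let the script write generated files into the dist folder
}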


Building the index: You'll want to generate two things, an article lookup file with titles and slugs:

{"articles":[
    {"t":"Understanding Nue","s":"/docs/how-it-works.html"},
    {"t":"Content-first approach in Nue","s":"/docs/content.html"}
]}

(note that we don't need an ID field since we can just use the array index)

and the search index itself. For simplicity, we will search words individually rather than n-grams, skip fuzzy matching, search by prefix only, display only titles as our results, and return at most 10 results for any given term. That is:

Searching "pro" will return articles containing "progressive" but neither "pork" nor "improv"

Now, at build time, we will construct the article lookup as a list and the search index as a simple lookup table from search term to at most 10 article IDs (i.e. array index pointers). Pseudocode:

const articles = getArticles();  // Load the articles with slug, title, and content (getArticles is a custom function)
let articleLookup = articles.map(({ slug, title }) => ({ t: title, s: slug })); // match the {t, s} shape shown above
saveToFile('articleLookup.json', JSON.stringify({ articles: articleLookup })); // saveToFile is a custom function

let searchIndex = Object.create(null);

// Helper: Lowercase, strip special chars, split by whitespace
const tokenize = text => text.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean);

let prefixCountPerArticle = Object.create(null);  // To calculate TF
for (const [articleId, { title, content }] of articles.entries()) {
    const words = tokenize(title + " " + content);
    let localPrefixCount = Object.create(null);

    for (let word of words) {
        // Index prefixes of 3..10 characters
        const maxPrefixLength = Math.min(word.length, 10);
        for (let i = 3; i <= maxPrefixLength; i++) {
            let prefix = word.slice(0, i);
            if (!localPrefixCount[prefix]) localPrefixCount[prefix] = 0;
            localPrefixCount[prefix] += 1; // Count term frequency in this article
        }
    }
    prefixCountPerArticle[articleId] = localPrefixCount;
}

// We now have the prefix counts per article for every prefix. This allows us to
// implement TF-IDF ranking of results.

// Calculate Inverse Document Frequency (IDF)
let docFrequency = Object.create(null);
for (const counts of Object.values(prefixCountPerArticle)) {
    for (const prefix of Object.keys(counts)) {
        if (!docFrequency[prefix]) docFrequency[prefix] = 0;
        docFrequency[prefix] += 1;
    }
}

// Calculate TF-IDF and build final search index
for (const [articleId, prefixCounts] of Object.entries(prefixCountPerArticle)) {
    for (const [prefix, tf] of Object.entries(prefixCounts)) {
        const idf = Math.log(articles.length / docFrequency[prefix]);
        const tfidf = tf * idf;

        if (!searchIndex[prefix]) searchIndex[prefix] = [];
        searchIndex[prefix].push({ articleId: parseInt(articleId), score: tfidf });
    }
}

// Keep only top 10 articles for each prefix based on TF-IDF score
for (const prefix in searchIndex) {
    searchIndex[prefix] = searchIndex[prefix]
        .sort((a, b) => b.score - a.score)  // Sort by score, descending
        .slice(0, 10)                       // Keep top 10
        .map(({ articleId }) => articleId); // Retain only article IDs
}
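
For the two example articles above, the resulting index would look roughly like this (illustrative entries; each prefix maps to article IDs ordered by TF-IDF score):

{
    "und": [0],
    "unde": [0],
    "con": [1],
    "cont": [1],
    "nue": [0, 1]
}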

That's a good start: it allows us to create a 'live search' that updates as the user types their search term, beginning at 3 characters. We will fetch the articleLookup as soon as the search box gets focus, and the search index when the user begins typing.
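
On the client, that focus-triggered prefetch could be as simple as this (a sketch; the #search selector and the file path are just assumptions):

let articleLookup = [];

// Warm up the article lookup the first time the search box gets focus
document.querySelector('#search').addEventListener('focus', async () => {
    if (articleLookup.length) return; // already loaded
    const data = await fetch('/search/articleLookup.json').then(res => res.json());
    articleLookup = data.articles;    // [{ t: title, s: slug }, ...]
});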

The search index itself, however, is too large to fetch in one piece, and we don't want to load the whole thing anyway. So let's chunk it based on the first few characters of the search term. We want the chunks to have a reasonable size and be roughly equal in size, so let's take the three initial characters and hash them. I usually like the xxh3 hash for this, but it's wasteful to load xxh3 in the browser just for this, and we want a static solution, so we'll roll a basic hash function (it doesn't need to be cryptographically strong, it just needs to be small and fast). Here goes:

const NUM_CHUNKS = 50;
let chunks = Array.from({ length: NUM_CHUNKS }, () => Object.create(null));

// Helper: very simple hash using the first 3 characters of the input string
// and modulo for the chunk index
const basicHash = str => [...str.slice(0, 3)].reduce((hash, c) => (hash * 31 + c.charCodeAt(0)) % NUM_CHUNKS, 0);

// Split the search index into small chunks that can be fetched on demand by the browser
for (const [prefix, articleIds] of Object.entries(searchIndex)) {
    const hashValue = basicHash(prefix);
    chunks[hashValue][prefix] = articleIds; // Add prefix entries to the chunk
}

// Serialize each chunk to a separate file
for (let i = 0; i < NUM_CHUNKS; i++) {
    const chunkData = JSON.stringify(chunks[i]);
    saveToFile(`searchIndex_chunk_${i}.json`, chunkData); // saveToFile is a custom function
}
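
Note that because basicHash only looks at the first three characters, every prefix of the same word ("pro", "prog", "progre", ...) lands in the same chunk, so the browser never needs more than one chunk fetch per query word, no matter how many characters the user has typed.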

Now, for the browser side of things, when the user begins typing in the search box, we simply call the following code:

const NUM_CHUNKS = 50;
// Helpers
const tokenize = text => text.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean);
const basicHash = str => [...str.slice(0, 3)].reduce((hash, c) => (hash * 31 + c.charCodeAt(0)) % NUM_CHUNKS, 0);

async function loadChunk(prefix) {
    const hashValue = basicHash(prefix);
    const chunkData = await fetch(`/path/to/searchIndex_chunk_${hashValue}.json`).then(res => res.json());
    return chunkData[prefix] || []; // Return articles for the prefix or empty if not found
}

// The search itself, call on keyUp in the search field, possibly with a debounce
async function liveSearch(input) {
    if (input.length < 3) return [];

    // Tokenize the input to get individual words
    const words = tokenize(input).filter(word => word.length >= 3);
    if (words.length === 0) return [];

    // Fetch results for each word and combine
    const wordResults = await Promise.all(words.map(async word => {
        const prefix = word.slice(0, 10);           // use the longest indexed prefix (the index stores 3..10 chars)
        const articleIds = await loadChunk(prefix); // basicHash only uses the first 3 chars, so this still hits the right chunk
        return new Set(articleIds);
    }));

    // Intersect results to get articles that match all words
    let combinedResults = wordResults.reduce((acc, ids) => {
        if (acc === null) return ids; // Initialize with the first word's results
        return new Set([...acc].filter(id => ids.has(id))); // Intersection of sets
    }, null);

    // Look up articles (the { t, s } entries loaded when the search box got focus) for display
    const results = Array.from(combinedResults || []).map(id => articleLookup[id]);
    return results;
}

All that's left, then, is to display the results.
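
As a sketch (assuming a #search input and a #results list, and the { t, s } fields from the lookup file):

const input = document.querySelector('#search');
const resultsEl = document.querySelector('#results');
let timer;

input.addEventListener('keyup', () => {
    clearTimeout(timer);
    timer = setTimeout(async () => {  // small debounce so we don't fetch on every keystroke
        const results = await liveSearch(input.value);
        resultsEl.innerHTML = results
            .map(({ t, s }) => `<li><a href="${s}">${t}</a></li>`)
            .join('');
    }, 150);
});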