preaction / Statocles

Static website CMS
http://preaction.me/statocles

Add a Search application #393

Open dertuxmalwieder opened 8 years ago

dertuxmalwieder commented 8 years ago

Hi,

(sorry, this one will be longish, but I want to make my points clear…)

having been in the process of transitioning from WordPress to some static (not just flat-file) blog for years now (I’m really lazy), I still haven’t settled on which system to use. Actually, I had found one which I thought would be perfect, but then I noticed that the provided solution for searching articles was not working as intended, especially since there was no way to use it without JavaScript. The most important feature of a blog system is a good search functionality, followed by a decent comment solution (but that’s a different thing).

So I’m back on track, looking for the perfect static blog solution. I already have a list of such systems which failed to work well for me (mostly theming- or feature-related issues), so I loosened my requirements a bit. I don’t even care which programming language is used anymore as long as it just works (as in it provides a good search function) and it’s a cool one (as in it’s not JavaScript).

Now here’s what I want:

The perfect static blog solution should, while generating the pages, keep some full-text index of the posts and provide a search function which could be accessed through the front-end (like the article listing but filtered by contents). In case this is already possible, please tell me how - I actually searched the docs and sources but I haven’t found such a functionality.

As I want you to actually consider this (I know, frequent) feature wish in the near future, I posted it to several interesting generators’ issue trackers, including yours. I’ll probably use whichever static site generator first comes up with a sufficient search functionality.

Thank you in advance.

preaction commented 8 years ago

I've been thinking about building an in-browser search tool in JavaScript. Indeed it wouldn't be difficult. It's not trivial, but it's pretty straightforward with clear solutions to the issues that will arise. The main issue will be the size of the index, since the browser will likely have to download the whole thing. There are possible solutions to that, such as using a prefix tree to shard the index so only the necessary parts are downloaded. But that requires the browser to load the index and perform the search, which requires JavaScript.

A static site with a search that does not require JavaScript or some other programming platform in the browser is not a static site anymore. That's likely why nobody has been able to provide the exact feature in the exact way you want: You're asking for something impossible. I've been working on a sibling project, https://github.com/preaction/Dynamocles, to provide some dynamic companion to a static site, but that still makes it a dynamic part of the site. The only other option is to use some search engine and limit the query to your site (like Google's "site:example.com" feature).

Frankly, I consider search to be a failure of navigation. Since the user's already using Google, and Google can't find the right page in your site, it's also a failure to provide good content. The existing search engines are a big reason why most sites do not need their own internal search function.

Finally, your method of requesting this feature (in whatever form) is rude. Free software is largely a volunteer effort: I work on it when I want to work on it. If you're interested in the feature sooner than I get to it, you can work on it yourself, and your work will be added to the project for future users to appreciate. Exhorting the developers of a bunch of projects into a race, with a prize of having to deal with more demands from you (after all, you're "really lazy"), doesn't sound like a good incentive. Indeed, if I win, it seems like I will only be rewarded with more trouble for myself and my contributors.

Likely Statocles will get a search. It will likely require the browser to have and enable JavaScript. I will not race to develop a feature.

dertuxmalwieder commented 8 years ago

I did not want to look rude, I just was cheeky here. :-)

> The main issue will be the size of the index, since the browser will likely have to download the whole thing.

Why would a server-side index with a Perl search function (similar to Sphinx) not work?

I agree with you that usually searching a site is a "Google failure", but I don't think it's a valid excuse to leave the visitor without a helping hand when Google (or your preferred search engine) fails to lead him to the correct page again. (And it often does.)

I don't "demand" many things, I'm even willing to contribute the missing functionality myself if no one else is. I just wondered why this has not been requested before...

preaction commented 8 years ago

> I did not want to look rude, I just was cheeky here. :-)

I'd avoid that. Since you do not know me, and I don't know you, you cannot be sure of my reaction to it.

> > The main issue will be the size of the index, since the browser will likely have to download the whole thing.
>
> Why would a server-side index with a Perl search function (similar to Sphinx) not work?

If you mean "Sphinx" the search engine, that's not static. If you mean "Sphinx" the Python documentation system, its search uses JavaScript in exactly the way I described. (See: http://stackoverflow.com/questions/605888/whats-the-search-engine-used-in-the-new-python-documentation)

dertuxmalwieder commented 8 years ago

Sorry then!

Does the index have to be the whole thing? Common phrases could be left out, for example; also, depending on your formatting, a "plain text index" could still be reasonably smaller than your website data. You'd only have to send it to clients when they want to start a search anyway, right?

preaction commented 8 years ago

Well, yes. You wouldn't index the stopwords (MySQL has a good list of stopwords to use, and it'd likely be a feature to add/remove stopwords from the list). You wouldn't index phrases, you'd use the index to match phrases (though you likely couldn't get features like "exact phrase match" unless you added those phrases to the index). The Apache Lucene project is a good reference for this kind of database.
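As a rough illustration of skipping stopwords before indexing, here is a minimal JavaScript sketch. The stopword list is a tiny hypothetical sample (not the MySQL list), and `tokenize` is a made-up helper name, not part of Statocles.

```javascript
// Hypothetical sample list; a real list would be much longer and
// ideally configurable so terms can be added or removed.
const STOPWORDS = new Set(["a", "an", "and", "the", "of", "to", "in", "is"]);

// Split text into lowercase terms and drop stopwords.
// Assumption: only ASCII letters and digits count as term characters.
function tokenize(text, stopwords = STOPWORDS) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((term) => term.length > 0 && !stopwords.has(term));
}
```

Only the terms that survive this filter would become keys in the index.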

The indices are only needed when you're searching, and browser caching means that the browser likely only needs to download them once in a while (the SO post about Sphinx goes into that).

So you'd end up with a "database" of JSON files that look like this, with the keys being terms, and the values being an array of pages where that term appears:

# /search/index.json
{
    "antipode": [
        "/blog/2015/01/01/chrono-trigger.html",
        "/blog/2015/03/05/new-spells.html",
        "/help/spells/antipode.html"
    ],
    "blast": [
        "/blog/2015/03/05/new-spells.html",
        "/help/spells/blast.html"
    ]
}

And if someone searched for antipode blast, you'd show them the /blog/2015/03/05/new-spells.html first (since it appears in both the antipode and blast arrays), followed by the rest of the matches from the antipode or blast arrays.
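A minimal sketch of that lookup in JavaScript, assuming the index has already been loaded into an object. The function name `search` and the tie-breaking order (pages keep the order in which they were first seen) are illustrative choices, not anything Statocles provides.

```javascript
const index = {
  antipode: [
    "/blog/2015/01/01/chrono-trigger.html",
    "/blog/2015/03/05/new-spells.html",
    "/help/spells/antipode.html",
  ],
  blast: ["/blog/2015/03/05/new-spells.html", "/help/spells/blast.html"],
};

// Return page paths, pages matching the most query terms first.
function search(index, query) {
  const terms = new Set(query.toLowerCase().split(/\s+/).filter(Boolean));
  const matched = new Map(); // path -> number of query terms on that page
  for (const term of terms) {
    // Guard against inherited keys such as "constructor".
    const has = Object.prototype.hasOwnProperty.call(index, term);
    for (const path of has ? index[term] : []) {
      matched.set(path, (matched.get(path) || 0) + 1);
    }
  }
  // Array.prototype.sort is stable, so ties keep first-seen order.
  return [...matched.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([path]) => path);
}
```

With this data, `search(index, "antipode blast")` puts `/blog/2015/03/05/new-spells.html` first, followed by the three pages that match only one term.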

To make the results better, you could count the number of times a term appears in the page to help score the results:

# /search/index.json
{
    "antipode": [ 
        [ "/blog/2015/01/01/chrono-trigger.html", 2 ],
        [ "/blog/2015/03/05/new-spells.html", 3 ],
        [ "/help/spells/antipode.html", 12 ]
    ],
    "blast": [
        [ "/blog/2015/03/05/new-spells.html", 5 ],
        [ "/help/spells/blast.html", 17 ]
    ]
}

So now you can use the scores to rank the results returned. If I wanted both antipode and blast, I should return /blog/2015/03/05/new-spells.html again. But if I only wanted blast, I should return /help/spells/blast.html first, as it has a higher score.
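One way to get exactly that ordering is sketched below in JavaScript. The ranking rule is an assumption: pages matching more query terms come first, and the summed counts break ties. Summing counts alone would not keep new-spells.html on top for the two-term query. `rankedSearch` is a hypothetical name.

```javascript
const scoredIndex = {
  antipode: [
    ["/blog/2015/01/01/chrono-trigger.html", 2],
    ["/blog/2015/03/05/new-spells.html", 3],
    ["/help/spells/antipode.html", 12],
  ],
  blast: [
    ["/blog/2015/03/05/new-spells.html", 5],
    ["/help/spells/blast.html", 17],
  ],
};

// Rank pages by number of matched terms, then by total occurrence count.
function rankedSearch(index, query) {
  const terms = new Set(query.toLowerCase().split(/\s+/).filter(Boolean));
  const hits = new Map(); // path -> { matched, score }
  for (const term of terms) {
    const has = Object.prototype.hasOwnProperty.call(index, term);
    for (const [path, count] of has ? index[term] : []) {
      const hit = hits.get(path) || { matched: 0, score: 0 };
      hit.matched += 1;
      hit.score += count;
      hits.set(path, hit);
    }
  }
  return [...hits.entries()]
    .sort((a, b) => b[1].matched - a[1].matched || b[1].score - a[1].score)
    .map(([path]) => path);
}
```

Here `rankedSearch(scoredIndex, "blast")` starts with `/help/spells/blast.html`, while `rankedSearch(scoredIndex, "antipode blast")` starts with `/blog/2015/03/05/new-spells.html`.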

There is lots of stuff that can be done here, but the general idea is that all the information is stored in the index. Likely that means your index will need to also store the things you want to show on the search results, like the title of the page, the description, and maybe an excerpt that matches the term.
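To make the "everything lives in the index" idea concrete, here is a hedged JavaScript sketch of a build step that records both term counts and page titles. The two-part layout (`terms` plus `pages`) and the input fields `path`, `title`, and `text` are assumptions for illustration. In Statocles itself this step would presumably be written in Perl.

```javascript
// Build { terms, pages } from a list of page objects.
function buildIndex(pageList) {
  // Null-prototype objects avoid clashes with keys like "constructor".
  const terms = Object.create(null); // term -> [[path, count], ...]
  const pages = Object.create(null); // path -> { title }
  for (const page of pageList) {
    pages[page.path] = { title: page.title };
    const counts = new Map(); // term -> occurrences on this page
    for (const word of page.text.toLowerCase().split(/[^a-z0-9]+/)) {
      if (word) counts.set(word, (counts.get(word) || 0) + 1);
    }
    for (const [term, count] of counts) {
      if (!terms[term]) terms[term] = [];
      terms[term].push([page.path, count]);
    }
  }
  return { terms, pages };
}
```

The result can be written out with `JSON.stringify`, and the search code can then look up each result's title in `pages` without fetching the page itself.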

Finally, as I alluded to, you could organize your database into multiple files like so:

# /search/index/a.json
{
    "antipode": [ 
        [ "/blog/2015/01/01/chrono-trigger.html", 2 ],
        [ "/blog/2015/03/05/new-spells.html", 3 ],
        [ "/help/spells/antipode.html", 12 ]
    ]
}
# /search/index/b.json
{
    "blast": [
        [ "/blog/2015/03/05/new-spells.html", 5 ],
        [ "/help/spells/blast.html", 17 ]
    ]
}

So now if I just search for blast, the search engine knows it only needs to load the b.json index file, reducing the amount of data it needs to fetch.
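A small JavaScript sketch of how the browser side might pick shards, assuming one file per first character under `/search/index/`. `shardPaths` and `loadShards` are hypothetical names, and `fetchJson` is passed in by the caller so the sketch does not depend on any particular loading mechanism.

```javascript
// Work out which shard files a query needs (one per distinct first character).
function shardPaths(query, base = "/search/index") {
  const firstChars = new Set();
  for (const term of query.toLowerCase().split(/\s+/).filter(Boolean)) {
    firstChars.add(term[0]);
  }
  return [...firstChars].sort().map((ch) => `${base}/${ch}.json`);
}

// Fetch only the needed shards and merge them into one term -> pages object.
// fetchJson(url) must return a Promise of the parsed JSON for that URL.
async function loadShards(query, fetchJson) {
  const shards = await Promise.all(
    shardPaths(query).map((url) => fetchJson(url))
  );
  return Object.assign({}, ...shards);
}
```

In a browser, `fetchJson` could be `(url) => fetch(url).then((res) => res.json())`, which also lets normal HTTP caching avoid repeated downloads.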

It doesn't have to be JSON. You could make something more compact, but since the browser will be optimized to parse JSON, that's likely the best choice for now.

So it'd be up to the Statocles search application to build the index, deploy the index with the rest of the site, and serve the JavaScript file that knows how to use the index. Then it's the browser's job to execute the JavaScript, which loads the indices it needs, and shows the user the search results.

Now you've got a search engine. There's a whole lot more you could do (richer data structures, more query options, optimizations), but the basic functionality is there, completely static and in the browser. If you wanted users to be able to restrict results by path, for example, you could post-process the list of results to restrict to only those matching the desired path. If that ends up with no results, you could remove one of the terms and show the user the results from that search (which Google has taken to doing if a search doesn't match enough things). All this is possible without a dynamic server, though there are going to be performance implications. Indeed, only recently has the general level of JavaScript performance been high enough to do fun things like this, but now we've also got more numerous, slower mobile devices to consider.
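The path restriction and the "drop a term and retry" fallback could look roughly like this in JavaScript. This is a sketch over the simple term-to-paths index. Which term to drop is an arbitrary assumption here (the last one), and all function names are made up.

```javascript
// Pages that contain every one of the given terms.
function searchAll(index, terms) {
  if (terms.length === 0) return [];
  const lists = terms.map((term) =>
    Object.prototype.hasOwnProperty.call(index, term) ? index[term] : []
  );
  return lists[0].filter((path) => lists.every((list) => list.includes(path)));
}

// Keep only results under the given path prefix.
function restrictToPath(results, prefix) {
  return results.filter((path) => path.startsWith(prefix));
}

// Search for all terms under a prefix; if nothing matches, drop the
// last term and try again until something matches or no terms remain.
function searchWithFallback(index, terms, prefix = "/") {
  for (let n = terms.length; n > 0; n--) {
    const used = terms.slice(0, n);
    const results = restrictToPath(searchAll(index, used), prefix);
    if (results.length > 0) return { terms: used, results };
  }
  return { terms: [], results: [] };
}
```

Returning the terms that were actually used lets the page tell the user that the query was relaxed.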